Sgardine writesup Brin 1999

This is a review of the paper Brin_1999_extracting_patterns_and_relations_from_the_world_wide_web by User:Sgardine

Summary

The stated approach is to

  1. begin with a small seed set of known tuples (in this case the tuples are author-title pairings of known extant books)
  2. search for documents where the pairings co-occur
  3. extract from the documents the patterns of co-occurrence
  4. search for documents where the patterns occur
  5. use the patterns on those documents to add more tuples
  6. iterate until you're satisfied with your quantity of tuples
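
The loop above can be sketched in a few lines of Python. This is only a toy illustration, not the paper's implementation: the "web" is an in-memory list of one-line documents, a pattern is assumed to be the literal (prefix, middle, suffix) text around an author followed by a title, and the names DOCS, SEEDS, induce_patterns, apply_patterns and bootstrap are all made up for this example.

```python
import re

# Toy stand-in for the web: each string is one "document" (assumed data).
DOCS = [
    "<li>Isaac Asimov, <i>Foundation</i></li>",
    "<li>Frank Herbert, <i>Dune</i></li>",
    "Frank Herbert wrote Dune.",
    "Ursula K. Le Guin wrote The Dispossessed.",
]
SEEDS = {("Isaac Asimov", "Foundation")}


def induce_patterns(docs, tuples):
    """Steps 2-3: find documents where a known pair co-occurs and keep
    the literal text around it as a (prefix, middle, suffix) pattern."""
    patterns = set()
    for doc in docs:
        for author, title in tuples:
            a = doc.find(author)
            if a < 0:
                continue
            # Simplifying assumption: the title appears after the author.
            t = doc.find(title, a + len(author))
            if t < 0:
                continue
            patterns.add((doc[:a], doc[a + len(author):t], doc[t + len(title):]))
    return patterns


def apply_patterns(docs, patterns):
    """Steps 4-5: match every pattern against every document and read
    off new (author, title) pairs from the two gaps."""
    found = set()
    for prefix, middle, suffix in patterns:
        regex = re.compile(
            re.escape(prefix) + r"(.+?)" + re.escape(middle) + r"(.+?)" + re.escape(suffix)
        )
        for doc in docs:
            m = regex.fullmatch(doc)
            if m:
                found.add((m.group(1), m.group(2)))
    return found


def bootstrap(docs, seeds, max_rounds=5):
    """Step 6: repeat until no new tuples appear or max_rounds is hit."""
    tuples = set(seeds)
    for _ in range(max_rounds):
        new = apply_patterns(docs, induce_patterns(docs, tuples)) - tuples
        if not new:
            break
        tuples |= new
    return tuples


books = bootstrap(DOCS, SEEDS)
```

Starting from the single Asimov seed, the first round learns the list-item pattern and picks up the Herbert pair; the second round uses that pair to learn the "wrote" pattern and picks up the Le Guin pair. Nothing in this sketch rejects bad patterns, which is exactly where the real difficulty lies.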

Most of the difficulty lies in generating an appropriate set of patterns from the documents where tuples occur. A metric of pattern specificity is introduced, and a threshold on it is used to reject patterns that are too general. The system is run and the results are examined qualitatively. Unsurprisingly, the algorithm is quite sensitive to the contents of the seed set, and the results become quite noisy as many non-books are added as tuples.
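
To make the thresholding idea concrete, here is a minimal sketch. The scoring function is my own stand-in, not the paper's exact metric: it assumes a pattern is a (prefix, middle, suffix) triple of literal strings and scores it by the total length of that fixed text, on the theory that longer fixed text matches fewer things by accident. The names specificity and filter_patterns and the threshold value are hypothetical.

```python
def specificity(pattern):
    """Assumed proxy for specificity: total length of the pattern's
    fixed strings (prefix, middle, suffix)."""
    return sum(len(part) for part in pattern)


def filter_patterns(patterns, threshold):
    """Keep only patterns whose specificity reaches the threshold;
    anything below it is treated as too general."""
    return {p for p in patterns if specificity(p) >= threshold}


candidates = {
    ("<li>", ", <i>", "</i></li>"),  # fairly specific markup context
    ("", " ", ""),                   # would match almost any two strings
}
kept = filter_patterns(candidates, threshold=5)
```

With this threshold the single-space pattern is dropped and the markup pattern survives. Raising the threshold trades recall for precision, which is the tension discussed in the commentary below.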

Commentary

Since only the seed tuples are actually known to be true, the insights of multiple instance learning (MIL, as we saw in Bunescu 2007) would apply here as well; I guess that paper refines how tuples are added. It would be interesting to see how much more general the patterns could be allowed to be (increasing recall) once we add more sophisticated methods for pruning the extracted tuples.

I like that the online list of books is labelled "My favorite books". It is also interesting, historically, to see Google referred to purely as a research project.