Sgardine writesup Brin 1999

This is a review of the paper Brin_1999_extracting_patterns_and_relations_from_the_world_wide_web by User:Sgardine

Summary

The stated approach is to

begin with a small seed set of known tuples (in this case the tuples are author-title pairings of known extant books)
search for documents where the pairings co-occur
extract from the documents the patterns of co-occurrence
search for documents where the patterns occur
use the patterns on those documents to add more tuples
iterate until you're satisfied with your quantity of tuples
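
The loop above can be sketched on a toy scale. This is only an illustration, not the paper's implementation: the corpus is an in-memory list of strings instead of the web, a pattern is just a hypothetical (prefix, middle, suffix) triple of literal context strings, and the title is assumed to precede the author. The names find_occurrences, apply_patterns, bootstrap, DOCS and SEED are all made up for this sketch.

```python
import re

def find_occurrences(docs, tuples, ctx=2):
    """Collect (prefix, middle, suffix) contexts where a known pair co-occurs."""
    patterns = set()
    for doc in docs:
        for author, title in tuples:
            # Assumed simplification: title comes first, author follows closely.
            m = re.search(re.escape(title) + r"(.{1,20}?)" + re.escape(author), doc)
            if m:
                prefix = doc[max(0, m.start() - ctx):m.start()]
                suffix = doc[m.end():m.end() + ctx]
                patterns.add((prefix, m.group(1), suffix))
    return patterns

def apply_patterns(docs, patterns):
    """Use each pattern as a literal template to pull out new (author, title) pairs."""
    found = set()
    for prefix, middle, suffix in patterns:
        rx = re.compile(re.escape(prefix) + r"(.+?)" + re.escape(middle)
                        + r"(.+?)" + re.escape(suffix))
        for doc in docs:
            for m in rx.finditer(doc):
                found.add((m.group(2), m.group(1)))
    return found

def bootstrap(docs, seed, rounds=3):
    """Alternate between pattern extraction and tuple extraction."""
    tuples = set(seed)
    for _ in range(rounds):
        new = apply_patterns(docs, find_occurrences(docs, tuples))
        if new <= tuples:  # nothing new was found, so stop early
            break
        tuples |= new
    return tuples

# Toy corpus and seed, invented for this example.
DOCS = [
    "<li><b>Foundation</b> by Isaac Asimov (1951)</li>",
    "<li><b>Dune</b> by Frank Herbert (1965)</li>",
    "<li><b>Neuromancer</b> by William Gibson (1984)</li>",
]
SEED = {("Isaac Asimov", "Foundation")}

books = bootstrap(DOCS, SEED)
```

Here the single seed pair yields one pattern, which in turn picks up the other two pairs. Note that nothing in this sketch rejects overly general patterns, which is exactly where the difficulty discussed next comes in.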

Most of the difficulty lies in generating an appropriate set of patterns from documents where tuples occur. A metric of pattern specificity is introduced and a threshold established to reject too-general rules. The system is run and the results qualitatively examined. Unsurprisingly, the algorithm is quite sensitive to the contents of the seed set, and the results become quite noisy as many non-books are added as tuples.
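
A rough sketch of what such a specificity filter might look like follows. It assumes, as I recall from the paper, that specificity is approximated by the product of the lengths of a pattern's fixed parts and that a pattern is kept only when specificity times the number of supporting tuples exceeds a threshold; the exact formula should be checked against the paper. The Pattern fields, the function names and the threshold value are assumptions made for illustration.

```python
from typing import NamedTuple

class Pattern(NamedTuple):
    # Hypothetical field names for the fixed parts of a pattern.
    urlprefix: str
    prefix: str
    middle: str
    suffix: str

def specificity(p: Pattern) -> int:
    """Assumed metric: longer fixed strings mean a more specific pattern."""
    return len(p.urlprefix) * len(p.prefix) * len(p.middle) * len(p.suffix)

def keep_pattern(p: Pattern, n_supporting: int, threshold: int) -> bool:
    """Keep a pattern only if it is backed by several tuples and is specific enough."""
    return n_supporting > 1 and specificity(p) * n_supporting > threshold

p = Pattern("http://x.org/", "<b>", "</b> by ", " (")
too_general = Pattern("http://x.org/", "<b>", "", " (")

kept = keep_pattern(p, n_supporting=2, threshold=1000)
rejected = keep_pattern(too_general, n_supporting=5, threshold=1000)
```

Under this assumed metric any pattern with an empty component has specificity zero and is always rejected, no matter how many tuples support it.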

Commentary

Since only the seed tuples are actually known to be true, the insights of multiple-instance learning (MIL, as we saw in Bunescu 2007) would apply here as well -- I guess that paper refines how tuples are added. It would be interesting to see how much more general the rules could be allowed to be (increasing recall) if we added more sophisticated methods for pruning the extracted tuples.

I like that the online list of books is labelled "My favorite books". More generally, it is historically interesting to see Google referred to as purely a research project.
