Suranah writeup for Brin 1999
This is a review of brin_1999_extracting_patterns_and_relations_from_the_world_wide_web by user:Suranah.
The paper applies bootstrapping on some basic patterns to extract (author, title) pairs from the web. As with any other bootstrapping approach, the focus is on maximizing precision while relying on the scale of the web for recall. I am not sure, but this may be one of the earliest applications of bootstrapping to IE on the web, which may be due in part to the infrastructure and resources the authors had available. That said, the paper has several shortcomings, both computational and related to the (very) limited evaluation.
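To make the bootstrapping loop concrete, here is a minimal sketch of the idea as I understand it, not the paper's actual implementation. It assumes a toy in-memory corpus of one-line strings, a single seed pair, and a pattern that is just the text between title and author; the function names (`find_patterns`, `apply_patterns`, `bootstrap`) and the regexes for what counts as a title or author are hypothetical simplifications.

```python
import re

def find_patterns(corpus, pairs):
    """Collect the text that separates a known title from its author."""
    patterns = set()
    for line in corpus:
        for author, title in pairs:
            i = line.find(title)
            j = line.find(author)
            # Simplification: only handle lines where the title precedes the author.
            if i != -1 and j != -1 and i + len(title) < j:
                patterns.add(line[i + len(title):j])
    return patterns

def apply_patterns(corpus, patterns):
    """Use each learned separator to pull new (author, title) pairs out of the corpus."""
    pairs = set()
    for middle in patterns:
        # Assumed shapes: titles and authors start with a capital letter
        # and contain only word characters, spaces, and a little punctuation.
        regex = re.compile(
            r"^(?P<title>[A-Z][\w ']+?)" + re.escape(middle) + r"(?P<author>[A-Z][\w .]+)$"
        )
        for line in corpus:
            m = regex.match(line)
            if m:
                pairs.add((m.group("author"), m.group("title")))
    return pairs

def bootstrap(corpus, seeds, rounds=2):
    """Alternate between learning patterns from pairs and pairs from patterns."""
    pairs = set(seeds)
    for _ in range(rounds):
        patterns = find_patterns(corpus, pairs)
        pairs |= apply_patterns(corpus, patterns)
    return pairs

# Toy data, invented for illustration only.
corpus = [
    "Foundation by Isaac Asimov",
    "Dune by Frank Herbert",
    "Neuromancer, written by William Gibson",
    "Dune, written by Frank Herbert",
    "Great prices on used cars",
]
seeds = {("Isaac Asimov", "Foundation")}

for author, title in sorted(bootstrap(corpus, seeds)):
    print(author, "|", title)
```

In this toy run the first round learns the separator " by " from the seed and picks up the second book; the second round learns ", written by " from that new pair and reaches the third. Nothing in the loop checks whether an extracted pair is really a book, so one bad pair can yield a bad pattern that keeps propagating.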
I found the connection they draw to LSI interesting. Also, assigning different weights to different patterns may be able to model various genres, but it may just as well succumb to noise (for example, an article is classified as a book and the resulting pattern gets propagated), since there is no ground truth or held-out data to calibrate the weights.