Nschneid writeup of Brin 1999
This is Nschneid's review of Brin_1999_extracting_patterns_and_relations_from_the_world_wide_web
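The paper describes semisupervised learning of patterns and relations via bootstrapping: a small seed set of pairs is used to find occurrences on the web, patterns are induced from those occurrences, and the patterns in turn yield new pairs. Patterns for (author, title) pair extraction are regexps of the form <prefix> <author> <middle> <title> <suffix> or <prefix> <title> <middle> <author> <suffix>, coupled with a URL prefix that restricts which pages the pattern applies to.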
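As a rough illustration of what such a pattern looks like in practice, here is a minimal sketch. The `Pattern` class, its field names, and the permissive `.+?` capture groups are assumptions made for this example, not the regexps used in the paper, and the URL and HTML snippet are invented.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Pattern:
    """Hypothetical container for one extraction pattern."""
    url_prefix: str     # pattern only applies to URLs starting with this
    prefix: str         # literal text before the first field
    middle: str         # literal text between the two fields
    suffix: str         # literal text after the second field
    author_first: bool  # True: author precedes title; False: title first

    def regex(self):
        first, second = (
            ("author", "title") if self.author_first else ("title", "author")
        )
        # Assumption: fields are captured with a non-greedy wildcard.
        # This needs a non-empty suffix to capture more than one character.
        return re.compile(
            re.escape(self.prefix)
            + f"(?P<{first}>.+?)"
            + re.escape(self.middle)
            + f"(?P<{second}>.+?)"
            + re.escape(self.suffix)
        )

    def extract(self, url, text):
        """Return (author, title) pairs found in text, if the URL matches."""
        if not url.startswith(self.url_prefix):
            return []
        return [
            (m.group("author"), m.group("title"))
            for m in self.regex().finditer(text)
        ]


pattern = Pattern(
    url_prefix="http://example.org/books/",
    prefix="<li><b>",
    middle="</b> by ",
    suffix=" (",
    author_first=False,
)
page = (
    "<li><b>Foundation</b> by Isaac Asimov (1951)\n"
    "<li><b>Dune</b> by Frank Herbert (1965)"
)
pairs = pattern.extract("http://example.org/books/scifi.html", page)
```

The URL prefix acts as a cheap filter: the same surrounding text on an unrelated site is never matched.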
Heuristics for pattern extraction (§4.3) given a set of occurrences over multiple pages:
- Group instances according to their middle portion and the order of author with respect to title.
- For each group, build the pattern from three pieces: the URL prefix is the longest common prefix of all the URLs, the pattern prefix is the longest common suffix of the instance prefixes, and the pattern suffix is the longest common prefix of the instance suffixes.
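The two heuristics above can be sketched as follows. This is an illustrative sketch only: `Occurrence`, `generate_patterns`, and the sample data are hypothetical names and values, and the sketch leaves out any filtering of the resulting patterns.

```python
import os
from collections import defaultdict
from typing import NamedTuple


class Occurrence(NamedTuple):
    """Hypothetical record of one (author, title) occurrence on a page."""
    url: str
    prefix: str         # text immediately before the first field
    middle: str         # text between the two fields
    suffix: str         # text immediately after the second field
    author_first: bool  # order of author with respect to title


def common_prefix(strings):
    # os.path.commonprefix compares character by character.
    return os.path.commonprefix(list(strings))


def common_suffix(strings):
    # Longest common suffix = reversed common prefix of reversed strings.
    return common_prefix([s[::-1] for s in strings])[::-1]


def generate_patterns(occurrences):
    # Step 1: group by middle text and by field order.
    groups = defaultdict(list)
    for occ in occurrences:
        groups[(occ.middle, occ.author_first)].append(occ)

    # Step 2: derive one candidate pattern per group.
    patterns = []
    for (middle, author_first), group in groups.items():
        patterns.append({
            "url_prefix": common_prefix(o.url for o in group),
            "prefix": common_suffix([o.prefix for o in group]),
            "middle": middle,
            "suffix": common_prefix(o.suffix for o in group),
            "author_first": author_first,
            "n_instances": len(group),
        })
    return patterns


occurrences = [
    Occurrence("http://example.org/books/scifi.html",
               "foo <li><b>", "</b> by ", " (1951)</li>", False),
    Occurrence("http://example.org/books/fantasy.html",
               "bar\n<li><b>", "</b> by ", " (1937)</li>", False),
    Occurrence("http://example.org/authors.html",
               "<p>", ", author of ", ".</p>", True),
]
patterns = generate_patterns(occurrences)
```

Note that the derived suffix for the first group is " (19", which happens to be shared by both years; common-substring heuristics can pick up accidental regularities like this.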
To keep the bootstrapping from drifting and to avoid overly general patterns, each candidate pattern is given a score that grows with both the character length of the pattern's components and the number of matched instances, and the pattern is kept only if the score exceeds a threshold. This encourages a compromise between specificity (favoring long, detailed patterns) and generality (favoring many matching instances).
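A minimal sketch of such a filter, assuming the score is simply total character length times instance count, as described here. The function names and the threshold value are hypothetical, and the exact formula in the paper may differ (a product of component lengths would be another plausible choice).

```python
def specificity_score(url_prefix, prefix, middle, suffix, n_instances):
    """Assumed score: total pattern length times number of matched instances."""
    length = len(url_prefix) + len(prefix) + len(middle) + len(suffix)
    return length * n_instances


def keep_pattern(url_prefix, prefix, middle, suffix, n_instances, threshold):
    """Keep a candidate pattern only if its score exceeds the threshold."""
    score = specificity_score(url_prefix, prefix, middle, suffix, n_instances)
    return score > threshold


# A long pattern backed by only two instances.
detailed = specificity_score(
    "http://example.org/books/", "<li><b>", "</b> by ", " (19", 2
)

# A very short pattern backed by ten instances.
generic = specificity_score("", "", ", ", "", 10)
```

With a threshold of 50, the detailed pattern (score 88) is kept even though it has few instances, while the generic one (score 20) is rejected despite matching more often.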
Presumably there was no large labeled data set when this was written, so the evaluation is qualitative. The approach seemed to work well, especially when HTML document structure could be leveraged.
- Are there better ways to evaluate a system for this type of task absent any gold standard data?