Yandongl writeup of Brin 1999

From Cohen Courses

This is a review of the paper Brin_1999_extracting_patterns_and_relations_from_the_world_wide_web by user:Yandongl.

This paper proposes DIPRE (Dual Iterative Pattern Relation Expansion), which exploits the duality between patterns and relations to extract relations from the Web. One specific problem, books, is studied, and the relation extracted is the (author, title) pair. Precision matters more than recall, since a high error rate is not acceptable.

The authors start with a small seed set (5 examples), find occurrences of those examples on the Web, generate patterns from the occurrences, and then use the generated patterns to find more instances. This process is iterated to extract more patterns and more pairs. One problem is that bogus items may be extracted, which can cause the extraction to drift off topic. This can be controlled, however.
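The loop described above can be sketched in a few lines of Python. This is only a toy illustration, not the paper's implementation: it assumes an in-memory list of text lines instead of the Web, assumes the author always appears before the title, and treats a pattern as a plain (prefix, middle, suffix) triple. The names `find_contexts`, `apply_pattern`, and `dipre` are made up for this example.

```python
import re

def find_contexts(pairs, lines):
    """Find where known (author, title) pairs occur and return their contexts."""
    contexts = set()
    for author, title in pairs:
        for line in lines:
            m = re.search(re.escape(author) + r"(.*?)" + re.escape(title), line)
            if m and m.group(1):  # skip empty middles, they are too general
                contexts.add((line[:m.start()], m.group(1), line[m.end():]))
    return contexts

def apply_pattern(pattern, lines):
    """Use one (prefix, middle, suffix) pattern to extract new pairs."""
    prefix, middle, suffix = pattern
    regex = re.compile(
        re.escape(prefix) + r"(.+?)" + re.escape(middle) + r"(.+?)" + re.escape(suffix)
    )
    found = set()
    for line in lines:
        m = regex.fullmatch(line)
        if m:
            found.add((m.group(1), m.group(2)))
    return found

def dipre(seeds, lines, max_iters=5):
    """Alternate between pattern generation and pair extraction until no growth."""
    pairs = set(seeds)
    for _ in range(max_iters):
        patterns = find_contexts(pairs, lines)
        found = {pair for pat in patterns for pair in apply_pattern(pat, lines)}
        if found <= pairs:  # nothing new was extracted, so stop
            break
        pairs |= found
    return pairs

# Toy corpus (invented for illustration)
lines = [
    "<li>Isaac Asimov, <i>The Robots of Dawn</i></li>",
    "<li>Charles Dickens, <i>Great Expectations</i></li>",
    "<li>Jane Austen, <i>Emma</i></li>",
    "<p>William Shakespeare wrote Hamlet.</p>",
]
seeds = {("Isaac Asimov", "The Robots of Dawn")}
books = dipre(seeds, lines)
```

With one seed, the sketch learns the list-item pattern and picks up the other two books that share it, while the Shakespeare line is left alone because no seed occurs in that context.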

Each pattern is defined as a tuple, and heuristics are used to generate patterns from occurrences. Patterns must not be too general, otherwise they would produce non-books.
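A rough sketch of how such a pattern might be built and filtered is given below. It assumes the paper's pattern fields (URL prefix, prefix, middle, suffix; the order flag is left out here) and its specificity idea, where the product of the field lengths times the number of supporting occurrences must exceed a threshold. The class names, the `generate_pattern` helper, and the threshold value of 1000 are assumptions for illustration, not taken from the paper.

```python
import os
from typing import NamedTuple, Optional

class Occurrence(NamedTuple):
    url: str
    prefix: str   # text before the first field
    middle: str   # text between author and title
    suffix: str   # text after the second field

class Pattern(NamedTuple):
    urlprefix: str
    prefix: str
    middle: str
    suffix: str

def common_prefix(strings):
    """Longest common leading substring (character-wise)."""
    return os.path.commonprefix(list(strings))

def common_suffix(strings):
    """Longest common trailing substring, via reversing each string."""
    return common_prefix([s[::-1] for s in strings])[::-1]

def generate_pattern(occurrences, threshold=1000) -> Optional[Pattern]:
    """Build one pattern from occurrences that share the same middle.

    Returns None if the middles differ or the pattern is too general.
    """
    middles = {o.middle for o in occurrences}
    if len(middles) != 1:
        return None
    p = Pattern(
        urlprefix=common_prefix(o.url for o in occurrences),
        prefix=common_suffix([o.prefix for o in occurrences]),
        middle=middles.pop(),
        suffix=common_prefix(o.suffix for o in occurrences),
    )
    # Short fields mean a loose pattern; an empty field gives specificity 0.
    specificity = len(p.urlprefix) * len(p.prefix) * len(p.middle) * len(p.suffix)
    if specificity * len(occurrences) <= threshold:
        return None
    return p

# Two occurrences from the same (invented) site share enough context.
specific = generate_pattern([
    Occurrence("http://example.com/books/a.html", "<ul><li>", ", <i>", "</i></li>"),
    Occurrence("http://example.com/books/b.html", "<p><li>", ", <i>", "</i> (1990)"),
])
# Two occurrences with nothing in common but the middle are rejected.
too_general = generate_pattern([
    Occurrence("http://a.com/x", "foo ", " by ", "."),
    Occurrence("https://b.org/y", "bar", " by ", "!"),
])
```

The second call returns `None` because the shared prefix and suffix are empty, which is the kind of overly general pattern that would start matching non-books.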

One problem is that, since the 5 seed books are not representative enough, most of the books extracted are science fiction.

The techniques used in this paper are simple and seem to work well. However, today's Web has changed significantly (in scale, structure, etc.), so these techniques may no longer apply.