Sgopal1 writeup of Extracting patterns from WWW
This is a review of the paper Brin_1999_extracting_patterns_and_relations_from_the_world_wide_web by user:sgopal1.
In this paper, the author describes preliminary work on relation extraction from a web corpus. The main motivation is that the Web holds a huge amount of partially structured data that can be extracted. Brin focuses on extracting (author, title) pairs. The main idea is simple: a bootstrapping approach that alternates between generating patterns and generating relations. At each iteration one of the two is kept fixed and the other is extracted or refined. Brin fixes a particular form of pattern, which contains information about the URL, prefix, suffix, etc. He also prefers to add only high-precision patterns, because recall is not a problem when the corpus is huge (it performs reasonably well). Because of computational issues, the patterns are kept simple.
Criticism
- I don't think randomly looking at 20 documents is a convincing evaluation.
- He states that specificity ~ log(P(..)), but the equation uses a product of terms. Shouldn't it be a summation (because of the log), unless the probability p(x) has a form like x raised to the x or something similar?
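
To make the last point concrete, here is a small numeric sketch. It assumes, purely for illustration, that the terms in the equation are string lengths of pattern components and that each character matches independently with probability 1/k, so a component of length n matches with probability k^(-n). The function names, the alphabet size k, and the example lengths are all hypothetical and not taken from the paper.

```python
import math

def neg_log_prob(lengths, k=26):
    """-log P(match), assuming each character matches independently with probability 1/k."""
    prob = 1.0
    for n in lengths:
        prob *= k ** (-n)  # probability that a component of length n matches
    return -math.log(prob)

def sum_score(lengths, k=26):
    """Score built from a SUM of the lengths, scaled by log k."""
    return sum(lengths) * math.log(k)

def product_score(lengths):
    """Score built from a PRODUCT of the lengths."""
    return math.prod(lengths)

lengths = [3, 5, 2, 4]  # hypothetical component lengths

print(neg_log_prob(lengths))   # negative log-probability under the assumed model
print(sum_score(lengths))      # agrees with the line above up to rounding
print(product_score(lengths))  # a different quantity altogether
```

Under this assumed model the negative log-probability equals the sum-based score, not the product-based one. A product also drops to zero whenever any single component is empty, which a sum does not. Whether this matches the probability model Brin had in mind is not something the sketch can settle.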