Yandongl writeup of Banko 2007
This is a review of Banko_2007_open_information_extraction_from_the_web by user:Yandongl.
This paper introduces TEXTRUNNER, a new information extraction system with three main features:
(1) it works on open domains
(2) it extracts through unsupervised learning
(3) it is highly efficient
Like traditional IE systems such as KnowItAll, TEXTRUNNER starts with a small number of training examples. What sets it apart is that it can then extract facts and relations automatically. It does this by labeling examples itself and building a Naive Bayes classifier, which is then used by the Extractor module. In addition, unlike most other IE systems, which often rely on a linguistic parser to help extraction, the Extractor does not use a parser. It identifies candidate tuples directly in a single pass, which makes it efficient and able to scale to the Web.
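To make the "label examples, then build a Naive Bayes classifier" step more concrete, here is a minimal sketch in Python. It is not the authors' implementation: the feature names (rel_has_verb, short_span, and so on) and the four training examples are made up for illustration, and the classifier is a plain multinomial-style Naive Bayes with add-one smoothing.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """Count labels and per-label features from (features, label) pairs."""
    label_counts = Counter()
    feature_counts = defaultdict(Counter)
    vocab = set()
    for features, label in examples:
        label_counts[label] += 1
        for f in features:
            feature_counts[label][f] += 1
            vocab.add(f)
    return label_counts, feature_counts, vocab


def classify(model, features):
    """Return the label with the highest smoothed log-probability."""
    label_counts, feature_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        score = math.log(count / total)  # log prior
        denom = sum(feature_counts[label].values()) + len(vocab)
        for f in features:
            if f in vocab:  # ignore features never seen in training
                score += math.log((feature_counts[label][f] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical self-labeled candidates: True = trustworthy tuple.
training = [
    (["rel_has_verb", "short_span", "e2_not_pronoun"], True),
    (["rel_has_verb", "short_span"], True),
    (["long_span", "e2_is_pronoun"], False),
    (["long_span", "crosses_clause"], False),
]
model = train(training)
print(classify(model, ["rel_has_verb", "short_span"]))
print(classify(model, ["long_span", "crosses_clause"]))
```

The point of the sketch is that classification only needs cheap per-candidate features and a few table lookups, which fits the review's remark that the Extractor can run without a parser.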
To allow a comparison with KnowItAll, the authors restrict TEXTRUNNER to a fixed set of relations. Under this setting, TEXTRUNNER reduces the error rate by 33% while generating about the same number of facts. The part on estimating the number of distinct facts, however, is tricky. Although the authors propose a set of techniques such as removing leading/trailing words, I think the 92% distinctness figure is still a considerable overestimate. Many different words can express the same sense, so without deep linguistic analysis it is hard to get even a rough estimate of the number of distinct extractions.
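The concern about overestimating distinctness can be illustrated with a small Python sketch. It assumes a surface-level normalization that only lowercases and strips leading/trailing filler words; the STRIP_WORDS list and the example tuples are hypothetical and are not taken from the paper.

```python
# Hypothetical filler words to strip from the edges of a relation string.
STRIP_WORDS = {"has", "have", "had", "also", "really", "just"}


def normalize_relation(relation, strip_words=STRIP_WORDS):
    """Lowercase a relation and drop leading/trailing filler words."""
    tokens = relation.lower().split()
    while tokens and tokens[0] in strip_words:
        tokens.pop(0)
    while tokens and tokens[-1] in strip_words:
        tokens.pop()
    return " ".join(tokens)


def count_distinct(tuples):
    """Count tuples that differ after surface normalization."""
    return len({(e1.lower(), normalize_relation(r), e2.lower())
                for e1, r, e2 in tuples})


# Made-up extractions that all express the same underlying fact.
tuples = [
    ("Edison", "invented", "the light bulb"),
    ("Edison", "has also invented", "the light bulb"),
    ("Edison", "created", "the light bulb"),
    ("Edison", "devised", "the light bulb"),
]
print(count_distinct(tuples))
```

Stripping edge words merges the first two tuples, but "created" and "devised" are still counted separately, so one fact is reported as three. Merging them would need some notion of synonymy, which surface normalization alone does not provide.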