Liuliu writeup of Banko 2007

From Cohen Courses

This is a review of Banko_2007_open_information_extraction_from_the_web by user:Liuliu.

As the title suggests, the paper addresses two main questions: (1) how to do information extraction in an open domain, and (2) how to do it efficiently at web scale. The authors built a new system called TextRunner, which makes a single pass over the text and extracts relations without any pre-defined, human-written patterns.

  • For the first problem: TextRunner learns relation patterns from a small corpus in a self-supervised way. It requires no human labeling; instead, it generates its own training tuples using linguistic tools such as dependency parsing. A Naive Bayes classifier is then trained on those tuples using a set of simple features.
  • For the second problem: The extraction phase of TextRunner does not rely on heavy linguistic processing, only on simple surface-level features that are fed to the classifier. It makes a single pass over the whole corpus.
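To make the classifier step concrete, here is a minimal sketch of a Naive Bayes classifier over surface features, written from scratch. This is my own illustration, not the paper's implementation: the feature names (`few_tokens_between`, `rel_has_verb`, and so on) and the tiny training set are hypothetical stand-ins for the tuples that the parser-based step would label automatically.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(examples):
    """Count labels and per-label feature occurrences.

    examples: list of (feature_list, label) pairs.
    """
    label_counts = Counter()
    feature_counts = defaultdict(Counter)
    vocabulary = set()
    for features, label in examples:
        label_counts[label] += 1
        for f in features:
            feature_counts[label][f] += 1
            vocabulary.add(f)
    return {"labels": label_counts, "features": feature_counts, "vocab": vocabulary}


def classify(model, features):
    """Return the label with the highest log-posterior (add-one smoothing)."""
    total = sum(model["labels"].values())
    vocab_size = len(model["vocab"])
    best_label, best_score = None, float("-inf")
    for label, count in model["labels"].items():
        score = math.log(count / total)
        denom = sum(model["features"][label].values()) + vocab_size
        for f in features:
            score += math.log((model["features"][label][f] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training tuples: True = trustworthy extraction, False = not.
# In the real system these labels would come from the self-supervised step.
training = [
    (["few_tokens_between", "rel_has_verb", "e1_proper_noun"], True),
    (["few_tokens_between", "rel_has_verb"], True),
    (["many_tokens_between", "crosses_clause"], False),
    (["many_tokens_between", "rel_no_verb"], False),
]
model = train_naive_bayes(training)
```

Calling `classify(model, ["few_tokens_between", "rel_has_verb"])` accepts the candidate, while a candidate with `many_tokens_between` and `crosses_clause` is rejected. The point is that classification at extraction time needs only cheap features, so no parser has to run over the full corpus.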

They also address two practical problems for TextRunner: how to remove redundancy and how to speed up querying. Simple methods are used to remove redundancy, although word sense disambiguation remains a big problem for this step. They use inverted indexing, as in IR, to support fast querying of the extracted relations.
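The inverted-index idea can be sketched in a few lines. This is only an illustration under my own assumptions: the tuples are made up, tokenization is plain whitespace splitting, and the function names `build_index` and `query` are hypothetical rather than anything from the paper.

```python
from collections import defaultdict


def build_index(tuples):
    """Map each lowercased term to the ids of the tuples that contain it."""
    index = defaultdict(set)
    for tuple_id, triple in enumerate(tuples):
        for field in triple:
            for term in field.lower().split():
                index[term].add(tuple_id)
    return index


def query(index, tuples, text):
    """Return the tuples that contain every term of the query text."""
    terms = text.lower().split()
    if not terms:
        return []
    # Intersect the posting sets of all query terms.
    ids = set.intersection(*(index.get(t, set()) for t in terms))
    return [tuples[i] for i in sorted(ids)]


# Made-up (entity, relation, entity) tuples for illustration only.
extracted = [
    ("Edison", "invented", "the light bulb"),
    ("Tesla", "invented", "the induction motor"),
    ("Edison", "founded", "General Electric"),
]
index = build_index(extracted)
matches = query(index, extracted, "edison invented")
```

Here `matches` contains only the first tuple, since it is the only one containing both query terms. A lookup touches only the posting sets of the query terms instead of scanning every tuple, which is what makes querying fast on a large collection.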

Results show that it performs better than the state-of-the-art KnowItAll system in terms of average error rate. They also analyze the set of extracted patterns. I like the figure illustrating the different kinds of relations. However, I think the redundancy problem is much more severe than they estimated.