Mnduong writeup of Banko et al. 2007

This is a review of Banko_2007_open_information_extraction_from_the_web by user:mnduong.

  • This paper introduces Open Information Extraction and TextRunner, an OIE system.
  • Open Information Extraction differs from closed IE in that it does not assume a given list of target relations to extract. Extraction at Web scale also makes it impractical to run dependency parsers or named-entity recognizers (NERs), at least at extraction time.
  • TextRunner has three subsystems: the Learner, the Extractor and the Assessor. The Learner trains a Naive Bayes classifier on a sample of several thousand sentences. It first parses each sentence, then treats each pair of base noun phrases as the candidate arguments of a relation, and uses heuristic criteria on the parse to label each candidate as positive or negative. After this self-labeling step, it extracts shallow features (ones that do not require a parser) from the labeled examples and trains the classifier on them. The Extractor runs a POS tagger and a noun phrase chunker over the entire corpus, then uses the classifier to score the candidate tuples from each sentence, keeping only the highest-scoring one. Finally, the Assessor assigns a probability to each extracted relation based on the number of sentences in which it appears.
  • The experiments compared TextRunner to KnowItAll, a system that requires prior knowledge of the relations to extract. On a corpus of 9 million web pages, TextRunner achieved higher accuracy than KnowItAll at similar recall.
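
The three subsystems summarized above can be sketched as a toy pipeline. This is only an illustration under loud assumptions, not the paper's implementation: the field names (parse_path_len, crosses_clause, tokens_between, pos_between), the thresholds, and the two features are all hypothetical, and the Assessor here uses a simple noisy-or over supporting sentences as a stand-in for the probabilistic model the paper actually uses.

```python
import math
from collections import Counter, defaultdict

def self_label(c):
    """Learner: hypothetical parse-based heuristic (thresholds are assumptions)."""
    return c["parse_path_len"] <= 4 and not c["crosses_clause"]

def shallow_features(c):
    """Features that need only POS tags and token counts, not a parser."""
    gap = "short_gap" if c["tokens_between"] <= 4 else "long_gap"
    verb = "has_verb" if any(t.startswith("VB") for t in c["pos_between"]) else "no_verb"
    return [gap, verb]

class NaiveBayes:
    """Binary Naive Bayes over categorical features with add-one smoothing."""
    def __init__(self):
        self.class_counts = Counter()
        self.feat_counts = defaultdict(Counter)
        self.vocab = set()

    def fit(self, examples):
        for feats, label in examples:
            self.class_counts[label] += 1
            for f in feats:
                self.feat_counts[label][f] += 1
                self.vocab.add(f)

    def prob_positive(self, feats):
        total = sum(self.class_counts.values())
        log_scores = {}
        for label in (True, False):
            denom = sum(self.feat_counts[label].values()) + len(self.vocab)
            score = math.log((self.class_counts[label] + 1) / (total + 2))
            for f in feats:
                score += math.log((self.feat_counts[label][f] + 1) / denom)
            log_scores[label] = score
        # Normalize in log space to avoid underflow.
        m = max(log_scores.values())
        pos = math.exp(log_scores[True] - m)
        neg = math.exp(log_scores[False] - m)
        return pos / (pos + neg)

def extract_best(clf, sentence_candidates):
    """Extractor: score every candidate tuple in a sentence, keep the best one."""
    scored = [(tup, clf.prob_positive(shallow_features(c))) for tup, c in sentence_candidates]
    return max(scored, key=lambda pair: pair[1])

def assess(extractions, per_sentence_precision=0.7):
    """Assessor stand-in: noisy-or over the distinct sentences supporting a tuple.
    per_sentence_precision is an assumed constant, not a value from the paper."""
    support = defaultdict(set)
    for sentence_id, tup in extractions:
        support[tup].add(sentence_id)
    return {tup: 1 - (1 - per_sentence_precision) ** len(ids) for tup, ids in support.items()}

# Learner: self-label a tiny parsed sample, then train on shallow features only.
sample = [
    {"parse_path_len": 2, "crosses_clause": False, "tokens_between": 1, "pos_between": ["VBD"]},
    {"parse_path_len": 3, "crosses_clause": False, "tokens_between": 2, "pos_between": ["VBZ", "IN"]},
    {"parse_path_len": 7, "crosses_clause": True, "tokens_between": 9, "pos_between": ["CC", "DT", "NN"]},
    {"parse_path_len": 6, "crosses_clause": False, "tokens_between": 8, "pos_between": ["IN", "DT"]},
]
clf = NaiveBayes()
clf.fit([(shallow_features(c), self_label(c)) for c in sample])

# Extractor: candidates here carry only shallow fields, since no parser runs at this stage.
sentence_candidates = [
    (("Edison", "invented", "the phonograph"), {"tokens_between": 1, "pos_between": ["VBD"]}),
    (("Edison", "in", "1877"), {"tokens_between": 6, "pos_between": ["DT", "NN", "IN"]}),
]
best_tuple, best_prob = extract_best(clf, sentence_candidates)

# Assessor: a tuple seen in more distinct sentences gets a higher probability.
scores = assess([(1, best_tuple), (2, best_tuple), (2, best_tuple), (3, ("A", "rel", "B"))])
```

The point of the split is that the expensive parser is needed only in self_label, on the small training sample, while extract_best touches only cheap shallow features and can therefore be run over the whole corpus.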

Questions/Comments:

  • If the entire process is unsupervised, without the need for hand-tagged labels, how was the accuracy calculated in the comparison against KnowItAll?
  • "Still, a large proportion of the errors of both systems were from noun phrase analysis, where arguments were truncated or stray words added. It is difficult to find extraction boundaries accurately when the intended type of arguments such as company names, person names, or book titles is not specified to the system." Isn't the KnowItAll system supposed to know these arguments in advance?
  • In general, the paper is quite easy to follow, and it clearly points out how it differs from earlier systems and paradigms.