Rbalasub writeup of Banko et al.

From Cohen Courses

A review of Banko_2007_open_information_extraction_from_the_web by user:rbalasub

The authors argue that domain-specific IE systems, which work on homogeneous data and rely on heavy linguistic methods, do not scale to corpora the size of the Web. They present a new open IE system, TextRunner, and compare it to KnowItAll. The obvious improvement over KnowItAll, which requires relation names as input and uses POS taggers, is TextRunner's open nature: there is no need to specify a priori the list of relations we are interested in.

TextRunner has three components:

  • Trainer: Uses a dependency parser to label its own training examples, then trains a Naive Bayes (NB) classifier on them.
  • Extractor: Uses POS taggers and a lightweight NP chunker to propose candidate relations, applies heuristics, and then uses the trained classifier to decide whether each relation should be retained.
  • Redundancy Assessor: Normalizes the extracted relations, maintains counts of how often each one is seen, and uses those counts to estimate the probability that a relation holds.
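
The counting step of the Redundancy Assessor can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the list of dropped words, and the noisy-or formula with a fixed per-sighting probability of 0.5 are all assumptions made for illustration, standing in for whatever probability model the system actually uses.

```python
from collections import Counter

# Hypothetical list of words dropped during normalization (an assumption).
DROPPED_WORDS = {"the", "a", "an", "very", "really"}


def normalize(text):
    """Lowercase the text and drop a few modifiers so near-duplicates merge."""
    tokens = [t for t in text.lower().split() if t not in DROPPED_WORDS]
    return " ".join(tokens)


def count_tuples(extractions):
    """Count how often each normalized (arg1, relation, arg2) tuple was seen."""
    counts = Counter()
    for arg1, rel, arg2 in extractions:
        counts[(normalize(arg1), normalize(rel), normalize(arg2))] += 1
    return counts


def noisy_or(count, p_single=0.5):
    """Assumed stand-in model: chance that at least one of `count`
    independent sightings is correct, each with probability `p_single`."""
    return 1.0 - (1.0 - p_single) ** count


# Toy extractions, invented for this example.
extractions = [
    ("Edison", "invented", "the light bulb"),
    ("edison", "invented", "light bulb"),
    ("Edison", "really invented", "a light bulb"),
    ("Paris", "is capital of", "Germany"),
]

counts = count_tuples(extractions)
for triple, n in counts.items():
    print(triple, n, round(noisy_or(n), 3))
```

In this toy run the three Edison variants collapse into one tuple with a count of 3, so it scores higher than the single erroneous Paris tuple. That is the intuition behind using redundancy across a large corpus to filter extractions.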

In summary, this paper describes the TextRunner Open IE system and presents results that compare it favorably to KnowItAll.