Wka writeup of Banko 2007

From Cohen Courses

This is a review of banko_2007_open_information_extraction_from_the_web by user:wka.

The authors present the paradigm of open information extraction (OIE), in which large sets of relational tuples are extracted from text without requiring any human input. They also present TextRunner, a complete OIE system that can handle relational user queries. TextRunner is more efficient than their previous KnowItAll system.

TextRunner consists of three main modules:

  • The self-supervised learner: trains a Naive Bayes (NB) classifier on training data that it labels itself as positive or negative
  • The single-pass extractor
  • The redundancy-based assessor: uses the number of distinct sentences from which a tuple was extracted to estimate its probability of correctness, based on the authors' earlier Urns model
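
As a rough illustration of the redundancy-based assessor, the sketch below counts the distinct sentences that support each tuple and maps that count to a confidence score. The noisy-or formula and the per-sentence precision `p` are placeholders assumed here for illustration only; they are not the Urns model that TextRunner uses. The function names and the toy data are hypothetical.

```python
from collections import defaultdict

def count_support(extractions):
    """Map each tuple to the number of distinct sentences it was extracted from."""
    sentences = defaultdict(set)
    for tup, sentence in extractions:
        sentences[tup].add(sentence)  # a set ignores repeated sentences
    return {tup: len(s) for tup, s in sentences.items()}

def noisy_or(k, p=0.5):
    """Stand-in score (NOT the Urns model): probability that at least one of
    k extractions is correct, assuming each is independently correct with
    probability p. Both the formula and p are assumptions."""
    return 1.0 - (1.0 - p) ** k

# Toy (tuple, source sentence) pairs, made up for this example.
extractions = [
    (("Edison", "invented", "light bulb"), "Edison invented the light bulb."),
    (("Edison", "invented", "light bulb"), "It is said that Edison invented the light bulb."),
    (("Edison", "invented", "light bulb"), "Edison invented the light bulb."),  # repeated sentence
    (("Edison", "invented", "telephone"), "Some say Edison invented the telephone."),
]

support = count_support(extractions)
scores = {tup: noisy_or(k) for tup, k in support.items()}
```

The repeated sentence is counted once, so the first tuple has two supporting sentences and the second has one. Under this stand-in score, a tuple seen in more distinct sentences always receives a higher confidence.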

Evaluating their results:

  • Correctness: first check that the tuple's relation is well-formed, then that the entities in the tuple are well-formed; well-formed tuples are then classified as concrete or abstract
  • Number of distinct facts: detect when two relations are synonymous, and merge relations that differ only slightly
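
One way to picture the merging of slightly different relations is a normalization step applied before counting. The sketch below is a minimal version under stated assumptions: the writeup does not say which differences count as small, so the rules here (ignore case and surrounding punctuation, drop leading stopwords and auxiliary verbs) and the word lists are hypothetical. It does not attempt to detect true synonymy between relations.

```python
# Hypothetical word lists, chosen only for this illustration.
LEADING_STOPWORDS = {"that", "who", "which"}
AUXILIARIES = {"is", "are", "was", "were", "be", "been", "has", "have", "had"}

def normalize_relation(relation):
    """Reduce a relation string to a canonical form so near-duplicates collide."""
    tokens = [t.strip(".,;:!?\"'").lower() for t in relation.split()]
    tokens = [t for t in tokens if t]
    # Drop leading stopwords such as "who" or "which".
    while tokens and tokens[0] in LEADING_STOPWORDS:
        tokens.pop(0)
    kept = [t for t in tokens if t not in AUXILIARIES]
    # Fall back to the unfiltered tokens if the relation was only auxiliaries.
    return " ".join(kept or tokens)

def distinct_facts(tuples):
    """Return the set of tuples that remain distinct after normalization."""
    return {
        (e1.lower(), normalize_relation(rel), e2.lower())
        for e1, rel, e2 in tuples
    }

# Toy tuples, made up for this example.
tuples = [
    ("Edison", "invented", "the phonograph"),
    ("Edison", "has invented", "the phonograph"),
    ("Edison", "who invented,", "the phonograph"),
    ("Edison", "was born in", "Ohio"),
]
facts = distinct_facts(tuples)
```

Here the first three tuples collapse into a single fact with the relation `invented`, so the four extracted tuples yield two distinct facts.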