Banko et al IJCAI 2007



Etzioni, O., Banko, M., Soderland, S., and Weld, D. S. 2008. Open information extraction from the web. Commun. ACM 51, 12 (Dec. 2008), 68-74.

Online version

ACM Digital Library


This paper describes an Open Information Extraction system, TEXTRUNNER, run on a corpus of 9 million web pages. In most IE systems, relations have to be predefined before query time; this paper addresses that limitation with a system that extracts relations automatically. It also resolves some issues in a previous system, KNOWITALL. There are three important components in TEXTRUNNER:

  • Self-supervised Learner
    • An unlexicalized parser (Klein & Manning, ACL 2003) was used to generate training data for a Naive Bayes classifier.
  • Extraction
    • Sentences are tagged and shallow-parsed using OpenNLP tools. Noun phrases are candidate entities, and each relation is a sequence of words between two entities, selected using heuristics.
  • Assessment
    • Only relations with high likelihood are selected. The likelihood is estimated using the urn model proposed in (Downey et al, IJCAI 2005).
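
The extraction and assessment steps above can be illustrated with a minimal sketch. This is not TEXTRUNNER's actual code. It assumes sentences arrive already chunked as (text, label) pairs, where the labels "NP" and "O" are hypothetical stand-ins for shallow-parser output. The `max_gap` heuristic is likewise an assumption, and a plain frequency threshold (`min_count`) is swapped in for both the Naive Bayes classifier and the urn model, which are not reproduced here.

```python
from collections import Counter


def extract_triples(chunks, max_gap=4):
    """Yield (entity1, relation, entity2) for each pair of adjacent noun phrases.

    `chunks` is a list of (text, label) pairs; "NP" marks a noun phrase
    (hypothetical labels, standing in for real shallow-parser output).
    """
    np_positions = [i for i, (_, label) in enumerate(chunks) if label == "NP"]
    for left, right in zip(np_positions, np_positions[1:]):
        between = [text for text, _ in chunks[left + 1:right]]
        # Toy heuristic (an assumption): keep only short, non-empty gaps.
        if 0 < len(between) <= max_gap:
            yield (chunks[left][0], " ".join(between), chunks[right][0])


def assess(triples, min_count=2):
    """Keep triples seen at least `min_count` times.

    A simple count threshold stands in for the urn-model likelihood.
    """
    counts = Counter(triples)
    return {triple: n for triple, n in counts.items() if n >= min_count}


# Tiny hand-made corpus, for illustration only.
corpus = [
    [("Edison", "NP"), ("invented", "O"), ("the phonograph", "NP")],
    [("Edison", "NP"), ("invented", "O"), ("the phonograph", "NP"),
     ("in", "O"), ("1877", "NP")],
    [("Paris", "NP"), ("is", "O"), ("the capital", "NP")],
]

all_triples = [t for sentence in corpus for t in extract_triples(sentence)]
kept = assess(all_triples)
print(kept)
```

In this toy corpus only the triple that occurs in two sentences survives assessment, which mirrors the idea that redundancy across a large corpus is evidence that an extraction is correct.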

Related papers

This paper compares TEXTRUNNER (Banko et al, IJCAI 2007) with KNOWITALL (Etzioni et al, AI 2005). It was later used for the more general task of open-domain information extraction in Wu and Weld, ACL 2010.