Banko et al IJCAI 2007
From Cohen Courses
Revision as of 15:59, 24 September 2010 by PastStudents
Citation
Etzioni, O., Banko, M., Soderland, S., and Weld, D. S. 2008. Open information extraction from the web. Commun. ACM 51, 12 (Dec. 2008), 68-74.
Online version
Summary
This paper describes an open information extraction system, TextRunner, and evaluates it on a corpus of 9 million web pages. IE systems usually require relations to be predefined before query time; this paper addresses that limitation with an automatic system. It also resolves some issues in a previous system, KnowItAll. There are three important components in TextRunner:
- Self-supervised Learner
  - An unlexicalized parser (Klein & Manning, ACL 2003) was used to generate training data for a Naive Bayes classifier.
- Extraction
  - Sentences are tagged and shallow-parsed using OpenNLP tools. Noun phrases are candidate entities, and each relation is a sequence of words between two entities, selected using heuristics.
- Assessment
  - Only relations with high likelihood are selected. The likelihood was estimated using the urn model proposed in (Downey et al, IJCAI 2005).
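The extraction and assessment steps can be illustrated with a small sketch. This is not TextRunner's code: the chunk labels, the function names `extract_tuples` and `assess`, the `max_gap` and `min_count` parameters, and the "must contain a verb chunk" heuristic are all assumptions made for illustration. In particular, the frequency threshold in `assess` is only a crude stand-in for the urn model, which estimates a probability rather than applying a raw count cutoff. The input is assumed to be already shallow-parsed into (text, label) chunks.

```python
from collections import Counter
from typing import Dict, List, Tuple

Chunk = Tuple[str, str]          # (chunk text, chunk label), e.g. ("Edison", "NP")
Triple = Tuple[str, str, str]    # (entity 1, relation, entity 2)


def extract_tuples(chunks: List[Chunk], max_gap: int = 3) -> List[Triple]:
    """Pair adjacent noun phrases and use the chunks between them as the relation.

    Hypothetical heuristics: the gap must be non-empty, at most `max_gap`
    chunks long, and must contain at least one verb chunk ("VP").
    """
    np_positions = [i for i, (_, label) in enumerate(chunks) if label == "NP"]
    triples = []
    for left, right in zip(np_positions, np_positions[1:]):
        between = chunks[left + 1:right]
        if not between or len(between) > max_gap:
            continue
        if not any(label == "VP" for _, label in between):
            continue
        relation = " ".join(text for text, _ in between)
        triples.append((chunks[left][0], relation, chunks[right][0]))
    return triples


def assess(triples: List[Triple], min_count: int = 2) -> Dict[Triple, int]:
    """Keep triples seen at least `min_count` times (a simplification, not the urn model)."""
    counts = Counter((e1.lower(), rel.lower(), e2.lower()) for e1, rel, e2 in triples)
    return {triple: n for triple, n in counts.items() if n >= min_count}
```

Running `extract_tuples` over every sentence in a corpus and passing the combined list to `assess` keeps only the triples that recur, which mirrors the idea that redundancy across web pages is evidence of correctness.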
Related papers
This paper compares TextRunner with KnowItAll (Etzioni et al, AI 2005). TextRunner was later used in the more general task of open-domain information extraction in Wu and Weld, ACL 2010.