Fader et al EMNLP 2011

Citation

Fader, A., Soderland, S. and Etzioni, O. 2011. Identifying Relations for Open Information Extraction. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing.

Online version

University of Washington

Summary

This paper introduces REVERB, an Open Information Extraction system that outperforms previous state-of-the-art extractors such as TEXTRUNNER and WOE in both precision and recall. Two frequent types of errors in previous systems motivated the new extractor: incoherent extractions (where the extracted relation phrase has no meaningful interpretation) and uninformative extractions (where the extraction omits critical information). To avoid these problems, REVERB articulates two simple constraints on how binary relations are expressed.

A syntactic constraint is proposed to avoid incoherent relation phrases: a valid relation phrase must be either a single verb, a verb followed immediately by a preposition, or a verb followed by nouns, adjectives, or adverbs and ending in a preposition. This constraint also reduces uninformative extractions, but it sometimes matches relation phrases that are so specific that they have very few instances. A lexical constraint is introduced to overcome this limitation: a valid relation phrase should take many distinct arguments in a large corpus. REVERB processes sentences with OpenNLP tools and then extracts the relation phrases that satisfy both constraints.
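The syntactic constraint can be viewed as a regular expression over part-of-speech tags (the paper writes it as V | VP | VW*P). Below is a minimal Python sketch of that idea; the tag groups are simplified assumptions (the verb group here also absorbs auxiliary verbs), the threshold k in the lexical check is only illustrative, and none of this is the authors' implementation.

import re

# Rough rendering of the syntactic constraint V | VP | VW*P over Penn
# Treebank POS tags. The verb, word, and preposition groups are simplified.
VERB = r"VB[DGNPZ]?"
WORD = r"(?:NN[SP]*|JJ[RS]?|RB[RS]?|PRP\$?|DT)"
PREP = r"(?:IN|TO|RP)"
RELATION_PATTERN = re.compile(
    rf"{VERB}(?:\s+{VERB})*(?:(?:\s+{WORD})*\s+{PREP})?"
)

def satisfies_syntactic_constraint(pos_tags):
    """True if the POS-tag sequence of a candidate relation phrase matches."""
    return RELATION_PATTERN.fullmatch(" ".join(pos_tags)) is not None

def satisfies_lexical_constraint(relation, argument_pairs, k=20):
    """Lexical constraint: the normalized relation phrase must take at least
    k distinct (x, y) argument pairs in a large corpus (k is illustrative)."""
    return len(argument_pairs.get(relation, set())) >= k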

Brief description of the extraction algorithm

The new extraction algorithm takes as input a POS-tagged and NP-chunked sentence and returns a set of extraction triples. Given an input sentence s, the algorithm performs two steps:

  • For each verb v in s, find the longest sequence of words rv such that rv starts at v and satisfies both the syntactic and the lexical constraints. If any matches are adjacent or overlap in s, merge them into a single match.
  • For each relation phrase rv, find the nearest noun phrase x to the left of rv such that x is not a relative pronoun, WH-adverb, or existential "there". Then find the nearest noun phrase y to the right of rv. If such a pair is found, return (x, rv, y) as a valid extraction (a sketch of this procedure follows the example below).

For example, the sentence:

Hudson was born in Hampstead, which is a suburb of London.

returns two valid extractions:

e1: <Hudson, was born in, Hampstead>
e2: <Hampstead, is a suburb of, London>
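
A minimal end-to-end sketch of the two-step algorithm, assuming the sentence arrives as (word, POS tag, BIO chunk tag) triples from a tagger and NP-chunker. The lexical constraint is omitted, the merging of overlapping matches is approximated by skipping verbs already covered by an earlier relation phrase, and the excluded-word list is a rough guess; this is an illustration, not REVERB's actual code.

import re

# Same simplified syntactic-constraint pattern as in the earlier sketch.
RELATION_PATTERN = re.compile(
    r"VB[DGNPZ]?(?:\s+VB[DGNPZ]?)*"
    r"(?:(?:\s+(?:NN[SP]*|JJ[RS]?|RB[RS]?|PRP\$?|DT))*\s+(?:IN|TO|RP))?"
)

# Approximate list of words excluded as left arguments (relative pronouns,
# WH-words, existential "there").
SKIP_LEFT = {"which", "who", "whom", "whose", "that", "there"}

def np_chunks(tokens):
    """Collect NP chunk spans [start, end) from BIO chunk tags (B-NP/I-NP/O)."""
    chunks, start = [], None
    for k, (_, _, tag) in enumerate(tokens):
        if tag == "B-NP":
            if start is not None:
                chunks.append((start, k))
            start = k
        elif tag != "I-NP" and start is not None:
            chunks.append((start, k))
            start = None
    if start is not None:
        chunks.append((start, len(tokens)))
    return chunks

def extract_triples(tokens):
    """tokens: list of (word, pos_tag, chunk_tag) for one sentence."""
    chunks = np_chunks(tokens)
    words = lambda span: " ".join(tokens[k][0] for k in range(*span))
    triples, covered_until = [], 0
    for i, (_, pos, _) in enumerate(tokens):
        if i < covered_until or not pos.startswith("VB"):
            continue
        # Step 1: longest span starting at verb i whose POS tags match the pattern.
        end = None
        for j in range(i + 1, len(tokens) + 1):
            if RELATION_PATTERN.fullmatch(" ".join(t[1] for t in tokens[i:j])):
                end = j
        if end is None:
            continue
        covered_until = end  # crude stand-in for merging overlapping matches
        # Step 2: nearest acceptable NP to the left, nearest NP to the right.
        left = [c for c in chunks if c[1] <= i and words(c).lower() not in SKIP_LEFT]
        right = [c for c in chunks if c[0] >= end]
        if left and right:
            triples.append((words(left[-1]), words((i, end)), words(right[0])))
    return triples

Applied to the example sentence above (tagged and chunked by hand), this reproduces e1 and e2:

tokens = [("Hudson", "NNP", "B-NP"), ("was", "VBD", "O"), ("born", "VBN", "O"),
          ("in", "IN", "O"), ("Hampstead", "NNP", "B-NP"), (",", ",", "O"),
          ("which", "WDT", "B-NP"), ("is", "VBZ", "O"), ("a", "DT", "B-NP"),
          ("suburb", "NN", "I-NP"), ("of", "IN", "O"), ("London", "NNP", "B-NP"),
          (".", ".", "O")]
print(extract_triples(tokens))
# [('Hudson', 'was born in', 'Hampstead'), ('Hampstead', 'is a suburb of', 'London')]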

Experimental results

Since REVERB uses a specified (rather than learned) model of relations, it requires labeled data only for assigning confidence scores to its extractions. As a result, it needs roughly two orders of magnitude fewer training examples than previous systems to learn this confidence function: the training set was built by manually labeling the extractions from 1,000 sentences sampled at random from the Web and Wikipedia as correct or incorrect.
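The confidence function is a logistic regression classifier over features of each extraction. The sketch below shows the general shape of training such a scorer with scikit-learn; the features shown are invented placeholders rather than the paper's feature set, and labeled_data is a hypothetical container for the annotations described above.

import numpy as np
from sklearn.linear_model import LogisticRegression

def features(extraction, sentence):
    """Hypothetical indicator features of an (x, relation, y) extraction."""
    x, rel, y = extraction
    return np.array([
        sentence.startswith(x),                        # x begins the sentence
        len(rel.split()) <= 3,                         # short relation phrase
        rel.split()[-1] in {"in", "of", "on", "to"},   # ends in a common preposition
        sentence.rstrip(".").endswith(y),              # y ends the sentence
    ], dtype=float)

def train_confidence_function(labeled_data):
    """labeled_data: iterable of ((x, rel, y), sentence, is_correct) built from
    the manually annotated extractions described above (hypothetical format)."""
    X = np.array([features(e, s) for e, s, _ in labeled_data])
    y = np.array([int(label) for _, _, label in labeled_data])
    clf = LogisticRegression().fit(X, y)
    # Confidence of a new extraction = predicted probability that it is correct.
    return lambda e, s: clf.predict_proba(features(e, s).reshape(1, -1))[0, 1]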

Five systems were compared against REVERB: REVERB-lex (a version of REVERB without the lexical constraint), TEXTRUNNER, TEXTRUNNER-R (TEXTRUNNER retrained on REVERB extractions), WOE-pos (the CRF version of WOE), and WOE-parse (the dependency-path version). For testing, a set of 500 sentences was sampled from the Web, and every extraction from every system was evaluated independently by two human judges. The following graph shows the AUC (area under the precision-recall curve) for the 86% of the data on which the judges agreed:

[Image: ResultsGraph.png — AUC of each system]
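
As a point of reference, the area under a precision-recall curve can be computed from confidence-ranked, human-judged extractions roughly as follows; this is a generic illustration, not the paper's evaluation code.

from sklearn.metrics import auc, precision_recall_curve

def pr_auc(confidences, labels):
    """AUC of the precision-recall curve for extractions ranked by confidence;
    labels are the human judgments (1 = correct, 0 = incorrect)."""
    precision, recall, _ = precision_recall_curve(labels, confidences)
    return auc(recall, precision)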

Related papers

REVERB is compared against two other open IE systems: TEXTRUNNER, described in Banko et al IJCAI 2007, and WOE (-pos and -parse), presented in Wu and Weld ACL 2010.

Comment

They've released the software and some outputs from the system. It is neat. --Brendan 18:45, 13 October 2011 (UTC)