Difference between revisions of "Fader et al EMNLP 2011"

Latest revision as of 13:45, 13 October 2011

Citation

Fader, A., Soderland, S. and Etzioni, O. 2011. Identifying Relations for Open Information Extraction. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing.

Online version

University of Washington

Summary

This paper introduces REVERB, an Open Information Extraction system that outperforms previous state-of-the-art extractors such as TEXTRUNNER and WOE in both precision and recall, more than doubling the area under the precision-recall curve. Two frequent types of error in previous systems motivated the new extractor: incoherent extractions (where the extracted relation phrase has no meaningful interpretation) and uninformative extractions (where the extraction omits critical information). To avoid these problems, REVERB imposes two simple constraints on how binary relations are expressed.

A syntactic constraint is proposed to avoid incoherent relation phrases: a valid relation phrase must be either a verb, a verb followed by a preposition, or a verb followed by nouns, adjectives or adverbs and ending in a preposition. This constraint also reduces uninformative extractions, but it sometimes matches relation phrases that are so specific that they occur in only a few instances. A lexical constraint is introduced to overcome this limitation: a valid relation phrase must take many distinct arguments in a large corpus. REVERB processes sentences with OpenNLP tools and then extracts the relation phrases that satisfy both constraints.
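The two constraints can be illustrated with a minimal Python sketch. It is not the REVERB implementation: the function names, the mapping from Penn Treebank tags to one-letter classes, the decision to let determiners and pronouns count as filler words, and the min_distinct_args threshold are all assumptions made for illustration.

```python
import re
from collections import defaultdict


def tag_class(tag: str) -> str:
    """Map a Penn Treebank POS tag to a one-letter class (assumed mapping)."""
    if tag.startswith("VB"):
        return "V"  # verb
    if tag in ("IN", "TO", "RP"):
        return "P"  # preposition, infinitive marker or particle
    if tag.startswith(("NN", "JJ", "RB")) or tag in ("DT", "PRP", "PRP$"):
        return "W"  # noun, adjective, adverb (plus determiner/pronoun, assumed)
    return "O"      # anything else


# One or more adjacent units, each a verb optionally followed by
# filler words and a closing preposition: V | V P | V W* P.
RELATION_PHRASE = re.compile(r"(V(W*P)?)+")


def satisfies_syntactic(pos_tags):
    """True if the whole tag sequence has the shape of a relation phrase."""
    classes = "".join(tag_class(t) for t in pos_tags)
    return RELATION_PHRASE.fullmatch(classes) is not None


def lexically_valid(triples, min_distinct_args):
    """Return the relation phrases seen with enough distinct (x, y) pairs.

    `triples` is an iterable of (x, relation, y) tuples from a corpus;
    `min_distinct_args` is a hypothetical threshold.
    """
    args_by_relation = defaultdict(set)
    for x, relation, y in triples:
        args_by_relation[relation].add((x, y))
    return {r for r, args in args_by_relation.items()
            if len(args) >= min_distinct_args}
```

With this mapping, "was born in" (VBD VBN IN) and "is a suburb of" (VBZ DT NN IN) pass the syntactic check, while "a suburb of" fails because it does not start with a verb.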

Brief description of the extraction algorithm

The new extraction algorithm takes as input a POS-tagged and NP-chunked sentence and returns a set of $\langle x,r,y\rangle$ extraction triples. Given an input sentence s, the algorithm performs two steps:

For each verb v in s, find the longest sequence of words r_v such that r_v starts at v and r_v satisfies both the syntactic and the lexical constraints. If any two matches are adjacent or overlap in s, merge them into a single match.
For each relation phrase r_v, find the nearest noun phrase x to the left, such that x is not a relative pronoun, WHO-pronoun or existential. Then, find the nearest noun phrase y to the right. For every $\langle x,y\rangle$ pair found, return $\langle x,r,y\rangle$ as a valid extraction.

For example, the sentence:

Hudson was born in Hampstead, which is a suburb of London.

returns two valid extractions:

e1: <Hudson, was born in, Hampstead>
e2: <Hampstead, is a suburb of, London>
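The two steps can be traced on this example with a short Python sketch. It is a simplified illustration, not the released REVERB code: the POS and chunk tags are written by hand rather than produced by OpenNLP, the tag classes and the list of excluded pronoun tags are assumptions, and the lexical constraint is left out for brevity.

```python
import re

# Hand-tagged input: (word, POS tag, BIO noun-phrase chunk tag).
SENTENCE = [
    ("Hudson", "NNP", "B-NP"), ("was", "VBD", "O"), ("born", "VBN", "O"),
    ("in", "IN", "O"), ("Hampstead", "NNP", "B-NP"), (",", ",", "O"),
    ("which", "WDT", "B-NP"), ("is", "VBZ", "O"), ("a", "DT", "B-NP"),
    ("suburb", "NN", "I-NP"), ("of", "IN", "O"), ("London", "NNP", "B-NP"),
    (".", ".", "O"),
]

UNIT = re.compile(r"V(W*P)?")             # V | V P | V W* P
EXCLUDED = {"WDT", "WP", "WP$", "EX"}     # assumed tags for wh-words, "there"


def tag_class(tag):
    """Assumed mapping from Penn Treebank tags to pattern classes."""
    if tag.startswith("VB"):
        return "V"
    if tag in ("IN", "TO", "RP"):
        return "P"
    if tag.startswith(("NN", "JJ", "RB")) or tag in ("DT", "PRP", "PRP$"):
        return "W"
    return "O"


def np_spans(tokens):
    """Collect (start, end) token spans of noun phrases from BIO tags."""
    spans, start = [], None
    for i, (_, _, chunk) in enumerate(tokens):
        if chunk == "B-NP":
            if start is not None:
                spans.append((start, i))
            start = i
        elif chunk != "I-NP" and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(tokens)))
    return spans


def extract(tokens):
    classes = "".join(tag_class(pos) for _, pos, _ in tokens)
    # Step 1: longest match starting at each verb, then merge matches
    # that are adjacent or overlap.
    merged = []
    for i, cls in enumerate(classes):
        if cls != "V":
            continue
        end = UNIT.match(classes, i).end()
        if merged and i <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(end, merged[-1][1]))
        else:
            merged.append((i, end))
    # Step 2: nearest acceptable noun phrase on each side.
    nps = np_spans(tokens)
    words = lambda s, e: " ".join(w for w, _, _ in tokens[s:e])
    triples = []
    for start, end in merged:
        left = [sp for sp in nps
                if sp[1] <= start and tokens[sp[0]][1] not in EXCLUDED]
        right = [sp for sp in nps if sp[0] >= end]
        if left and right:
            triples.append((words(*left[-1]), words(start, end),
                            words(*right[0])))
    return triples
```

Calling extract(SENTENCE) returns the triples e1 and e2 above. The separate matches "was" and "born in" are merged because they are adjacent, and the relative pronoun "which" is skipped when the left argument of "is a suburb of" is chosen, so "Hampstead" is selected instead.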

Experimental results

Since REVERB uses a specified model of relations for extraction, it requires labeled data only for assigning confidence scores to its extractions. Learning this confidence function therefore takes two orders of magnitude fewer training examples than TEXTRUNNER or WOE use. The training set was created by manually labeling, as correct or incorrect, the extractions from 1,000 random sentences drawn from the Web and Wikipedia.

REVERB was compared against five other systems: REVERB-lex (a version of REVERB without the lexical constraint), TEXTRUNNER, TEXTRUNNER-R (a modification of TEXTRUNNER retrained on REVERB extractions), WOE-pos (the CRF version of WOE) and WOE-parse (the dependency path version). For testing, a set of 500 sentences was sampled from the Web, and every extraction from every system was evaluated independently by two human judges. The following graph shows the AUC (area under the precision-recall curve) results on the 86% of the data where the judges agreed:

[Figure: ResultsGraph.png]
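As a rough illustration of how such an AUC figure can be computed from judged extractions, the sketch below keeps only the extractions on which both judges agree, ranks them by confidence, and integrates precision over recall with the trapezoid rule. The function names, the input layout and the choice of trapezoidal integration are assumptions; the page does not specify how the authors computed the area.

```python
def keep_agreed(judged):
    """Keep (confidence, label) for extractions where both judges agree.

    `judged` holds (confidence, judge_a_label, judge_b_label) tuples,
    with boolean labels (True = correct).
    """
    return [(conf, a) for conf, a, b in judged if a == b]


def pr_points(scored, total_correct):
    """Precision-recall points obtained by lowering the confidence cutoff.

    `total_correct` is the number of correct extractions used as the
    recall denominator (supplied by the caller).
    """
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    points, hits = [], 0
    for k, (_, correct) in enumerate(ranked, start=1):
        hits += int(correct)
        points.append((hits / total_correct, hits / k))  # (recall, precision)
    return points


def area_under_curve(points):
    """Trapezoidal area under a list of (recall, precision) points."""
    area = 0.0
    for (r0, p0), (r1, p1) in zip(points, points[1:]):
        area += (r1 - r0) * (p0 + p1) / 2
    return area
```

For example, four agreed extractions with confidences 0.9, 0.8, 0.4 and 0.3, of which the third is incorrect, give an area of 41/96 (about 0.427) when total_correct is 4.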

Related papers

REVERB is compared against two other open IE systems: TEXTRUNNER, described in Banko et al IJCAI 2007, and WOE (-pos and -parse), presented in Wu and Weld ACL 2010.

Comment

They've released the software and some outputs from the system. It is neat. --Brendan 18:45, 13 October 2011 (UTC)

