Difference between revisions of "Fader et al EMNLP 2011"
Latest revision as of 13:45, 13 October 2011
Citation
Fader, A., Soderland, S. and Etzioni, O. 2011. Identifying Relations for Open Information Extraction. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing.
Online version
Summary
This paper introduces REVERB, an Open Information Extraction system that outperforms previous state-of-the-art extractors such as TEXTRUNNER and WOE in both precision and recall. Two frequent types of errors in previous systems motivated the new extractor: incoherent extractions (where the extracted relation phrase has no meaningful interpretation) and uninformative extractions (where the extraction omits critical information). To avoid these problems, REVERB articulates two simple constraints on how binary relations are expressed.
A syntactic constraint is proposed to avoid incoherent relation phrases: a valid relation phrase should be either a verb, a verb followed by a preposition, or a verb followed by nouns, adjectives or adverbs and ending in a preposition. This constraint also reduces uninformative extractions, but it sometimes matches relation phrases that are too specific and therefore have very few instances. A lexical constraint is introduced to overcome this limitation: a valid relation phrase should take many distinct arguments in a large corpus. REVERB processes sentences using OpenNLP tools and then extracts the relation phrases that satisfy both constraints.
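The two constraints can be illustrated with a minimal Python sketch. It is not REVERB's implementation: the mapping from Penn Treebank tags to one-letter classes, the inclusion of determiners (DT) among the words allowed between verb and preposition, and the names `tag_class`, `longest_match_end`, `satisfies_lexical_constraint` and `min_distinct` are all assumptions made for illustration.

```python
import re

def tag_class(tag):
    """Map a Penn Treebank POS tag to a one-letter class (simplifying assumption)."""
    if tag.startswith("VB"):
        return "V"  # verb
    if tag.startswith(("NN", "JJ", "RB")) or tag == "DT":
        return "W"  # noun, adjective, adverb; DT added as an assumption
    if tag == "IN":
        return "P"  # preposition
    return "O"      # anything else

# Syntactic constraint: verb | verb + preposition | verb + W* + preposition
RELATION_PATTERN = re.compile(r"V(?:W*P)?")

def longest_match_end(tags, start):
    """Exclusive end index of the longest match starting at `start`, or None."""
    classes = "".join(tag_class(t) for t in tags)
    match = RELATION_PATTERN.match(classes, start)
    return match.end() if match else None

def satisfies_lexical_constraint(relation, argument_pairs, min_distinct):
    """Lexical constraint: the phrase must occur with enough distinct argument pairs.

    `argument_pairs` maps a relation phrase to the (x, y) pairs observed with it
    in a large corpus; `min_distinct` is a hypothetical threshold.
    """
    return len(set(argument_pairs.get(relation, ()))) >= min_distinct

# "is a suburb of London" tagged VBZ DT NN IN NNP: the match covers "is a suburb of".
print(longest_match_end(["VBZ", "DT", "NN", "IN", "NNP"], 0))
```

Encoding each tag as a single character lets the constraint be written as an ordinary regular expression, whose greedy matching yields the longest phrase starting at a given verb.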
Brief description of the extraction algorithm
The extraction algorithm takes as input a POS-tagged and NP-chunked sentence and returns a set of (x, r, y) extraction triples. Given an input sentence s, the algorithm performs two steps:
- For each verb v in s, find the longest sequence of words r_v such that r_v starts at v and satisfies both the syntactic and the lexical constraints. If any pair of matches is adjacent or overlaps in s, merge them into a single match.
- For each relation phrase r_v, find the nearest noun phrase x to its left such that x is not a relative pronoun, WHO-pronoun or existential. Then find the nearest noun phrase y to its right. For every (x, y) pair found, return (x, r_v, y) as a valid extraction.
For example, the sentence:
Hudson was born in Hampstead, which is a suburb of London.
returns two valid extractions:
e1: <Hudson, was born in, Hampstead>
e2: <Hampstead, is a suburb of, London>
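The two steps can be sketched in Python on the example sentence. This is an illustrative sketch under stated assumptions, not the released REVERB code: the POS and chunk tags are written by hand rather than produced by OpenNLP, determiners are allowed inside relation phrases, the lexical constraint is reduced to a caller-supplied predicate `lexical_ok` applied to the merged phrase, and all function names are hypothetical.

```python
import re

def tag_class(tag):
    """One-letter class for a Penn Treebank tag (simplifying assumption)."""
    if tag.startswith("VB"):
        return "V"
    if tag.startswith(("NN", "JJ", "RB")) or tag == "DT":
        return "W"
    return "P" if tag == "IN" else "O"

RELATION_PATTERN = re.compile(r"V(?:W*P)?")
# Relative pronouns, WHO-pronouns and existential "there" (assumed tag set).
SKIPPED_TAGS = {"WDT", "WP", "WP$", "EX"}

def noun_phrases(chunks):
    """Return (start, end) spans of NP chunks from BIO chunk tags."""
    spans, start = [], None
    for i, chunk in enumerate(chunks + ["O"]):
        if start is not None and chunk != "I-NP":
            spans.append((start, i))
            start = None
        if chunk == "B-NP":
            start = i
    return spans

def relation_spans(tags):
    """Step 1: longest match from each verb, merging adjacent or overlapping ones."""
    classes = "".join(tag_class(t) for t in tags)
    spans = []
    for i, cls in enumerate(classes):
        if cls != "V":
            continue
        match = RELATION_PATTERN.match(classes, i)
        if spans and match.start() <= spans[-1][1]:
            spans[-1] = (spans[-1][0], max(spans[-1][1], match.end()))
        else:
            spans.append((match.start(), match.end()))
    return spans

def extract(words, tags, chunks, lexical_ok=lambda phrase: True):
    """Step 2: attach the nearest valid noun phrase on each side of a relation."""
    nps = noun_phrases(chunks)
    triples = []
    for start, end in relation_spans(tags):
        phrase = " ".join(words[start:end])
        if not lexical_ok(phrase):
            continue
        left = [(s, e) for s, e in nps
                if e <= start and not (e - s == 1 and tags[s] in SKIPPED_TAGS)]
        right = [(s, e) for s, e in nps if s >= end]
        if left and right:
            (xs, xe), (ys, ye) = left[-1], right[0]
            triples.append((" ".join(words[xs:xe]), phrase, " ".join(words[ys:ye])))
    return triples

words = "Hudson was born in Hampstead , which is a suburb of London .".split()
tags = ["NNP", "VBD", "VBN", "IN", "NNP", ",", "WDT", "VBZ", "DT", "NN", "IN", "NNP", "."]
chunks = ["B-NP", "O", "O", "O", "B-NP", "O", "B-NP", "O", "B-NP", "I-NP", "O", "B-NP", "O"]
print(extract(words, tags, chunks))
```

In this sketch "was" and "born in" are matched separately and merged because they are adjacent, and the relative pronoun "which" is skipped when looking left from "is a suburb of", so "Hampstead" becomes the first argument of the second triple.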
Experimental results
Since REVERB uses a specified model of relations for extraction, it requires labeled data only for assigning confidence scores to its results. As a result, it needs two orders of magnitude fewer training examples than earlier extractors, and these are used only to learn the confidence function: the training set was created by manually labeling, as correct or incorrect, the extractions from 1,000 random sentences drawn from the Web and Wikipedia.
Five different systems were compared against REVERB: REVERB-lex (a version of REVERB without the lexical constraint), TEXTRUNNER, TEXTRUNNER-R (a modification of TEXTRUNNER retrained on REVERB extractions), WOE-pos (the CRF version of WOE) and WOE-parse (the dependency path version). For testing, a set of 500 sentences was sampled from the Web, and each extraction from each system was evaluated independently by two human judges. The following graph shows the AUC (area under the precision-recall curve) of each system on the 86% of the data where the judges agreed:
[Figure: ResultsGraph.png]
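As a rough illustration of the AUC metric, the sketch below ranks extractions by confidence, computes a precision-recall point at each cutoff, and integrates with the trapezoid rule. It is not the authors' evaluation script: the function names, the toy scores, and the choice to pass the recall denominator in as `total_correct` are assumptions.

```python
def pr_curve(scored, total_correct):
    """Return (recall, precision) points, one per confidence cutoff.

    `scored` is a list of (confidence, is_correct) pairs; `total_correct` is
    the number of correct extractions used as the recall denominator.
    """
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    points, correct = [], 0
    for k, (_, is_correct) in enumerate(ranked, start=1):
        correct += is_correct
        points.append((correct / total_correct, correct / k))
    return points

def area_under_curve(points):
    """Trapezoid-rule area under a list of (recall, precision) points."""
    area = 0.0
    for (r0, p0), (r1, p1) in zip(points, points[1:]):
        area += (r1 - r0) * (p0 + p1) / 2
    return area

# Toy data: four judged extractions, with four correct extractions in total.
scored = [(0.9, True), (0.8, True), (0.4, False), (0.3, True)]
points = pr_curve(scored, total_correct=4)
print(points)
print(area_under_curve(points))
```

A system that ranks its correct extractions above its incorrect ones keeps precision high as recall grows, which yields a larger area.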
Related papers
REVERB is compared against two other open IE systems: TEXTRUNNER, described in Banko et al IJCAI 2007, and WOE (-pos and -parse), presented in Wu and Weld ACL 2010.
Comment
They've released the software and some outputs from the system. It is neat. --Brendan 18:45, 13 October 2011 (UTC)