Hoffmann et al., ACL 2010
Citation
Raphael Hoffmann, Congle Zhang, and Daniel S. Weld. 2010. Learning 5000 relational extractors. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (ACL '10).
Online version
Summary
This paper introduces LUCHS, a self-supervised, relation-specific IE system capable of learning more than 5,000 relations with an average F1 score of 61%. The system applies dynamic lexicon feature learning as a semi-supervised solution to cope with sparse training data.
System Architecture
The following figure summarizes the architecture of LUCHS.

(Figure: LUCHSArchitecture.png)
- A Schema Classifier is trained on Wikipedia pages containing infoboxes; it decides which schema should be applied to an article that lacks an infobox;
- Training data is generated heuristically by the Matcher (e.g., the Wikipedia article "Jerry Seinfeld" contains the sentence "Seinfeld was born in Brooklyn, New York.", and the infobox on the same page contains the attribute-value pair "birth_place = Brooklyn"); the Matcher heuristically generates training data for the extractors of the different relations, although the paper does not describe this step in detail (a sketch of one such matching heuristic appears after this list);
- The CRF Learner trains a linear-chain CRF on the training data generated by the Matcher; the trained model is used as an Extractor to pull structured information out of free text;
- As the major contribution of the paper, a Lexicon Learner learns lexicons from HTML lists crawled from the Web and contributes lexicon features to the CRF Learner. This step enables the system to work with "sparse relations", i.e., to extract structured information in a semi-supervised way.
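Since the paper does not describe the Matcher in detail, the following is only a minimal sketch, assuming a simple exact-match heuristic: a sentence becomes a positive training example for a relation if it contains the infobox value as a whole word, and sentences matching no attribute become negative examples. All function and variable names here are illustrative, not from the paper.

<pre>
import re

def match_training_data(sentences, infobox):
    """Heuristically label sentences against infobox attribute values (sketch).

    sentences: list of article sentences.
    infobox:   dict mapping attribute name -> value string,
               e.g. {"birth_place": "Brooklyn"}.
    Returns (positive, negative): positive is a list of
    (sentence, attribute, value) triples; negative collects
    sentences that matched no attribute value.
    """
    positive, negative = [], []
    for sent in sentences:
        matched = False
        for attr, value in infobox.items():
            # Whole-word match of the infobox value inside the sentence.
            if re.search(r"\b" + re.escape(value) + r"\b", sent):
                positive.append((sent, attr, value))
                matched = True
        if not matched:
            negative.append(sent)
    return positive, negative

# The example from the article text:
sents = ["Seinfeld was born in Brooklyn, New York."]
box = {"birth_place": "Brooklyn"}
print(match_training_data(sents, box))
# -> ([('Seinfeld was born in Brooklyn, New York.', 'birth_place', 'Brooklyn')], [])
</pre>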
Brief description of the method
Schema Classifier
The system uses a linear, multi-class classifier with six kinds of features:
- Words in the article title;
- Words in the first sentence;
- Words in the first sentence that are direct objects of the verb 'to be' (e.g., "comedian" in "Jerry Seinfeld is an American comedian");
- Article section headers;
- Wikipedia categories;
- Ancestor categories.
The Voted Perceptron is used to train this linear classifier.
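As a minimal sketch of such training, here is a multi-class perceptron with weight averaging, used as a simpler stand-in for the voted variant (the two behave similarly in practice). The feature-name scheme and schema labels are assumptions for illustration.

<pre>
from collections import defaultdict

def train_averaged_perceptron(examples, schemas, epochs=5):
    """Multi-class averaged perceptron over sparse Boolean features (sketch).

    examples: list of (features, schema) pairs; features is a set of
    Boolean feature names such as "title_word:Seinfeld".
    Returns averaged weights as {schema: {feature: weight}}.
    """
    w = {s: defaultdict(float) for s in schemas}    # current weights
    avg = {s: defaultdict(float) for s in schemas}  # running sum for averaging

    def score(s, feats):
        return sum(w[s][f] for f in feats)

    for _ in range(epochs):
        for feats, gold in examples:
            pred = max(schemas, key=lambda s: score(s, feats))
            if pred != gold:
                # Standard perceptron update: reward gold, punish prediction.
                for f in feats:
                    w[gold][f] += 1.0
                    w[pred][f] -= 1.0
            # Accumulate current weights; dividing by the number of steps
            # at the end would give the averaged model.
            for s in schemas:
                for f, v in w[s].items():
                    avg[s][f] += v
    return avg
</pre>

At prediction time, an article is assigned the schema with the highest averaged score over its active features.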
Extractor
The extractor uses a linear-chain Conditional Random Field (CRF), introduced by Lafferty et al., 2001:

<math>p(y|x)=\frac{1}{Z(x)}\exp\sum_{t=1}^{T}\sum_{k=1}^{K}\lambda_k f_k(y_{t-1},y_t,x,t)</math>

where <math>T</math> is the length of the sequence, <math>K</math> is the number of feature functions, the feature functions <math>f_k</math> encode statistics of the pair <math>(x,y)</math>, and the <math>\lambda_k</math> are the feature weights.
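To make the formula concrete, here is a toy brute-force computation of <math>p(y|x)</math> that enumerates every label sequence to obtain <math>Z(x)</math>; real implementations use dynamic programming (forward-backward) instead. The tag set, feature functions, and weights below are all illustrative, not from the paper.

<pre>
import itertools, math

LABELS = ["O", "VALUE"]  # illustrative tag set

def features(y_prev, y_t, x, t):
    """Toy Boolean feature functions f_k(y_{t-1}, y_t, x, t)."""
    return {
        "cap_and_VALUE":    x[t][0].isupper() and y_t == "VALUE",
        "word_born_and_O":  x[t] == "born" and y_t == "O",
        "trans_O_to_VALUE": y_prev == "O" and y_t == "VALUE",
    }

WEIGHTS = {"cap_and_VALUE": 1.2, "word_born_and_O": 0.5, "trans_O_to_VALUE": 0.7}

def unnormalized(y, x):
    """exp of sum over t and k of lambda_k * f_k(y_{t-1}, y_t, x, t)."""
    total, y_prev = 0.0, "START"
    for t, y_t in enumerate(y):
        for name, fired in features(y_prev, y_t, x, t).items():
            if fired:
                total += WEIGHTS[name]
        y_prev = y_t
    return math.exp(total)

def p(y, x):
    # Z(x) sums the unnormalized scores of every possible label sequence.
    Z = sum(unnormalized(list(yp), x)
            for yp in itertools.product(LABELS, repeat=len(x)))
    return unnormalized(y, x) / Z

x = ["Seinfeld", "was", "born", "in", "Brooklyn"]
print(p(["VALUE", "O", "O", "O", "VALUE"], x))
</pre>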
In the system, the parameters <math>\lambda_k</math> are trained with the Voted Perceptron algorithm. Nine kinds of Boolean features are involved in training (a sketch of a few of them follows the list):
- Words;
- State Transitions;
- Word Contextualization;
- Capitalization;
- Digits;
- Dependencies;
- First Sentence;
- Gaussians;
- Lexicons.
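The exact feature templates are not spelled out in this summary, so the following sketch of how a few of these Boolean feature families might be instantiated per token is an assumption; all feature names are illustrative.

<pre>
def token_features(tokens, t, lexicons):
    """Generate Boolean features for the token at position t (sketch).

    lexicons: dict mapping lexicon name -> set of phrases, e.g.
    {"us_cities": {"Brooklyn", "Seattle"}}.
    """
    tok = tokens[t]
    feats = set()
    feats.add("word=" + tok.lower())                   # Words
    if tok[0].isupper():
        feats.add("is_capitalized")                    # Capitalization
    if any(c.isdigit() for c in tok):
        feats.add("has_digit")                         # Digits
    if t > 0:
        feats.add("prev_word=" + tokens[t - 1].lower())  # Word Contextualization
    else:
        feats.add("first_token")
    for name, entries in lexicons.items():             # Lexicons
        if tok in entries:
            feats.add("in_lexicon:" + name)
    return feats

# Contains: word=brooklyn, is_capitalized, prev_word=in, in_lexicon:us_cities
print(token_features(["born", "in", "Brooklyn"], 2, {"us_cities": {"Brooklyn"}}))
</pre>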
Extraction with Lexicons

As described in the architecture above, the Lexicon Learner acquires lexicons from HTML lists crawled from the Web, and membership of a token in a learned lexicon is exposed to the CRF as an additional Boolean feature. These lexicon features are what allow extractors for sparse relations, which have only a handful of heuristically matched training sentences, to generalize beyond the exact values seen in training.
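The paper's precise lexicon-learning procedure is not reproduced here, so the following is only a toy sketch under an assumed overlap heuristic: rank the crawled lists by how many seed values they share with the relation's training data, then merge the best lists into a lexicon. All names and thresholds are illustrative.

<pre>
def learn_lexicon(seed_values, html_lists, min_overlap=2, top_k=10):
    """Expand a seed set into a lexicon using crawled HTML lists (sketch).

    seed_values: set of known attribute values from the Matcher's
                 training data, e.g. {"Brooklyn", "Seattle"}.
    html_lists:  list of sets, one per crawled HTML list.
    Lists sharing at least min_overlap items with the seeds are
    ranked by overlap, and the top_k are merged into the lexicon.
    """
    scored = [(len(lst & seed_values), lst) for lst in html_lists]
    scored = [(s, lst) for s, lst in scored if s >= min_overlap]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    lexicon = set(seed_values)
    for _, lst in scored[:top_k]:
        lexicon |= lst
    return lexicon
</pre>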
Experimental Results
Dataset
The October 2008 English Wikipedia dump was used:

- 1,583 schemata, each with at least 10 instances (wiki pages);
- 981,387 articles;
- 5,025 attributes (relations).
Overall Extraction Performance
They reported a precision of 0.55 at a recall of 0.68, giving an F1 score of 0.61.
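As a sanity check, these numbers are consistent under the usual harmonic-mean definition of F1:

<math>F_1 = \frac{2PR}{P+R} = \frac{2 \times 0.55 \times 0.68}{0.55 + 0.68} \approx 0.61</math>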
Related papers
This paper is follow-up research to the KYLIN IE system and to other work conducted at the University of Washington: Weld et al., SIGMOD 2009; Wu and Weld, ACL 2010; and Wu and Weld, WWW 2008.
They use DBpedia as the training dataset. The iPopulator paper describes similar research.