Garera 2009 Structural, Transitive and Latent Models for Biographic Fact Extraction
Revision as of 18:05, 31 October 2010
Citation
Garera, N. and Yarowsky, D. 2009. Structural, Transitive and Latent Models for Biographic Fact Extraction. In Proceedings of EACL.
Online version
An online version of this paper is available at [1].
Summary
This paper introduces several approaches for extracting biographical information from Wikipedia text. Each approach is inspired by an empirical study of the corpus. Collectively they perform significantly better than traditional methods, and the combined method is among the current state-of-the-art methods.
Key Contributions
The paper presents a series of intuitive hypotheses with supporting empirical findings, which motivate the proposed approaches. The paper also suggests using NNDB in parallel with Wikipedia text for training and testing.
- Several intuitive hypotheses are presented, for example that position information, topic information and correlations between attributes are useful for extraction. These hypotheses are backed by an empirical study of the corpus, which shows skewed distributions. They in turn motivate the position-based model, the latent-model-based model, the correlation-based filter, etc.
- The paper focuses on the problem of extracting attributes from Wikipedia text. However, the Wikipedia infobox is usually incomplete, so NNDB is proposed as the gold standard for attribute extraction. Although the authors do not say so explicitly, any clean, state-of-the-art person-attribute database would be equally well suited for this purpose.
Approaches and Empirical Findings
The authors present a set of six approaches, each exploiting a different source of information for attribute extraction. They claim that although each approach has its own focus and its own most applicable attributes, all of them can be applied to every attribute, and that the combined back-off model performs best in the experiments.
- "Improved Contextual Pattern-based Model"
- For Phone Number Extraction, the authors propose a two-phase approach. In the first phase, the algorithm uses a hand-crafted grammar to propose candidate phone numbers and converts them into a numeric representation. In the second phase, the algorithm uses a binary classifier to judge the validity of every phone number candidate. In the performance comparison, this rather simple method proves highly effective, and the authors report a 10% improvement in F-measure over the previous method.
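The two-phase idea described for phone number extraction can be sketched roughly as follows. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the regular expression `CANDIDATE_RE` is a hypothetical stand-in for the hand-crafted grammar (it only covers common US-style formats), and `toy_validity_classifier` is a hypothetical rule that stands in for the trained binary classifier, whose features are not described here.

```python
import re
from typing import Callable, List

# ASSUMPTION: a small regex standing in for the hand-crafted grammar.
# It accepts an optional "+1"/"1" country code followed by 3-3-4 digit groups.
CANDIDATE_RE = re.compile(
    r"(?<!\d)(?:\+?1[\s.-]?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}(?!\d)"
)


def propose_candidates(text: str) -> List[str]:
    """Phase 1: find candidate spans and normalize each one to digits only."""
    return [re.sub(r"\D", "", m.group(0)) for m in CANDIDATE_RE.finditer(text)]


def toy_validity_classifier(digits: str) -> bool:
    """Phase 2 stand-in (hypothetical): a real system would use a trained
    binary classifier; this rule only rejects implausible area codes."""
    national = digits[-10:]  # drop an optional leading country code
    return len(digits) in (10, 11) and national[0] not in "01"


def extract_phone_numbers(
    text: str,
    is_valid: Callable[[str], bool] = toy_validity_classifier,
) -> List[str]:
    """Run both phases: propose normalized candidates, then filter them."""
    return [c for c in propose_candidates(text) if is_valid(c)]


if __name__ == "__main__":
    sample = "Call (412) 555-1234 or 012-345-6789 today."
    print(extract_phone_numbers(sample))
```

Keeping the validity check as a pluggable `is_valid` callable mirrors the split between candidate proposal and classification, so a learned classifier could replace the toy rule without touching the first phase.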
Related papers
The Huang et al., 2001 paper discusses a very similar problem from a more traditional perspective. It studies three approaches: hand-crafted rules, grammatical inference of subsequential transducers, and a log-linear classifier with bigram and trigram features, the last of which is essentially the same as the one in the Ratnaparkhi, 1996 paper on Maxent POS tagging.