Garera 2009 Structural, Transitive and Latent Models for Biographic Fact Extraction
Garera, N. and Yarowsky, D. 2009. Structural, Transitive and Latent Models for Biographic Fact Extraction. In Proceedings of EACL.
An online version of this paper is available at .
This paper introduces several approaches for extracting biographical information from Wikipedia text, each motivated by an empirical study of the corpus. Collectively, they perform significantly better than traditional methods, and the combined method is among the current state-of-the-art methods.
The paper presents a series of intuitive hypotheses with supporting empirical findings, which motivate the proposed approaches. It also suggests using the NNDB dataset in parallel with Wikipedia text for training and testing.
- Several intuitive hypotheses are presented, for example that position information, topic information, and correlations between attributes are useful for extraction. These hypotheses are backed by an empirical study of the corpus, which reveals skewed distributions. They in turn motivate the position-based model, the latent-model-based approach, the correlation-based filter, and so on.
- The paper focuses on the problem of extracting attributes from Wikipedia text. Because Wikipedia infoboxes are often incomplete, NNDB is proposed as the gold standard for attribute extraction. Although the authors do not say so explicitly, any clean, high-quality person-attribute database would be equally well suited for this purpose.
Approaches and Empirical Findings
The authors present a set of six approaches, each exploiting a different kind of information for attribute extraction. They claim that although each approach has its own focus and its own best-suited attributes, all of them can be applied to every attribute, and the combined back-off model performs best in the experiments.
- Improved Contextual Pattern-based Model
The authors first refer to the contextual pattern-based model proposed in Ravichandran and Hovy (2002). This is essentially a probabilistic model with template-based features, where the features are derived from frequently occurring contexts. For example, in "<Name> 'was born in' <Birthplace>", <Name> and <Birthplace> are placeholders and 'was born in' is the relevant context. The authors note that such fully anchored templates cope poorly with variability in the text, and propose a variant that uses partially untethered templates as context. In their experiments, this method yields a significant accuracy gain of 21%.
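The difference between a fully tethered and a partially untethered template can be sketched with regular expressions. This is only an illustration under stated assumptions: the two patterns, the function name `extract_birthplace`, and the example sentence are hypothetical, and the paper learns weighted templates from data rather than using hand-written ones.

```python
import re

# Hypothetical hand-written patterns, for illustration only.
# Tethered: both the name slot and the value slot must be adjacent to the context.
TETHERED = re.compile(
    r"(?P<name>[A-Z]\w+(?: [A-Z]\w+)*) was born in (?P<value>[A-Z]\w+)"
)
# Partially untethered: only the context and the value slot are required,
# so the pattern still fires when the name is not adjacent to the context.
UNTETHERED = re.compile(r"\bwas born in (?P<value>[A-Z]\w+)")


def extract_birthplace(sentence, pattern):
    """Return the value captured by `pattern`, or None if it does not match."""
    match = pattern.search(sentence)
    return match.group("value") if match else None


sentence = "Einstein, the physicist, was born in Ulm."
tethered_result = extract_birthplace(sentence, TETHERED)      # no match here
untethered_result = extract_birthplace(sentence, UNTETHERED)  # captures "Ulm"
```

The looser pattern trades precision for recall, which is presumably why it is best paired with filters such as the ones described later in this section.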
- Document-Position-Based Model
The authors present the intuition behind a position-based approach that uses a Gaussian model of where each attribute tends to appear in a document. A figure in the paper shows that the empirical position distributions closely resemble Gaussians. An alternative is a ranking-based model, which is claimed to be very effective for some attributes, for example Deathdate.
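A minimal sketch of the Gaussian position idea, assuming positions are expressed as a fraction of document length (the unit is an assumption, as are all function names and the toy numbers). It fits a mean and standard deviation from training positions and picks the candidate whose position has the highest density.

```python
import math
from statistics import mean, pstdev


def fit_position_model(positions):
    """Fit a Gaussian to positions where the correct value appeared in training."""
    # A small floor on sigma avoids division by zero on degenerate data.
    return mean(positions), max(pstdev(positions), 1e-6)


def position_score(position, mu, sigma):
    """Gaussian density of a candidate's position under the fitted model."""
    z = (position - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))


def best_candidate(candidates, mu, sigma):
    """candidates: list of (value, position) pairs; return the best value."""
    return max(candidates, key=lambda c: position_score(c[1], mu, sigma))[0]


# Toy data: birth dates tend to appear near the start of a page.
mu, sigma = fit_position_model([0.01, 0.02, 0.03, 0.02])
choice = best_candidate([("1879", 0.02), ("1955", 0.60)], mu, sigma)
```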
- Transitivity-Based Model
The authors observe that attributes such as Occupation are transitive in nature: people whose names appear close to the target name probably share the target's occupation. A transitivity-based model is therefore proposed, mostly for attributes such as Occupation and Religion. The authors illustrate this approach with a figure.
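The transitivity intuition can be sketched as a simple vote over the known occupations of people mentioned near the target. The unweighted vote, the function name `transitive_occupation`, and the toy data are assumptions made for illustration; the paper's actual scoring may differ.

```python
from collections import Counter


def transitive_occupation(neighbor_names, known_occupations):
    """Guess the target's occupation from people mentioned on the target's page.

    neighbor_names: names found near the target name.
    known_occupations: dict mapping a name to a list of known occupations.
    """
    votes = Counter()
    for name in neighbor_names:
        for occupation in known_occupations.get(name, ()):
            votes[occupation] += 1
    # Return the most frequent occupation, or None if no neighbor is known.
    return votes.most_common(1)[0][0] if votes else None


known = {
    "Niels Bohr": ["physicist"],
    "Max Planck": ["physicist"],
    "Charlie Chaplin": ["actor"],
}
guess = transitive_occupation(
    ["Niels Bohr", "Max Planck", "Charlie Chaplin", "Unknown Person"], known
)
```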
- Latent-Model-Based Approach
In addition to the transitivity-based model, the authors note that Occupation and other semantic attributes can also be modeled with a topic-model approach. For example, a page about a scientist uses very different vocabulary from a page about a basketball player. The authors use a scoring method similar to the TF-IDF similarity measures of information retrieval. They also point out that multilingual resources would further help resolve ambiguity in this approach.
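A TF-IDF-style scorer for this idea might look like the sketch below. It treats all training pages of one occupation as a single pseudo-document, which is an assumption of this example, as are the function names and the toy vocabulary; it is not the paper's exact formulation.

```python
import math
from collections import Counter


def build_profiles(docs_by_occupation):
    """docs_by_occupation maps an occupation to a list of tokenized pages."""
    counts = {
        occ: Counter(tok for page in pages for tok in page)
        for occ, pages in docs_by_occupation.items()
    }
    n = len(counts)
    # Document frequency: in how many occupation profiles does a token occur?
    df = Counter(tok for c in counts.values() for tok in c)
    idf = {tok: math.log(n / d) for tok, d in df.items()}
    profiles = {}
    for occ, c in counts.items():
        total = sum(c.values())
        profiles[occ] = {tok: (k / total) * idf[tok] for tok, k in c.items()}
    return profiles


def score_occupations(page_tokens, profiles):
    """Sum the profile weights of the tokens that occur on the page."""
    tf = Counter(page_tokens)
    return {
        occ: sum(tf[tok] * w for tok, w in prof.items())
        for occ, prof in profiles.items()
    }


def guess_occupation(page_tokens, profiles):
    scores = score_occupations(page_tokens, profiles)
    return max(scores, key=scores.get)


profiles = build_profiles({
    "physicist": [["quantum", "theory", "university"]],
    "athlete": [["season", "team", "university"]],
})
```

Tokens shared by every occupation, such as "university" in the toy data, receive zero weight, so only discriminative vocabulary contributes to the score.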
- Attribute-Correlation-Based Filter
The authors give the example P(Religion=Hindu | Nationality=India) >> P(Religion=Hindu | Nationality=France) to illustrate the correlation between two attributes. They then present a method that uses the training data to filter out of the output any attribute-value combinations never seen together in training.
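The filter can be sketched as a set-membership check on attribute-value pairs observed in training. The record format, the function names, and the toy records are assumptions of this example.

```python
def seen_pairs(training_records, attr_a, attr_b):
    """Collect every (attr_a value, attr_b value) pair observed in training."""
    return {
        (rec[attr_a], rec[attr_b])
        for rec in training_records
        if attr_a in rec and attr_b in rec
    }


def filter_candidates(candidates, fixed_value, allowed_pairs):
    """Keep candidates whose pairing with an already-extracted value was seen."""
    return [c for c in candidates if (fixed_value, c) in allowed_pairs]


# Toy training records, for illustration only.
training = [
    {"nationality": "India", "religion": "Hindu"},
    {"nationality": "India", "religion": "Muslim"},
    {"nationality": "France", "religion": "Catholic"},
]
allowed = seen_pairs(training, "nationality", "religion")
kept = filter_candidates(["Hindu", "Catholic"], "France", allowed)
```

A hard filter like this discards any combination absent from the training data, so its usefulness depends on how well the training set covers legitimate but rare pairs.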
- Age-Distribution-Based Filter
The authors present the intuition of using a plausible age range to filter out incorrect Birthdate and Deathdate pairs. This is claimed to give an average gain of 5% in their experiments.
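As a rough sketch, the filter can be written as a lifespan check over candidate year pairs. The fixed cutoff `max_age=120` and the function names are assumptions of this example; the paper derives its plausible range from an age distribution rather than a hard-coded bound.

```python
def plausible_lifespan(birth_year, death_year, max_age=120):
    """True if death does not precede birth and the age at death is plausible."""
    return 0 <= death_year - birth_year <= max_age


def filter_date_pairs(birth_candidates, death_candidates, max_age=120):
    """Return all (birth, death) year pairs that pass the lifespan check."""
    return [
        (b, d)
        for b in birth_candidates
        for d in death_candidates
        if plausible_lifespan(b, d, max_age)
    ]


pairs = filter_date_pairs([1879, 1955], [1955, 1700])
```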
Ravichandran and Hovy (ACL 2002), Learning Surface Text Patterns for a Question Answering System, serves as the foundation for this paper; it describes a basic template-based approach to probabilistic modeling of attribute extraction. Mann and Yarowsky (2005) later extend these methods to cross-document fusion of attribute values.