Garera 2009 Structural, Transitive and Latent Models for Biographic Fact Extraction


Citation

Garera, N. and Yarowsky, D. 2009. Structural, Transitive and Latent Models for Biographic Fact Extraction. In Proceedings of EACL.

Online version

An online version of this paper is available at [1].

Summary

This paper introduces several approaches to extracting biographical facts from Wikipedia text, each motivated by an empirical study of the corpus. Collectively they perform significantly better than traditional methods, and the combined method is among the current state-of-the-art approaches.

Key Contributions

The paper presents a series of intuitive hypotheses with supporting empirical findings, which motivate the proposed approaches. The paper also suggests using the NNDB dataset [2] in parallel with Wikipedia text [3] for training and testing.

  • Several intuitive hypotheses are presented, such as that document position, topic information, and correlations between attributes are all useful for extraction. These hypotheses are backed by empirical studies showing skewed distributions, and they motivate the position-based model, the latent-model-based approach, the correlation-based filter, and so on.
  • The paper focuses on extracting attributes from Wikipedia text. However, Wikipedia infoboxes are often incomplete, so NNDB is proposed to serve as the gold standard for attribute extraction. Though the authors do not state it explicitly, any clean, state-of-the-art person-attribute database would be equally well suited for this purpose.

Approaches and Empirical Findings

The authors present a set of six approaches that exploit different sources of information for attribute extraction. Although each approach has its own focus and the attributes to which it is most applicable, the authors claim that all of them can be applied to every attribute, and that a combined back-off model performs best in their experiments.
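
The paper does not spell out the back-off combination in detail; the following Python sketch is only an illustration of the general idea, with an assumed ordering of hypothetical extractor functions: higher-precision extractors are tried first, and the model falls back to the next one whenever an extractor abstains.

    # Illustrative back-off combination (ordering and extractor names are assumed,
    # not taken from the paper): try each extractor in turn and fall back when
    # one returns no value.
    def backoff_extract(text, extractors):
        for extract in extractors:
            value = extract(text)
            if value is not None:
                return value
        return None

    # Toy usage with stand-in extractors.
    extractors = [lambda t: None,        # e.g. a pattern-based extractor that abstains
                  lambda t: "1856"]      # e.g. a position-based fallback
    print(backoff_extract("some biography text", extractors))  # -> 1856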

  • Improved Contextual Pattern-based Model

The authors first refer to the contextual pattern-based model proposed in Ravichandran and Hovy (2002). The model is essentially a probabilistic model with template-based features, in which the features are derived from frequently occurring contexts. For example, in the template "<Name> 'was born in' <Birthplace>", <Name> and <Birthplace> are placeholders and 'was born in' is the relevant context. The authors also note the loss of variability in this method and propose an improved variant that uses partially untethered templates as the context. In their experiments, this improvement yields a significant accuracy gain of 21%.
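
As a rough illustration of how a single contextual template fires, here is a minimal Python sketch; the 'was born in' pattern and the capitalized-word capture are simplifications of my own, whereas the actual model mines many templates from the corpus and scores them probabilistically.

    import re

    # Minimal sketch of one contextual pattern: the template is anchored on the
    # target name, and the attribute value is read off the text that follows the
    # learned context. (Single hand-written pattern; the real model uses many
    # automatically mined templates.)
    def extract_birthplace(text, name):
        pattern = re.escape(name) + r" was born in ([A-Z][a-zA-Z]+)"
        match = re.search(pattern, text)
        return match.group(1) if match else None

    print(extract_birthplace("Nikola Tesla was born in Smiljan, Austrian Empire.",
                             "Nikola Tesla"))  # -> Smiljan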

  • Document-Position-Based Model

The authors present the intuition behind a position-based approach, which fits a Gaussian model to the positions in the document at which different attribute values occur. Using the figure below, they show that these position distributions clearly resemble Gaussians. An alternative is a ranking-based model, which is claimed to be very effective for extracting some attributes, for example Deathdate.

[Figure: Nikesh_1.PNG, empirical document-position distributions of attribute values]
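
A minimal sketch of the position intuition, assuming normalized token positions and toy training data (not the paper's actual estimation code): fit a Gaussian to the positions where an attribute's value appeared in training articles, then score test candidates by that density.

    import math

    # Fit mean and variance of the normalized positions observed in training.
    def fit_gaussian(positions):
        mean = sum(positions) / len(positions)
        var = sum((p - mean) ** 2 for p in positions) / len(positions)
        return mean, max(var, 1e-6)

    # Gaussian density used as a score for a candidate's document position.
    def gaussian_score(position, mean, var):
        return math.exp(-(position - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

    # Toy data: Birthdate values tend to appear very early in a biography.
    train_positions = [0.02, 0.05, 0.03, 0.08, 0.04]
    mean, var = fit_gaussian(train_positions)

    candidates = {"1856": 0.03, "1943": 0.71}  # candidate value -> normalized position
    best = max(candidates, key=lambda v: gaussian_score(candidates[v], mean, var))
    print(best)  # -> 1856, the early-occurring candidate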

  • Transitivity-Based Model

The authors note that attributes such as Occupation are transitive in nature: other people whose names appear close to the target person are likely to share the target's occupation. A transitivity-based model is therefore proposed, mainly for attributes such as Occupation and Religion. The following figure is used by the authors to illustrate this approach.

[Figure: Nikesh_2.PNG, illustration of attribute transitivity across neighboring person names]
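
A toy sketch of the transitivity idea follows; the majority vote and the small lookup table are my simplifications, standing in for a database such as NNDB and for the neighbor names that a named-entity pass would find near the target.

    from collections import Counter

    # Predict the target's occupation from the known occupations of people
    # mentioned nearby, using a simple majority vote.
    def predict_occupation(neighbor_names, known_occupations):
        votes = Counter(known_occupations[name]
                        for name in neighbor_names
                        if name in known_occupations)
        return votes.most_common(1)[0][0] if votes else None

    known_occupations = {"Niels Bohr": "physicist", "Max Planck": "physicist",
                         "Pablo Picasso": "painter"}
    print(predict_occupation(["Niels Bohr", "Max Planck", "Someone Unknown"],
                             known_occupations))  # -> physicist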

  • Latent-Model-Based Approach

In addition to the transitivity-based model, Occupation and other semantic attributes may also be modeled with a topic-model-style approach, as the authors mention: a page about a scientist reads very differently from a page about a basketball player. The authors use a scoring method similar to the TF-IDF similarity measures used in information retrieval, and they point out that multilingual resources would further help resolve ambiguity for this approach.
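
The following Python sketch illustrates the flavor of such TF-IDF-style scoring under assumptions of my own (tiny per-occupation profile texts, whitespace tokenization); it is not the authors' actual scoring function.

    import math
    from collections import Counter

    def tokenize(text):
        return text.lower().replace(".", " ").replace(",", " ").split()

    # Smoothed inverse document frequency over the per-occupation profiles.
    def build_idf(profile_texts):
        n = len(profile_texts)
        df = Counter(w for text in profile_texts for w in set(tokenize(text)))
        return {w: math.log((n + 1) / (df[w] + 1)) + 1.0 for w in df}

    def vectorize(text, idf):
        tf = Counter(tokenize(text))
        return {w: tf[w] * idf.get(w, 0.0) for w in tf}

    def cosine(u, v):
        dot = sum(u[w] * v.get(w, 0.0) for w in u)
        norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
        return dot / norm if norm else 0.0

    profiles = {  # toy profiles; real ones would be aggregated from training pages
        "scientist": "physics laboratory theory professor university research experiment",
        "basketball player": "basketball nba team season points coach playoffs game",
    }
    idf = build_idf(list(profiles.values()))
    profile_vecs = {occ: vectorize(text, idf) for occ, text in profiles.items()}

    biography = "He was a professor of physics who ran a university laboratory."
    bio_vec = vectorize(biography, idf)
    print(max(profile_vecs, key=lambda occ: cosine(bio_vec, profile_vecs[occ])))  # -> scientist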

  • Attribute-Correlation-Based Filter

The authors give the example P(Religion=Hindu | Nationality=Indian) >> P(Religion=Hindu | Nationality=French) to illustrate the correlation between two attributes. They then present a filter that uses the training data to remove attribute combinations never observed in training from the output.
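
A minimal sketch of such a filter, with a toy set of co-occurring pairs standing in for the statistics gathered from the training data:

    # Keep only attribute pairs that co-occurred in training; reject unseen combinations.
    seen_pairs = {("Indian", "Hindu"), ("Indian", "Muslim"), ("French", "Catholic")}  # toy co-occurrences

    def passes_correlation_filter(nationality, religion):
        return (nationality, religion) in seen_pairs

    print(passes_correlation_filter("Indian", "Hindu"))  # True, kept
    print(passes_correlation_filter("French", "Hindu"))  # False, filtered out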

  • Age-Distribution-Based Filter

The authors present the intuition of using a plausible age range to filter out incorrect Birthdate/Deathdate pairs; this is claimed to give an average gain of 5% in their experiments.
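
A sketch of the idea, where the 0-110 year bound is an assumption of mine rather than the paper's exact threshold:

    # Reject birth/death year pairs that imply an implausible lifespan.
    def plausible_lifespan(birth_year, death_year, max_age=110):
        age = death_year - birth_year
        return 0 <= age <= max_age

    print(plausible_lifespan(1856, 1943))  # True  (age 87)
    print(plausible_lifespan(1943, 1856))  # False (negative age)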

Related papers

The Ravichandran and Hovy, ACL 2002 paper ("Learning Surface Text Patterns for a Question Answering System") serves as the foundation for this work: it introduces a basic template-based approach to probabilistic modeling for attribute extraction. Mann and Yarowsky, 2005 later extended these methods to cross-document fusion of attribute values.