Garera 2009 Structural, Transitive and Latent Models for Biographic Fact Extraction

From Cohen Courses
Revision as of 18:00, 31 October 2010

Citation

Garera, N. and Yarowsky, D. 2009. Structural, Transitive and Latent Models for Biographic Fact Extraction. In Proceedings of EACL.

Online version

An online version of this paper is available at [1].

Summary

This paper introduces several approaches to extracting biographical information from Wikipedia text, each inspired by an empirical study of the corpus. Collectively they perform significantly better than traditional methods, and the combined method is among the current state-of-the-art.

Key Contributions

The paper presents a series of intuitive hypotheses with supporting empirical findings, which motivate the proposed approaches. It also suggests using NNDB in parallel with Wikipedia text for training and testing.

  • Several intuitive hypotheses are presented: that position information, topic information, and correlations between attributes are all useful for extraction. These hypotheses are backed by empirical studies showing highly skewed distributions, and they motivate the position-based model, the latent model, the correlation-based filter, etc.
  • The paper focuses on the problem of extracting attributes from Wikipedia text. However, Wikipedia infoboxes are often incomplete, so the use of NNDB is proposed to serve as the gold standard for attribute extraction. Though the authors do not say so explicitly, any clean state-of-the-art person-attribute database would be well suited for this purpose, too.
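The position-based idea above can be sketched as follows: for each attribute, prefer the candidate value found at the document position where that attribute's values were empirically most likely in training. This is a minimal illustration only; the attribute names and the prior probabilities below are hypothetical, not taken from the paper.

```python
# Sketch of a position-based extractor. P(position | attribute) would be
# estimated from training data; the numbers here are illustrative.
POSITION_PRIOR = {
    "birthdate": {0: 0.60, 1: 0.25, 2: 0.15},   # usually in the first sentence
    "occupation": {0: 0.30, 1: 0.45, 2: 0.25},
}

def pick_by_position(attribute, candidates):
    """candidates: list of (value, sentence_index) pairs."""
    prior = POSITION_PRIOR[attribute]
    # Choose the candidate whose sentence position is most probable a priori.
    return max(candidates, key=lambda vc: prior.get(vc[1], 0.0))[0]

# A biography where two date-like candidates appear in different sentences:
cands = [("1879-03-14", 0), ("1955-04-18", 2)]
print(pick_by_position("birthdate", cands))  # the sentence-0 candidate wins
```

In the paper this prior is combined with other evidence (topic and attribute correlations); the sketch isolates only the positional component.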

Introduction

The paper first introduces the major problem of information extraction for voicemail: identifying the caller and a callback number, if available. If this information is not extracted, it takes a listener 36 seconds on average to find it, since she/he has to listen to the whole voicemail.

This paper focuses only on transcribed voicemail text rather than including a speech recognition front-end. It still differs from traditional Named Entity Recognition and phone number extraction tasks for two major reasons: first, a voicemail transcript is text based on spoken language, so the linguistic elements differ (for example, 400-1425 can be read as "four zero zero fourteen twenty five"); second, the structure of voicemail can be exploited to extract caller identifications without a sophisticated structured model for NER.
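The spoken-language issue above can be made concrete with a small normalizer that turns a spoken digit sequence such as "four zero zero fourteen twenty five" into digits. The vocabulary and rules below are a simplified assumption for illustration, not the paper's actual grammar.

```python
# Minimal sketch: normalize spoken phone-number fragments into digit strings.
DIGITS = {"zero": "0", "oh": "0", "one": "1", "two": "2", "three": "3",
          "four": "4", "five": "5", "six": "6", "seven": "7",
          "eight": "8", "nine": "9"}
TEENS = {"ten": "10", "eleven": "11", "twelve": "12", "thirteen": "13",
         "fourteen": "14", "fifteen": "15", "sixteen": "16",
         "seventeen": "17", "eighteen": "18", "nineteen": "19"}
TENS = {"twenty": "2", "thirty": "3", "forty": "4", "fifty": "5",
        "sixty": "6", "seventy": "7", "eighty": "8", "ninety": "9"}

def spoken_to_digits(text):
    out, tokens = [], text.lower().split()
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok in DIGITS:
            out.append(DIGITS[tok])
        elif tok in TEENS:
            out.append(TEENS[tok])
        elif tok in TENS:
            # "twenty five" -> "25"; a bare "twenty" -> "20"
            if i + 1 < len(tokens) and tokens[i + 1] in DIGITS and tokens[i + 1] != "zero":
                out.append(TENS[tok] + DIGITS[tokens[i + 1]])
                i += 1
            else:
                out.append(TENS[tok] + "0")
        i += 1
    return "".join(out)

print(spoken_to_digits("four zero zero fourteen twenty five"))  # 4001425
```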

Instead of using a MaxEnt extractor with ngram features as described in previous work, this paper first presents an empirical analysis of the data and then focuses on heuristic features with a decision tree model.

Proprietary Voicemail Transcription Dataset

The paper uses a proprietary data set consisting of almost 10,000 voicemail messages with manual transcriptions and markup, as illustrated in the following excerpt:

<greeting> hi Jane </greeting> <caller> this is Pat Caller </caller> I just wanted to I know you’ve probably seen this or maybe you already know about it . . . so if you could give me a call at <telno> one two three four five </telno> when you get the message I’d like to chat about it hope things are well with you <closing> talk to you soon </closing>
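Given markup in this form, the annotated spans can be pulled out with a simple regular expression over the paired tags. This is a hedged sketch for working with the excerpt format above, not the authors' tooling.

```python
import re

def extract_spans(transcript, tag):
    """Return the contents of all <tag>...</tag> spans in the transcript."""
    return re.findall(r"<{0}>\s*(.*?)\s*</{0}>".format(tag), transcript)

msg = ("<greeting> hi Jane </greeting> <caller> this is Pat Caller </caller> "
       "give me a call at <telno> one two three four five </telno>")
print(extract_spans(msg, "caller"))  # ['this is Pat Caller']
print(extract_spans(msg, "telno"))   # ['one two three four five']
```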

Algorithms and Experimental Results

The authors present two sets of algorithms for two different sub-tasks: Caller Identification and Phone Number Extraction.

  • For Caller Identification (Caller Phrase/Caller Name), the authors focus on two target variables: the starting position of the Caller Phrase/Caller Name and its length. They first show that the empirical distributions of position and length are both highly skewed. They then apply a decision tree learner with a small set of common-word features to predict these two variables. Compared with previous work, the new algorithm actually performs worse on their own dataset; however, it shows a large improvement on unseen data, for which they used ASR (automatic speech recognition) output. The authors thus argue that the previous method, built on a generic named entity recognizer, tends to overfit the dataset, while the new algorithm is more robust to unseen data. They also transfer this technique to extracting Caller Names instead of Caller Phrases.
  • For Phone Number Extraction, the authors propose a two-phase approach. In the first phase, the algorithm uses a hand-crafted grammar to propose candidate phone numbers and convert them into numeric representations. In the second phase, a binary classifier judges the validity of each candidate. In the performance comparison, this rather simple method proves very effective, and the authors report a 10% improvement in F-measure over the previous method.
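The caller-identification idea can be illustrated with a toy predictor of the caller phrase's start position and length. The paper learns a decision tree from data; here a couple of hand-written rules stand in for the learned tree, and the cue words and cutoffs are illustrative assumptions only.

```python
def predict_caller_span(tokens):
    """Return (start, length) of the likely caller phrase, or (0, 0)."""
    # Empirically the caller phrase starts near the beginning of the
    # message, often right after a greeting, cued by "this is" / "it's".
    for i in range(min(len(tokens), 10)):          # positions skew early
        if tokens[i:i + 2] == ["this", "is"] or tokens[i] == "it's":
            start, length = i, 2
            # Lengths also skew short; stop at a common function word.
            while (start + length < len(tokens)
                   and tokens[start + length] not in {"i", "and", "just", "so"}):
                length += 1
            return start, length
    return 0, 0

tokens = "hi jane this is pat caller i just wanted to check in".split()
print(predict_caller_span(tokens))  # (2, 4) -> "this is pat caller"
```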
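The two-phase phone-number extraction can be sketched as: phase 1 proposes candidates with a small hand-written pattern and converts them to digits; phase 2 filters candidates with a validity check standing in for the paper's binary classifier. Both the pattern and the length check are simplified assumptions.

```python
import re

WORD2DIGIT = {"zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
              "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9"}

def propose_candidates(text):
    """Phase 1: find runs of spoken digits and render them numerically."""
    pattern = r"(?:\b(?:%s)\b[ ]?)+" % "|".join(WORD2DIGIT)
    for m in re.finditer(pattern, text.lower()):
        yield "".join(WORD2DIGIT[w] for w in m.group().split())

def is_valid_number(digits):
    """Phase 2 stand-in: accept plausible phone-number lengths only."""
    return len(digits) in (7, 10)

text = "call me back at five five five one two three four thanks bye"
numbers = [d for d in propose_candidates(text) if is_valid_number(d)]
print(numbers)  # ['5551234']
```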

Related papers

The Huang et al., 2001 paper addressed a very similar problem from a more traditional perspective. It studied three approaches: hand-crafted rules, grammatical inference of subsequential transducers, and a log-linear classifier with bigram and trigram features, which is essentially the same model as in the Ratnaparkhi, 1996 paper on MaxEnt POS tagging.