Mann 2005 Multi-Field Information Extraction and Cross-Document Fusion

From Cohen Courses
Revision as of 00:57, 1 December 2010

Citation

Mann, G. and Yarowsky, D. 2005. Multi-Field Information Extraction and Cross-Document Fusion. In Proceedings of ACL.

Online version

An online version of this paper is available at [1].

Summary

This paper presents a method for extracting biographic facts about target individuals from web pages. It introduces and evaluates methods for fusing the extracted information across documents to return a consensus answer. The approach could be combined with cross-document co-reference resolution for large-scale information extraction (in particular, of people and their biographic facts).

Key Contributions

The paper presents several novel ideas. The first is information fusion, which is widely applicable in KnowItAll/NELL-like systems that leverage redundant web data. The second is automatic annotation combined with a conditional random field-based method for extracting biographic facts. The third is a cross-field bootstrapping method that leverages inter-dependencies in the data.

Approaches and Experimental Results

The authors present a set of three novel approaches: automatic annotation with statistical models, information fusion, and an inter-dependency model. They claim that automatic annotation is a viable way to train extraction models, that information fusion greatly improves performance, and that the inter-dependency model lifts the performance on individual attributes.

  • Automatic Annotation for Training Statistical Extraction Models

The authors first note that statistical extraction systems (such as HMMs and CRFs) are trained on hand-annotated data. Annotating the necessary data by hand is time-consuming, and the result is brittle, since a change in the annotation scheme may require large-scale re-annotation. For training the Rote model, however, an alternative is available that directly computes the probability of a positive sample. The authors then extend this method to Naive Bayes and Conditional Random Field models and report good performance, in particular for the CRF-based model trained with negative samples.
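As a rough illustration of automatic annotation, the sketch below labels training sentences from a known (name, value) pair instead of hand annotation. It is a minimal sketch under stated assumptions, not the paper's procedure: the function name `auto_annotate`, the `O`/`VALUE` label set, whitespace tokenization, and single-token names are all hypothetical simplifications.

```python
# Hypothetical sketch: derive token labels from a known (name, value) seed pair.
PUNCT = ".,;:()\"'"

def auto_annotate(sentences, name, value):
    """Return (tokens, labels) for sentences that mention both name and value.

    Assumptions (not from the paper): whitespace tokenization, a single-token
    name, and a two-label scheme where "VALUE" marks the attribute value.
    """
    value_tokens = value.split()
    n = len(value_tokens)
    annotated = []
    for sentence in sentences:
        tokens = sentence.split()
        cleaned = [t.strip(PUNCT) for t in tokens]
        if name not in cleaned:
            continue  # the sentence does not mention the target person
        labels = ["O"] * len(tokens)
        for i in range(len(tokens) - n + 1):
            if cleaned[i:i + n] == value_tokens:
                labels[i:i + n] = ["VALUE"] * n
        if "VALUE" in labels:
            annotated.append((tokens, labels))
    return annotated

if __name__ == "__main__":
    sentences = [
        "Wolfgang Mozart was born in 1756 in Salzburg.",
        "Mozart wrote many operas.",
    ]
    for tokens, labels in auto_annotate(sentences, "Mozart", "1756"):
        print(list(zip(tokens, labels)))
```

Sentences that mention the person but not the known value are skipped here; a fuller version might keep them as negative samples, which the summary above suggests matter for the CRF-based model.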

  • Cross-Document Information Fusion

The authors present a novel approach to combining the attribute values extracted for one person across documents.

Nikesh 1.PNG
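One simple way to picture cross-document fusion is confidence-weighted voting over the values extracted from different pages. The sketch below is only an assumed baseline for illustration; the summary does not specify the paper's actual fusion rule, and the function name `fuse` and the (value, probability) input format are hypothetical.

```python
from collections import defaultdict

def fuse(candidates):
    """Pick a consensus value from (value, probability) pairs.

    Each pair is assumed to come from one document's extractor output.
    Probabilities for the same value are summed; the top total wins.
    Returns None when there are no candidates.
    """
    scores = defaultdict(float)
    for value, prob in candidates:
        scores[value] += prob
    if not scores:
        return None
    return max(scores, key=scores.get)

if __name__ == "__main__":
    # Toy extractions of a birth year from three documents.
    print(fuse([("1756", 0.6), ("1765", 0.9), ("1756", 0.5)]))
```

In this toy input the value supported by two documents wins even though a single document proposes a different value with higher confidence, which is the kind of redundancy that fusion is meant to exploit.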

  • Transitivity-Based Model

The authors observe that attributes such as Occupation are transitive in nature: people whose names appear close to the target name probably share the target's occupation. They therefore propose a transitivity-based model, mainly for attributes such as Occupation and Religion. The authors use the following figure to illustrate this approach.

Nikesh 2.PNG
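The transitivity intuition can be sketched as a vote among nearby people whose occupations are already known. This is a hedged illustration only: the window size, the single-token names, and the function name `guess_occupation` are assumptions made here, not details taken from the paper.

```python
from collections import Counter

def guess_occupation(tokens, target, known_occupations, window=5):
    """Guess the target's occupation from people mentioned nearby.

    tokens: tokenized page text (names assumed to be single tokens).
    known_occupations: hypothetical dict mapping a name to its occupation.
    window: maximum token distance that counts as "close" (an assumption).
    """
    positions = [i for i, tok in enumerate(tokens) if tok == target]
    votes = Counter()
    for i, tok in enumerate(tokens):
        if tok == target or tok not in known_occupations:
            continue
        if any(abs(i - p) <= window for p in positions):
            votes[known_occupations[tok]] += 1
    return votes.most_common(1)[0][0] if votes else None

if __name__ == "__main__":
    text = "Smith played alongside Jordan and Pippen while Einstein lectured far away"
    known = {
        "Jordan": "basketball player",
        "Pippen": "basketball player",
        "Einstein": "physicist",
    }
    print(guess_occupation(text.split(), "Smith", known))
```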

  • Latent-Model-Based Approach

In addition to the transitivity-based model, the authors note that Occupation and other semantic attributes may also be modeled with a topic-model approach. For example, a page about a scientist is typically very different from a page about a basketball player. The authors use a scoring method similar to the TF-IDF similarity measures used in information retrieval. They also point out that multilingual resources would further help resolve ambiguity in this approach.
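To make the TF-IDF-style scoring concrete, the sketch below builds a term profile per occupation and scores a new page against each profile. It is a generic TF-IDF variant chosen for illustration, not the paper's exact formula; `build_profiles`, `best_label`, and the toy vocabulary are hypothetical.

```python
import math
from collections import Counter

def build_profiles(pages_by_label):
    """Build a TF-IDF-style term profile for each label.

    pages_by_label: dict mapping a label (e.g. an occupation) to a list of
    tokenized pages. Term frequency is counted per label, and the inverse
    frequency is taken over labels, so a term seen under every label
    gets weight 0.
    """
    counts = {
        label: Counter(tok for page in pages for tok in page)
        for label, pages in pages_by_label.items()
    }
    n_labels = len(counts)
    label_freq = Counter()
    for c in counts.values():
        label_freq.update(c.keys())
    return {
        label: {t: tf * math.log(n_labels / label_freq[t]) for t, tf in c.items()}
        for label, c in counts.items()
    }

def best_label(page_tokens, profiles):
    """Return the label whose profile gives the page the highest score."""
    if not profiles:
        return None
    scores = {
        label: sum(weights.get(t, 0.0) for t in page_tokens)
        for label, weights in profiles.items()
    }
    return max(scores, key=scores.get)

if __name__ == "__main__":
    training = {
        "scientist": [["physics", "theory", "the"]],
        "basketball player": [["nba", "points", "the"]],
    }
    profiles = build_profiles(training)
    print(best_label(["the", "nba", "season"], profiles))
```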

  • Attribute-Correlation-Based Filter

The authors give the example P(Religion=Hindu | Nation=India) >> P(Religion=Hindu | Nation=France) to illustrate the correlation between two attributes. Their method then uses the training data to filter out of the output any attribute combinations that were not seen in training.
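The filtering step can be sketched as a lookup against attribute pairs observed in training. This is a minimal sketch, assuming records are plain dicts; the field names, the toy records, and the function names `seen_pairs` and `filter_unseen` are illustrative assumptions rather than the paper's implementation.

```python
def seen_pairs(training_records, field_a, field_b):
    """Collect the (field_a, field_b) value pairs observed in training."""
    return {
        (r[field_a], r[field_b])
        for r in training_records
        if field_a in r and field_b in r
    }

def filter_unseen(candidates, allowed, field_a, field_b):
    """Keep only candidates whose value pair was observed in training."""
    return [
        c for c in candidates
        if (c.get(field_a), c.get(field_b)) in allowed
    ]

if __name__ == "__main__":
    # Toy records for illustration only.
    training = [
        {"nation": "India", "religion": "Hindu"},
        {"nation": "France", "religion": "Catholic"},
    ]
    allowed = seen_pairs(training, "nation", "religion")
    output = [
        {"name": "A", "nation": "India", "religion": "Hindu"},
        {"name": "B", "nation": "France", "religion": "Hindu"},
    ]
    print(filter_unseen(output, allowed, "nation", "religion"))
```

A hard filter like this discards rare but valid combinations; a softer variant could down-weight unseen pairs instead of removing them.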

  • Age-Distribution-Based Filter

The authors present the intuition of using a plausible age range to filter out incorrect Birthdate and Deathdate pairs; they report an average gain of 5% in their experiments.
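The age-range check reduces to simple arithmetic on candidate year pairs. In the sketch below the bounds of 0 and 110 years are assumed for illustration (the summary does not give the paper's range), and the function names are hypothetical.

```python
def plausible_lifespan(birth_year, death_year, min_age=0, max_age=110):
    """Return True if the implied age at death falls in [min_age, max_age].

    The default bounds are illustrative assumptions, not the paper's values.
    """
    age = death_year - birth_year
    return min_age <= age <= max_age

def filter_date_pairs(pairs, min_age=0, max_age=110):
    """Keep only (birth_year, death_year) pairs with a plausible lifespan."""
    return [
        (b, d) for b, d in pairs
        if plausible_lifespan(b, d, min_age, max_age)
    ]

if __name__ == "__main__":
    candidates = [(1756, 1791), (1791, 1756), (1756, 1991)]
    print(filter_date_pairs(candidates))
```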

Related papers

The paper by Ravichandran and Hovy (ACL 2002), Learning Surface Text Patterns for a Question Answering System, serves as the foundation for this paper; it describes a basic template-based approach to probabilistic modeling of attribute extraction. Mann and Yarowsky (2005) later extend these methods with cross-document fusion of the attribute values.