Mann 2005 Multi-Field Information Extraction and Cross-Document Fusion
Citation
Mann, G. and Yarowsky, D. 2005. Multi-Field Information Extraction and Cross-Document Fusion. In Proceedings of ACL.
Online version
An online version of this paper is available at [1].
Summary
This paper presents a method for extracting biographic facts about target individuals from web pages. It introduces and evaluates methods for fusing the extracted information across documents to return a consensus answer. It could be combined with cross-document co-reference resolution for large-scale information extraction (in particular, of people and their biographic facts).
Key Contributions
The paper presents several novel ideas. The first is information fusion, which is widely applicable in KnowItAll/NELL-like systems to leverage redundant web data. The second is automatic annotation combined with a conditional random field (CRF)-based method for extracting biographic facts. The third is a cross-field bootstrapping method that leverages inter-dependencies in the data.
Approaches and Experimental Results
The authors present three novel approaches: automatic annotation for training statistical models, information fusion, and an inter-dependency model. They claim that automatically annotated data can be used to train extraction models, that information fusion greatly improves performance, and that the inter-dependency model lifts the performance of individual fields.
- Automatic Annotation for Training Statistical Extraction Models
The authors first note that statistical extraction systems (such as HMMs and CRFs) are typically trained on hand-annotated data. Annotating the necessary data by hand is time consuming and brittle, since it may require large-scale re-annotation when the annotation scheme changes. For training the Rote model, however, an alternative is available that directly computes the probability of a positive sample. The authors then carefully extend this method to Naive Bayes and Conditional Random Fields and show good performance, in particular for the CRF-based model trained with negative samples.
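The idea of automatic annotation can be illustrated with a minimal sketch, assuming a known attribute value for a person is simply matched against sentence tokens to produce training labels. The function name `auto_annotate`, the `TARGET`/`O` tag names, and the punctuation-stripping rule are illustrative assumptions, not the procedure from the paper.

```python
def auto_annotate(sentence, target_value, label="TARGET"):
    """Tag tokens that match a known attribute value; all other tokens get "O".

    Hypothetical helper: whitespace tokenization and exact matching are
    simplifying assumptions made for this sketch.
    """
    tokens = sentence.split()
    value_tokens = target_value.split()
    tags = ["O"] * len(tokens)
    n = len(value_tokens)
    for i in range(len(tokens) - n + 1):
        # Compare a window of tokens, ignoring surrounding punctuation.
        window = [t.strip(".,;:()") for t in tokens[i:i + n]]
        if n > 0 and window == value_tokens:
            for j in range(i, i + n):
                tags[j] = label
    return list(zip(tokens, tags))


# A sentence containing the known birth year yields a positive example.
positive = auto_annotate("Mozart was born in 1756 in Salzburg.", "1756")
# A sentence without it is tagged entirely "O" and can act as a negative example.
negative = auto_annotate("Mozart wrote many operas.", "1756")
```

Sentences labeled this way could then be fed to a sequence model in place of hand-annotated data; how the paper selects and weights negative samples is not shown here.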
- Cross-Document Information Fusion
The authors present a novel approach for combining the attribute values extracted for one person across documents. Two alternatives are considered: picking the single most probable value, or voting across extractions for the most confident result. The voting method outperforms the first by almost 10% in each of the models tested.
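The difference between the two fusion strategies can be sketched as follows. This is a minimal illustration, assuming each document contributes a `(value, probability)` pair and that voting sums probabilities per value; the exact voting scheme in the paper may differ, and the function names and sample numbers are hypothetical.

```python
from collections import defaultdict


def fuse_most_probable(candidates):
    """Return the value of the single highest-probability extraction."""
    return max(candidates, key=lambda pair: pair[1])[0]


def fuse_by_voting(candidates):
    """Return the value with the largest summed probability across documents.

    Summing probabilities is an assumption of this sketch; a plain count
    of extractions per value would be another reasonable variant.
    """
    totals = defaultdict(float)
    for value, prob in candidates:
        totals[value] += prob
    return max(totals, key=totals.get)


# Illustrative (made-up) extractions of a birth year from three documents.
candidates = [("1756", 0.6), ("1756", 0.5), ("1791", 0.9)]

best_single = fuse_most_probable(candidates)  # one confident but isolated extraction wins
best_voted = fuse_by_voting(candidates)       # repeated agreement across documents wins
```

In this toy input the two strategies disagree: a single high-confidence extraction is outvoted by two moderately confident extractions that agree, which is the kind of redundancy the fusion step is meant to exploit.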
- Cross-Field Bootstrapping
The authors then propose bootstrapping across fields, using knowledge of one relationship to improve extraction of another. For example, to extract the birth year given knowledge of the birthday, they argue that in training each hook corpus can be marked up with the known birthday b : birthday(x, b) and the target birth year y : birthyear(x, y), with an additional feature added to the CRF that indicates whether the birthday has been seen in the sentence. In testing, for each hook, the birthday is first found using the methods presented in the previous sections, the corpus is annotated with the extracted birthday, and the birth year CRF is then applied.
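The extra cross-field feature can be sketched as a per-token feature function of the kind typically passed to a CRF toolkit. This is a minimal sketch under stated assumptions: the function name `token_features`, the feature names, and the substring test for the birthday are all hypothetical and are not taken from the paper.

```python
def token_features(tokens, i, known_birthday=None):
    """Build a feature dict for token i, including a cross-field indicator.

    `known_birthday` is the gold birthday in training, or the previously
    extracted birthday in testing (an assumption of this sketch).
    """
    token = tokens[i]
    sentence = " ".join(tokens)
    return {
        "word": token.lower(),
        # Simple shape cue for a candidate year.
        "is_4digit": token.isdigit() and len(token) == 4,
        # Cross-field feature: has the birthday been seen in this sentence?
        "birthday_in_sentence": (
            known_birthday is not None and known_birthday in sentence
        ),
    }


tokens = "He was born on January 27 , 1756 .".split()
with_birthday = token_features(tokens, 7, known_birthday="January 27")
without_birthday = token_features(tokens, 7)
```

With the indicator set, a model can learn that a four-digit token in a sentence that also contains the person's birthday is more likely to be the birth year than the same token elsewhere.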
Related papers
This paper builds on earlier work on biographic extraction by Ravichandran and Hovy, ACL 2002: Learning Surface Text Patterns for a Question Answering System. It is also related to the later Garera 2009 Structural, Transitive and Latent Models for Biographic Fact Extraction.