Huang et al, Coling 2010: Enhancing Cross Document Coreference of Web Documents with Context Similarity and Very Large Scale Text Categorization

Citation

Jian Huang, Pucktada Treeratpituk, Sarah M. Taylor and C. Lee Giles. 2010. Enhancing Cross Document Coreference of Web Documents with Context Similarity and Very Large Scale Text Categorization. In Proceedings of the 23rd International Conference on Computational Linguistics (Coling), pages 483–491.

Online version

An online version of this paper is available [1].

Summary

This paper addresses the problem of Cross Document Coreference (CDC) for web-scale corpora of documents. It uses document-level categories, sub-document-level context, and extracted entities and relations as features for a composite pairwise coreference function, and then applies a density-based clustering algorithm to the pairwise scores.

Very Large Scale Text Categorization Component

The authors use the Open Directory Project (ODP), in which 2 million Web pages are labeled with hundreds of thousands of categories. They adopt a flat multiclass online classification algorithm called Passive Aggressive (PA) to predict ranked categories for web documents. For a categorization problem with C categories, PA associates each category k with a weight vector, called its prototype. The confidence of predicting category k for an instance x (both in online training and in testing) is determined by the similarity between the instance and the prototype, and PA outputs a list of categories ranked by this confidence. PA is similar to the Multiclass Perceptron, but it updates only two vectors per iteration and is therefore more efficient.
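
The prototype-based ranking and the two-vector update can be sketched as follows. This is a minimal illustration, assuming the basic multiclass PA update with a margin of 1 and sparse dictionary vectors; the class name `MulticlassPA`, the helper `dot`, and the toy features are hypothetical, and the exact PA variant used in the paper is not specified in this summary.

```python
from typing import Dict, List

Vector = Dict[str, float]  # sparse feature vector: term -> weight


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two sparse vectors."""
    if len(a) > len(b):
        a, b = b, a
    return sum(v * b.get(k, 0.0) for k, v in a.items())


class MulticlassPA:
    """Illustrative multiclass Passive Aggressive learner (assumed basic variant)."""

    def __init__(self, categories: List[str]) -> None:
        # One prototype (weight vector) per category, initially empty (all zeros).
        self.prototypes: Dict[str, Vector] = {c: {} for c in categories}

    def rank(self, x: Vector) -> List[str]:
        """Return categories sorted by decreasing similarity to their prototype."""
        return sorted(self.prototypes,
                      key=lambda c: dot(self.prototypes[c], x),
                      reverse=True)

    def update(self, x: Vector, true_cat: str) -> None:
        """Online update that touches only two prototypes."""
        scores = {c: dot(w, x) for c, w in self.prototypes.items()}
        # Highest-scoring wrong category.
        rival = max((c for c in scores if c != true_cat), key=scores.get)
        loss = max(0.0, 1.0 - (scores[true_cat] - scores[rival]))
        sq_norm = dot(x, x)
        if loss == 0.0 or sq_norm == 0.0:
            return  # passive step: margin already satisfied
        tau = loss / (2.0 * sq_norm)  # aggressive step size
        w_true = self.prototypes[true_cat]
        w_rival = self.prototypes[rival]
        for k, v in x.items():
            w_true[k] = w_true.get(k, 0.0) + tau * v
            w_rival[k] = w_rival.get(k, 0.0) - tau * v


# Toy usage with made-up categories and features.
model = MulticlassPA(["Science", "Sports", "Business"])
training = [({"gene": 1.0, "cell": 1.0}, "Science"),
            ({"goal": 1.0, "team": 1.0}, "Sports"),
            ({"stock": 1.0, "bank": 1.0}, "Business")]
for features, label in training:
    model.update(features, label)
ranked = model.rank({"gene": 1.0})
```

Only the prototype of the true category and that of its strongest rival change in each step, which is the source of the efficiency gain over updating every category.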

The authors use the dot products between the lowest common ancestors of category pairs to measure similarity between the top K categories for any two documents.
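
As a small illustration of the lowest-common-ancestor notion only, the sketch below treats a category as a slash-separated path in the hierarchy and returns the longest shared prefix of two paths. The function name and the example paths are hypothetical, and the sketch does not reproduce the paper's similarity computation.

```python
from typing import List


def lowest_common_ancestor(path_a: str, path_b: str, sep: str = "/") -> str:
    """Deepest category that is an ancestor of (or equal to) both paths."""
    shared: List[str] = []
    for a, b in zip(path_a.split(sep), path_b.split(sep)):
        if a != b:
            break
        shared.append(a)
    return sep.join(shared)


# Illustrative category paths.
lca = lowest_common_ancestor("Top/Science/Biology/Genetics",
                             "Top/Science/Physics")
```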

Information Extraction Component

The authors use the information extraction tool AeroText. As mentioned in Huang et al, ACL 2009: Profile Based Cross-Document Coreference Using Kernelized Fuzzy Relational Clustering, AeroText extracts two types of information for an entity. The first is attribute information about the person named entity, including first/middle/last names, gender, mention, etc. The second is relationship information between named entities, such as Family, List, Employment, Ownership, Citizen-Resident-Religion-Ethnicity and so on, as specified in the ACE evaluation. AeroText resolves the references of entities within a document and produces entity profiles, which are used as input to the CDC system.

Context Matching Component

This component of the CDC system uses the context built from the sentences that form the NE's within-document coreference chain. The context is represented as a term vector whose terms are weighted by the TF-IDF weighting scheme. For a pair of NEs, the context matching component measures the cosine similarity of their context term vectors.
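
The context matching step can be sketched as below. This is a minimal illustration, assuming pre-tokenized contexts and a smoothed IDF of the form log((1 + N) / (1 + df)) + 1, which is one common variant and not necessarily the one used in the paper; the function names and toy contexts are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf_vectors(contexts: List[List[str]]) -> List[Dict[str, float]]:
    """Build one sparse TF-IDF vector per tokenized context."""
    n = len(contexts)
    # Document frequency: number of contexts containing each term.
    df = Counter(term for ctx in contexts for term in set(ctx))
    vectors = []
    for ctx in contexts:
        tf = Counter(ctx)
        vectors.append({
            term: count * (math.log((1 + n) / (1 + df[term])) + 1.0)
            for term, count in tf.items()
        })
    return vectors


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity of two sparse vectors (0.0 if either is empty)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy contexts for three mentions of the same name string.
contexts = [
    "professor of computer science at the university".split(),
    "the professor teaches computer science courses".split(),
    "the striker scored two goals for the club".split(),
]
vecs = tfidf_vectors(contexts)
sim_same = cosine(vecs[0], vecs[1])   # two academic contexts
sim_diff = cosine(vecs[0], vecs[2])   # academic vs. football context
```

In this toy setup the two academic contexts share several weighted terms, so their similarity is higher than that of the academic and football contexts, which share only the word "the".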

Composite Pairwise Coreference

The authors use Random Forest (RF) to combine the expert components into a single composite pairwise similarity score. RF is an ensemble classifier composed of a collection of randomized decision trees. The authors consider RF well suited to the CDC task for two reasons. First, not all the CDC features may be active for a given pair, and RF can handle this problem. Second, features that are active may not have been predicted with the desired level of confidence.

Clustering

Using the confidence of the pairwise coreference prediction as a distance metric, the authors adopt a density-based clustering method called DBSCAN to induce the clusters corresponding to distinct entities.
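
A density-based clustering pass over pairwise scores can be sketched as follows. This is a minimal illustration, assuming that distance is taken as 1 minus the pairwise coreference confidence and that the neighborhood of a point includes the point itself; the `eps` and `min_pts` values and the confidence matrix are made up, and the paper's actual settings are not given in this summary.

```python
from typing import List, Optional

NOISE = -1


def dbscan(dist: List[List[float]], eps: float, min_pts: int) -> List[int]:
    """Minimal DBSCAN over a precomputed symmetric distance matrix.

    Returns one cluster id per point; NOISE (-1) marks unclustered points.
    """
    n = len(dist)
    labels: List[Optional[int]] = [None] * n

    def neighbors(i: int) -> List[int]:
        return [j for j in range(n) if dist[i][j] <= eps]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: join, do not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:  # core point: keep expanding
                queue.extend(j_neighbors)
    return [label if label is not None else NOISE for label in labels]


# Made-up pairwise coreference confidences for five mentions.
confidence = [
    [1.00, 0.90, 0.80, 0.10, 0.10],
    [0.90, 1.00, 0.85, 0.10, 0.10],
    [0.80, 0.85, 1.00, 0.10, 0.10],
    [0.10, 0.10, 0.10, 1.00, 0.90],
    [0.10, 0.10, 0.10, 0.90, 1.00],
]
distance = [[1.0 - c for c in row] for row in confidence]
labels = dbscan(distance, eps=0.4, min_pts=2)
```

Unlike k-means style methods, this approach does not need the number of entities in advance, which matters when the number of distinct people sharing a name is unknown.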

Experiments and Evaluation

The authors evaluate on the data of the ACL SemEval-2007 web person search task (WePS). They use the standard purity and inverse purity clustering metrics as in the WePS evaluation, and also the B-Cubed metric traditionally used in Within Document Coreference (WDC). They report a purity of 0.812, an inverse purity of 0.796 and an F score of 0.793. These results compare favorably with those of the first-tier systems in the WePS 2007 official evaluation, and with the authors' previous related work in Huang et al, ACL 2009: Profile Based Cross-Document Coreference Using Kernelized Fuzzy Relational Clustering. They also report a B-Cubed score of 0.775.
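
For readers unfamiliar with these metrics, the sketch below computes purity, inverse purity, their harmonic-mean F score, and B-Cubed precision and recall for a single toy clustering. The function names and the toy labels are hypothetical, and the sketch assumes each item belongs to exactly one cluster; it is not the official WePS scorer, which aggregates scores over many names.

```python
from collections import Counter
from typing import Dict, List, Tuple


def purity(system: List[int], gold: List[int]) -> float:
    """Fraction of items that fall in the majority gold class of their cluster."""
    by_cluster: Dict[int, Counter] = {}
    for s, g in zip(system, gold):
        by_cluster.setdefault(s, Counter())[g] += 1
    return sum(max(c.values()) for c in by_cluster.values()) / len(system)


def inverse_purity(system: List[int], gold: List[int]) -> float:
    """Purity with the roles of system clusters and gold classes swapped."""
    return purity(gold, system)


def f_score(p: float, r: float) -> float:
    """Harmonic mean of two scores (0.0 if either is zero)."""
    return 0.0 if p == 0.0 or r == 0.0 else 2.0 * p * r / (p + r)


def b_cubed(system: List[int], gold: List[int]) -> Tuple[float, float]:
    """Per-item B-Cubed precision and recall, averaged over all items."""
    n = len(system)
    precision = recall = 0.0
    for i in range(n):
        same_cluster = [j for j in range(n) if system[j] == system[i]]
        same_class = [j for j in range(n) if gold[j] == gold[i]]
        correct = sum(1 for j in same_cluster if gold[j] == gold[i])
        precision += correct / len(same_cluster)
        recall += correct / len(same_class)
    return precision / n, recall / n


# Toy example: five documents, system clusters vs. gold entities.
system_labels = [0, 0, 0, 1, 1]
gold_labels = [0, 0, 1, 1, 1]
p = purity(system_labels, gold_labels)
ip = inverse_purity(system_labels, gold_labels)
f = f_score(p, ip)
b3_precision, b3_recall = b_cubed(system_labels, gold_labels)
```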

Conclusion

The authors present a novel way to incorporate document-level categories and sentence-level context as features for the problem of Cross Document Coreference (CDC).

Relevant Papers