Cross Document Coreference (CDC)

From Cohen Courses

This is a technical problem related to one of the term projects in Information Extraction 10-707 in Fall 2010.

Cross Document Coreference (CDC) is the task of extracting the noun phrases from every document in a corpus and clustering them according to the real-world entity they refer to.

History

The problem was studied at the ACE 2008 evaluation. In addition, the WePS workshops study the closely related problem of person name disambiguation, where the task is to group web pages that mention the same name according to the person each page is actually about. Few datasets exist for cross-document coreference; they include the ACE 2008 corpus, the John Smith Corpus by Bagga and Baldwin, and the WePS workshop corpora.

Details

Each noun phrase in a document falls into one of three types of mentions:

  • Named Mentions (e.g. Barack Obama)
  • Nominal Mentions (e.g. The President)
  • Pronouns

General noun-phrase Within Document Coreference (WDC) groups these mentions into clusters so that each cluster contains all and only the mentions that refer to the same real-world entity. Cross-document coreference adds a layer of complexity: the clusters produced for different documents must in turn be resolved as describing the same real-world entity or not.
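The relationship between the two tasks can be illustrated with a small data-structure sketch. Everything here is hypothetical and chosen for illustration only: the `Mention` class, the document ids, the chain ids, and the hand-written `chain_to_entity` assignment, which stands in for whatever decision a real CDC system would make.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Mention:
    doc_id: str
    text: str
    kind: str  # "named", "nominal", or "pronoun"


# WDC output: one chain of mentions per entity, per document.
# Keys are hypothetical (doc_id, chain_index) pairs.
wdc_chains = {
    ("doc1", 0): [
        Mention("doc1", "Barack Obama", "named"),
        Mention("doc1", "The President", "nominal"),
        Mention("doc1", "he", "pronoun"),
    ],
    ("doc2", 0): [
        Mention("doc2", "Obama", "named"),
        Mention("doc2", "him", "pronoun"),
    ],
    ("doc2", 1): [Mention("doc2", "John Smith", "named")],
}

# CDC decision (assumed here, not computed): which corpus-level entity
# each within-document chain belongs to.
chain_to_entity = {("doc1", 0): "E1", ("doc2", 0): "E1", ("doc2", 1): "E2"}


def merge_chains(chains, assignment):
    """Merge within-document chains into cross-document entity clusters."""
    entities = defaultdict(list)
    for chain_id, mentions in chains.items():
        entities[assignment[chain_id]].extend(mentions)
    return dict(entities)


cdc_entities = merge_chains(wdc_chains, chain_to_entity)
```

In this sketch the hard part of CDC is hidden inside `chain_to_entity`; the approaches described on this page differ mainly in how they arrive at that assignment.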

Difficulties

Difficulties that arise in CDC but not in WDC include distinguishing between different entities that share a name, and grouping together clusters for a single entity that is referred to by different names in different documents (alias identification). Only one ACE evaluation (2008) has addressed this problem, and, unsurprisingly, CDC systems scored significantly worse than WDC systems.

Approaches

Published approaches include:

  • Bagga and Baldwin, 1998 introduced the first approach to this problem. They modified the standard vector space model from information retrieval to compute TFIDF-type features, used them to classify chains of mentions obtained from a WDC system as coreferent or not, and then clustered the chains according to the classifier's results.
  • Mann and Yarowsky, 2003 perform agglomerative clustering for cross-document person name disambiguation.
  • Fleischman et al., 2004 decide whether two descriptions (concepts) refer to the same entity (instance) by adapting a two-stage maxent classifier and clustering approach.
  • Huang et al., 2009 tackle the general problem of cross-document coreference by first using a within-document information extraction system called AeroText to get entities and their attributes, and then performing density clustering.
  • Baron et al., 2008 use a similar IE system by BBN called SERIF for within-document information extraction, and then perform a 3-step HAC clustering.
  • Mayfield, 2009 uses a rich feature set, performs pairwise classification with SVMs, and applies simple clustering thereafter.
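The general idea behind the vector space approach can be sketched as follows. This is a minimal illustration, not a reproduction of Bagga and Baldwin's system: the smoothed IDF weighting, the similarity threshold of 0.3, the single-link merging rule, and the toy context strings are all assumptions made for the example.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of token lists, one per chain context. Returns sparse dicts."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Smoothed IDF (an assumption): log((1 + n) / (1 + df)) + 1
        vectors.append(
            {t: c * (math.log((1 + n) / (1 + df[t])) + 1.0) for t, c in tf.items()}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def cluster_chains(vectors, threshold):
    """Link every pair with similarity >= threshold, then take connected groups."""
    parent = list(range(len(vectors)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if cosine(vectors[i], vectors[j]) >= threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(len(vectors)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Toy contexts for three "John Smith" chains from three documents (made up).
contexts = [
    "john smith chairman general motors auto industry".split(),
    "john smith general motors chief executive auto".split(),
    "john smith pitcher baseball season team".split(),
]
vectors = tfidf_vectors(contexts)
clusters = cluster_chains(vectors, threshold=0.3)
print(clusters)  # [[0, 1], [2]]
```

The shared name contributes little to the similarity because it appears in every context, so the first two chains are linked by their surrounding vocabulary while the third stays separate. The threshold would normally be tuned on held-out data rather than fixed by hand.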

Metrics

WDC systems are usually evaluated with coreference metrics such as CEAF. CDC systems are additionally evaluated with clustering-based metrics such as purity and Normalized Distributed Information Gain (NDMI).
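As an illustration of a clustering-based metric, purity can be computed as below. The sketch uses the standard definition (each predicted cluster is credited with its most frequent gold entity, and the credits are summed and divided by the number of items); the label values are made up for the example.

```python
from collections import Counter, defaultdict


def purity(predicted, gold):
    """predicted, gold: parallel lists giving a cluster label per mention chain."""
    if not predicted or len(predicted) != len(gold):
        raise ValueError("inputs must be non-empty and of equal length")
    by_cluster = defaultdict(Counter)
    for p, g in zip(predicted, gold):
        by_cluster[p][g] += 1
    # Each predicted cluster contributes the size of its majority gold class.
    majority_total = sum(max(counts.values()) for counts in by_cluster.values())
    return majority_total / len(predicted)


predicted = ["A", "A", "A", "B", "B"]  # system clusters (hypothetical)
gold = ["x", "x", "y", "y", "y"]       # true entities (hypothetical)
print(purity(predicted, gold))  # 0.8
```

Purity alone rewards over-splitting, since putting every item in its own cluster scores 1.0, which is one reason it is reported alongside other metrics.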

Applications

Cross document coreference is vital for applications such as entity tracking, person name disambiguation, and alias identification.

Toolkits

There are currently no freely available toolkits that perform cross-document coreference.

Relevant Papers