Within Document Coreference (WDC)

From Cohen Courses
Revision as of 18:34, 27 September 2010 by PastStudents

This is a technical problem related to one of the term projects in Information Extraction 10-707 in Fall 2010.

Within Document Coreference (WDC), also known as Coreference Resolution, is the task of extracting all the noun phrases in a document and clustering them according to the real-world entity they refer to.

History

This problem was introduced and extensively studied in the MUC (Message Understanding Conference) and ACE (Automatic Content Extraction) series of conferences and evaluations.

Details

Each noun phrase in a document can be categorized as one of three types of mention:

  • Named Mentions (e.g. Bill Gates)
  • Nominal Mentions (e.g. Chairman of Microsoft)
  • Pronouns

The task of general noun-phrase WDC involves grouping these mentions into clusters such that each cluster contains all and only those mentions that refer to the same real-world entity. A subset of this task, called Named Mention Coreference, extracts only those clusters that contain at least one named mention and discards nameless clusters (e.g. a cluster whose only mentions are "some company" and "it").
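The filtering step of Named Mention Coreference can be sketched as follows. This is a minimal illustration, not a reference implementation: the `Mention` class, its `kind` labels, and the function name `named_mention_clusters` are all hypothetical names introduced here, and the clusters are assumed to have been produced already by some WDC system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mention:
    """A noun phrase plus its mention type (hypothetical representation)."""
    text: str
    kind: str  # assumed labels: "named", "nominal", or "pronoun"


def named_mention_clusters(clusters):
    """Keep only the clusters that contain at least one named mention."""
    return [c for c in clusters if any(m.kind == "named" for m in c)]


clusters = [
    [
        Mention("Bill Gates", "named"),
        Mention("Chairman of Microsoft", "nominal"),
        Mention("he", "pronoun"),
    ],
    # A nameless cluster: no named mention, so it is discarded.
    [Mention("some company", "nominal"), Mention("it", "pronoun")],
]

kept = named_mention_clusters(clusters)
```

Here `kept` retains the first cluster (anchored by "Bill Gates") and drops the second.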

Difficulties

Unlike Named Entity Recognition, where state-of-the-art systems achieve F1 scores greater than 0.9, the state of the art for coreference resolution is still in the region of 0.5 to 0.6. This is because clustering nominal mentions and pronouns correctly requires in-depth knowledge of the semantics of the document.

Approaches

Current approaches to WDC treat it as a two-stage problem: 1) classifying pairs of mentions as coreferring or not, and 2) clustering the mentions according to the results of this classification. Both rule-based and machine-learning-based WDC systems exist.
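The two stages can be sketched as below. This is only an illustration under stated assumptions: the pairwise classifier `corefer` is a toy rule (matching the last token of each mention) standing in for a real rule-based or learned classifier, and the clustering stage is assumed to be a simple transitive closure over positive pairs, implemented with union-find. Real systems may use other clustering strategies.

```python
from itertools import combinations


def corefer(a, b):
    """Toy stage-1 classifier (an assumption): case-insensitive last-token match."""
    return a.lower().split()[-1] == b.lower().split()[-1]


def cluster_mentions(mentions, classifier=corefer):
    """Stage 2: merge every pair the classifier accepts (transitive closure)."""
    parent = list(range(len(mentions)))

    def find(i):
        # Follow parent links to the cluster representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(mentions)), 2):
        if classifier(mentions[i], mentions[j]):
            parent[find(j)] = find(i)

    groups = {}
    for i, mention in enumerate(mentions):
        groups.setdefault(find(i), []).append(mention)
    return list(groups.values())


mentions = ["Bill Gates", "Microsoft", "Mr. Gates", "Gates", "the company"]
result = cluster_mentions(mentions)
```

With this toy rule, the three "Gates" mentions end up in one cluster, while "Microsoft" and "the company" stay separate. Linking those two would need the kind of semantic knowledge that a surface-string rule cannot provide.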

Metrics

Two common metrics, B³ (B-cubed) and CEAF, are used to measure the performance of WDC systems. B³ computes precision and recall for each individual mention and combines their weighted sums into an F-measure, while CEAF scores the clusterings under the additional constraint that one true cluster can be mapped to at most one output cluster and vice versa.
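As a rough illustration of the per-mention metric, the sketch below follows the usual B-cubed formulation: precision and recall are computed for each mention from the overlap between its true cluster and its output cluster, averaged with equal weight per mention, and combined into F1. The function name `b_cubed` and the toy data are assumptions made here, and the sketch assumes both clusterings cover exactly the same set of mentions.

```python
def b_cubed(gold, system):
    """Return (precision, recall, f1), weighting every mention equally."""
    # Map each mention to the cluster that contains it in each clustering.
    gold_of = {m: frozenset(c) for c in gold for m in c}
    sys_of = {m: frozenset(c) for c in system for m in c}

    mentions = list(gold_of)
    precision = recall = 0.0
    for m in mentions:
        overlap = len(gold_of[m] & sys_of[m])
        precision += overlap / len(sys_of[m])
        recall += overlap / len(gold_of[m])

    precision /= len(mentions)
    recall /= len(mentions)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


gold = [["a", "b", "c"], ["d"]]      # true clusters
system = [["a", "b"], ["c", "d"]]    # output clusters
p, r, f = b_cubed(gold, system)
```

On this toy example the per-mention precisions are 1, 1, 1/2 and 1/2, giving a precision of 0.75. The per-mention recalls are 2/3, 2/3, 1/3 and 1, giving a recall of 2/3.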

Applications

Solving the problem of within document coreference is a vital first step for the more advanced problem of cross-document coreference, which in turn is vital for applications such as entity tracking, person name disambiguation, alias identification, etc.

Toolkits

Currently, the UIUC LBJ system and Johns Hopkins' BART system represent the state-of-the-art for within document coreference resolution.

Relevant Papers