Within Document Coreference (WDC)
This is a technical problem related to one of the term projects in Information Extraction 10-707 in Fall 2010.
Within Document Coreference (WDC), also known as Coreference Resolution, is the task of extracting all the noun phrases in a document, and clustering them according to the real-world entity that they refer to.
History
This problem was introduced and extensively studied at the MUC and ACE series of conferences and evaluations.
Details
The noun phrases in a document can be categorized into three types of mentions:
- Named Mentions (e.g. Bill Gates)
- Nominal Mentions (e.g. Chairman of Microsoft)
- Pronouns
The task of general noun-phrase WDC involves forming clusters of such mentions such that each cluster contains all and only those mentions that refer to the same real-world entity. A subset of this task, called Named Mention Coreference, deals only with extracting those clusters that contain at least one named mention, and throws away nameless clusters (e.g. "some company").
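The distinction between general WDC and Named Mention Coreference can be illustrated with a minimal sketch. The `Mention` class, its `kind` labels, and the `named_clusters` helper are hypothetical names chosen for this example and are not part of any existing toolkit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mention:
    text: str
    kind: str  # assumed labels: "named", "nominal", or "pronoun"


def named_clusters(clusters):
    """Keep only the clusters that contain at least one named mention."""
    return [c for c in clusters if any(m.kind == "named" for m in c)]


# Two hand-built clusters; the second one has no named mention.
clusters = [
    [
        Mention("Bill Gates", "named"),
        Mention("Chairman of Microsoft", "nominal"),
        Mention("he", "pronoun"),
    ],
    [
        Mention("some company", "nominal"),
        Mention("it", "pronoun"),
    ],
]

kept = named_clusters(clusters)
for cluster in kept:
    print([m.text for m in cluster])
```

Here the cluster around "some company" is dropped because none of its mentions is a named mention.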
Difficulties
Unlike Named Entity Recognition, where state-of-the-art systems achieve F1 scores greater than 0.9, the state of the art for coreference resolution is still in the region of 0.5 to 0.6. This is because clustering nominal mentions and pronouns correctly requires in-depth knowledge of the semantics of the document.
Approaches
Current approaches to WDC treat it as a two-stage problem: 1) classifying pairs of mentions as coreferring or not, and 2) clustering the mentions according to the results of this classification. Both rule-based and machine-learning-based WDC systems exist.
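The two-stage scheme can be sketched as follows. This is only an illustration under stated assumptions: the pairwise classifier `corefer` is a toy rule (matching last words) standing in for a real rule-based or learned model, and the clustering stage takes the transitive closure of the positive pairs with a union-find structure, which is one possible clustering strategy among several.

```python
from itertools import combinations


def corefer(a, b):
    """Toy pairwise classifier (an assumption for illustration only):
    two mentions corefer if their last words match, ignoring case."""
    return a.lower().split()[-1] == b.lower().split()[-1]


def cluster(mentions, classifier):
    """Stage 2: merge every pair the classifier accepts (transitive closure)."""
    parent = list(range(len(mentions)))

    def find(i):
        # Follow parent links to the cluster representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Stage 1: classify every pair of mentions.
    for i, j in combinations(range(len(mentions)), 2):
        if classifier(mentions[i], mentions[j]):
            parent[find(j)] = find(i)

    # Group mentions by their representative, preserving document order.
    groups = {}
    for i, mention in enumerate(mentions):
        groups.setdefault(find(i), []).append(mention)
    return list(groups.values())


mentions = ["Bill Gates", "Microsoft", "Mr. Gates", "the company", "Gates"]
result = cluster(mentions, corefer)
print(result)
```

With this toy rule, "Microsoft" and "the company" stay in separate clusters, which reflects the difficulty of nominal mentions: linking them needs semantic knowledge that simple string matching does not provide.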
Metrics
Two common metrics, B³ and CEAF, are used to measure the performance of coreference resolution systems. B³ computes precision and recall for each individual mention and combines their weighted sums into an F-measure, while CEAF also imposes the additional constraint that one true cluster can be mapped to at most one output cluster and vice versa.
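As an illustration of a mention-level metric, the sketch below computes B³ (B-cubed) in its usual formulation with uniform mention weights: per-mention precision and recall are averaged, and F1 is taken from the two averages. The function name `b_cubed` and the representation of clusters as sets of mention ids are assumptions made for this example, and the code assumes the true and output clusterings cover exactly the same mentions.

```python
def b_cubed(gold, system):
    """Return (precision, recall, f1) for two clusterings.

    gold, system: lists of sets of mention ids. Both are assumed to
    partition the same set of mentions.
    """
    gold_of = {m: c for c in gold for m in c}    # mention -> its true cluster
    sys_of = {m: c for c in system for m in c}   # mention -> its output cluster
    mentions = list(gold_of)
    n = len(mentions)

    # Per-mention precision and recall, averaged with uniform weights.
    precision = sum(len(gold_of[m] & sys_of[m]) / len(sys_of[m]) for m in mentions) / n
    recall = sum(len(gold_of[m] & sys_of[m]) / len(gold_of[m]) for m in mentions) / n
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


gold = [{1, 2, 3}, {4, 5}]
system = [{1, 2}, {3, 4, 5}]
p, r, f1 = b_cubed(gold, system)
print(round(p, 4), round(r, 4), round(f1, 4))
```

CEAF is not shown here; it additionally requires finding the best one-to-one alignment between true and output clusters before scoring.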
Applications
Solving the problem of within document coreference is a vital first step for the more advanced problem of cross-document coreference, which in turn is vital for applications such as entity tracking, person name disambiguation, alias identification, etc.
Toolkits
Currently, the UIUC LBJ system and Johns Hopkins' BART system represent the state-of-the-art for within document coreference resolution.