Within Document Coreference (WDC)

From Cohen Courses

This is a technical problem related to one of the term projects in Information Extraction 10-707 in Fall 2010.

Within Document Coreference (WDC), also known as Coreference Resolution, is the task of extracting all the noun phrases in a document and clustering them according to the real-world entity they refer to.

History

This problem was introduced and extensively studied at the MUC (MUC-6, MUC-7) and ACE (ACE2, ACE 2004, ACE 2005, ACE 2007, ACE 2008) series of conferences and evaluations.

Details

The noun phrases in a document can be categorized into three types of mentions:

  • Named Mentions (e.g. Bill Gates)
  • Nominal Mentions (e.g. Chairman of Microsoft)
  • Pronouns

The task of general noun-phrase WDC involves grouping such mentions into clusters so that each cluster contains all and only those mentions that refer to the same real-world entity. A subset of this task, called Named Mention Coreference, extracts only those clusters that contain at least one named mention and discards nameless clusters (e.g. "some company").
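The distinction between general WDC output and Named Mention Coreference can be sketched in a few lines of Python. This is only an illustration: the names MentionType, Mention and named_clusters are hypothetical and do not come from any particular toolkit, and the example clusters are written by hand rather than produced by a system.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class MentionType(Enum):
    NAMED = "named"      # e.g. Bill Gates
    NOMINAL = "nominal"  # e.g. Chairman of Microsoft
    PRONOUN = "pronoun"  # e.g. he


@dataclass(frozen=True)
class Mention:
    text: str
    type: MentionType


def named_clusters(clusters: List[List[Mention]]) -> List[List[Mention]]:
    """Keep only clusters that contain at least one named mention."""
    return [
        cluster
        for cluster in clusters
        if any(m.type is MentionType.NAMED for m in cluster)
    ]


# Hand-built WDC output: one cluster per real-world entity.
all_clusters = [
    [
        Mention("Bill Gates", MentionType.NAMED),
        Mention("Chairman of Microsoft", MentionType.NOMINAL),
        Mention("he", MentionType.PRONOUN),
    ],
    [
        Mention("some company", MentionType.NOMINAL),
        Mention("it", MentionType.PRONOUN),
    ],
]

# The nameless second cluster is discarded.
kept = named_clusters(all_clusters)
```

Here kept holds only the Bill Gates cluster, since the second cluster has no named mention.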

Difficulties

Unlike Named Entity Recognition, where state-of-the-art systems achieve F1 scores greater than 0.9, the state of the art for coreference resolution is still in the region of 0.5 to 0.6. The gap arises because nominal mentions and pronouns must be extracted in addition to named entities, and because clustering mentions into entities requires detailed knowledge of the syntactic structure and semantics of the document.

Approaches

Current state-of-the-art approaches include both rule-based and machine learning algorithms. The rule-based approaches apply inductive logic programming, which combines rules for coreference resolution in a logic induction framework. Other researchers use Markov logic networks with a probabilistic version of logic induction (Culotta et al., 2007). Many researchers have explored machine learning approaches that treat the problem either as pairwise binary classification followed by entity clustering, or as a joint model of classification and clustering (Soon et al., 2001; Ng & Cardie, 2002; Ng, 2005; Haghighi & Klein, 2007; Ng, 2008; Finkel & Manning, 2008). The most recent work (Haghighi & Klein, 2009) focuses on feature analysis with a simple model for coreference resolution. With its rich set of syntactic and semantic features, it is reported to outperform the current state-of-the-art systems.
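The pairwise formulation (classify mention pairs, then cluster) can be illustrated with a minimal Python sketch. It is not the method of any of the cited papers: link_score is a hypothetical stand-in for a trained pairwise classifier and uses plain token overlap, the threshold of 0.3 is arbitrary, and the clustering step is a simple transitive closure over positive links implemented with union-find.

```python
from itertools import combinations
from typing import Dict, List


def link_score(a: str, b: str) -> float:
    """Hypothetical pairwise scorer: Jaccard overlap of lowercased tokens.

    A real system would use a trained classifier over syntactic and
    semantic features instead.
    """
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def cluster_mentions(mentions: List[str], threshold: float = 0.3) -> List[List[str]]:
    """Link every pair scoring at or above threshold, then take the
    transitive closure of the links (union-find)."""
    parent = list(range(len(mentions)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(mentions)), 2):
        if link_score(mentions[i], mentions[j]) >= threshold:
            parent[find(j)] = find(i)

    clusters: Dict[int, List[str]] = {}
    for i, mention in enumerate(mentions):
        clusters.setdefault(find(i), []).append(mention)
    return list(clusters.values())


example = cluster_mentions(
    ["Bill Gates", "Gates", "Microsoft", "Mr. Gates", "the company"]
)
```

With this toy scorer the three Gates mentions end up in one cluster, while "Microsoft" and "the company" stay apart. That shows why surface overlap alone is not enough for nominal mentions and why richer features matter.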

Metrics

Two common metrics, B³ (B-cubed) and CEAF, are used to measure the performance of coreference resolution systems. B³ computes precision and recall for each individual mention, takes their weighted sum over all mentions, and combines the totals into an F-measure. CEAF also compares true and output clusters, but imposes the additional constraint that one true cluster can be mapped to at most one output cluster and vice versa.
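As a rough illustration of how the two metrics differ, the Python sketch below scores a toy output against a toy gold clustering. It makes several assumptions that are not part of the text above: both clusterings cover exactly the same mentions, every mention gets equal weight in the per-mention metric (B-cubed, B³), CEAF is taken in its mention-based form where the similarity of two clusters is the size of their overlap, and the best one-to-one mapping is found by brute force, which is only practical for a handful of clusters. The function names are hypothetical.

```python
from itertools import permutations
from typing import Hashable, List, Set, Tuple

Clustering = List[Set[Hashable]]


def _f1(p: float, r: float) -> float:
    return 2 * p * r / (p + r) if p + r else 0.0


def b_cubed(gold: Clustering, system: Clustering) -> Tuple[float, float, float]:
    """Per-mention precision and recall, averaged with equal weights."""
    gold_of = {m: c for c in gold for m in c}
    sys_of = {m: c for c in system for m in c}
    mentions = list(gold_of)  # assumes system covers the same mentions
    p = sum(len(gold_of[m] & sys_of[m]) / len(sys_of[m]) for m in mentions) / len(mentions)
    r = sum(len(gold_of[m] & sys_of[m]) / len(gold_of[m]) for m in mentions) / len(mentions)
    return p, r, _f1(p, r)


def ceaf_m(gold: Clustering, system: Clustering) -> Tuple[float, float, float]:
    """Mention-based CEAF: overlap under the best one-to-one cluster mapping."""
    small, large = (gold, system) if len(gold) <= len(system) else (system, gold)
    # Try every injective mapping from the smaller clustering into the larger.
    best = max(
        sum(len(c & large[j]) for c, j in zip(small, perm))
        for perm in permutations(range(len(large)), len(small))
    )
    p = best / sum(len(c) for c in system)
    r = best / sum(len(c) for c in gold)
    return p, r, _f1(p, r)


gold = [{"a", "b", "c"}, {"d", "e"}]
system = [{"a", "b"}, {"c", "d", "e"}]

b3_scores = b_cubed(gold, system)
ceaf_scores = ceaf_m(gold, system)
```

On this example the per-mention metric gives credit to mention c for being correctly grouped with itself, whereas the one-to-one mapping counts c as an error, so the two scores differ.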

Applications

Solving the problem of within document coreference is a vital first step for the more advanced problem of Cross Document Coreference (CDC), which in turn is vital for applications such as entity tracking, person name disambiguation, alias identification, etc.

Toolkits

Currently, the UIUC LBJ system and Johns Hopkins' BART system represent the state of the art for within document coreference resolution. The UIUC system uses rich features for coreference resolution (Bengtson & Roth, 2008) and is distributed as a Learning-Based Java (LBJ) coreference package. The BART system comes from the Johns Hopkins University summer workshop on using lexical and encyclopedic knowledge for entity disambiguation (Versley et al., 2008).

Relevant Papers