Within Document Coreference (WDC)
Revision as of 01:07, 28 September 2010
This is a technical problem related to one of the term projects in Information Extraction 10-707 in Fall 2010.
Within Document Coreference (WDC), also known as Coreference Resolution, is the task of extracting all the noun phrases in a document, and clustering them according to the real-world entity that they refer to.
History
This problem was introduced and extensively studied at the MUC (MUC-6, MUC-7) and ACE (ACE2, ACE 2004, ACE 2005, ACE 2007, ACE 2008) series of conferences and evaluations.
Details
The noun phrases in a document fall into one of three types of mentions:
- Named Mentions (e.g. Bill Gates)
- Nominal Mentions (e.g. Chairman of Microsoft)
- Pronouns
The task of general noun-phrase WDC involves grouping these mentions into clusters such that each cluster contains all and only those mentions that refer to the same real-world entity. A subset of this task, called Named Mention Coreference, extracts only those clusters that contain at least one named mention and discards nameless clusters (e.g. "some company").
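The cluster representation and the Named Mention Coreference filter described above can be sketched in a few lines of Python. This is an illustrative sketch only: the `Mention` class, its `kind` labels, and the `named_clusters` helper are hypothetical names introduced here, not part of any toolkit mentioned on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mention:
    """A noun phrase from the document (hypothetical structure)."""
    text: str
    kind: str  # assumed labels: "named", "nominal", or "pronoun"


def named_clusters(clusters):
    """Keep only clusters that contain at least one named mention."""
    return [c for c in clusters if any(m.kind == "named" for m in c)]


# Each inner list is one entity: all mentions that refer to it.
clusters = [
    [
        Mention("Bill Gates", "named"),
        Mention("Chairman of Microsoft", "nominal"),
        Mention("he", "pronoun"),
    ],
    [
        Mention("some company", "nominal"),
        Mention("it", "pronoun"),
    ],
]

kept = named_clusters(clusters)
for cluster in kept:
    print([m.text for m in cluster])
```

Here the second cluster is dropped because none of its mentions is a named mention.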
Difficulties
Unlike Named Entity Recognition, where state-of-the-art systems achieve F1 scores above 0.9, the state of the art for coreference resolution is still in the region of 0.5 to 0.6. This is because of the difficulties associated with extracting nominal mentions and pronouns in addition to named entities, and also because clustering mentions into entities requires detailed knowledge of the syntactic structure and semantics of the document.
Approaches
Current state-of-the-art approaches include both rule-based and machine learning algorithms. The rule-based approaches apply inductive logic programming, which combines rules for co-reference resolution in a logic induction framework. Other researchers use Markov logic networks with a probabilistic version of logic induction (Culotta et al., 2007). Many researchers have explored machine learning approaches that treat the problem either as pairwise binary classification followed by entity clustering, or as a joint model of classification and clustering (Soon et al., 2001; Ng & Cardie, 2002; Ng, 2005; Haghighi & Klein, 2007; Ng, 2008; Finkel & Manning, 2008). The most recent work (Haghighi & Klein, 2009) focuses on feature analysis with a simple model for co-reference resolution. With its rich set of syntactic and semantic features, it is reported to outperform the current state-of-the-art systems.
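The pairwise-classification-then-clustering idea can be illustrated with a small Python sketch. It is not the method of any cited paper: `toy_coreferent` is a hypothetical stand-in for a trained pairwise classifier (it only checks for a shared word), and the clustering step assumed here is a simple transitive closure over positive pairs, implemented with union-find.

```python
from itertools import combinations


def toy_coreferent(a, b):
    """Hypothetical pairwise decision: true if the mentions share a word.

    A real system would use a trained classifier over syntactic and
    semantic features instead of this string heuristic.
    """
    return bool(set(a.lower().split()) & set(b.lower().split()))


def cluster_mentions(mentions, is_coreferent):
    """Merge every pair judged coreferent, then read off the clusters."""
    parent = list(range(len(mentions)))

    def find(i):
        # Follow parent links to the root, halving the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(mentions)), 2):
        if is_coreferent(mentions[i], mentions[j]):
            parent[find(i)] = find(j)

    groups = {}
    for i, mention in enumerate(mentions):
        groups.setdefault(find(i), []).append(mention)
    return list(groups.values())


mentions = ["Bill Gates", "Gates", "Microsoft", "Mr. Gates", "the company"]
clusters = cluster_mentions(mentions, toy_coreferent)
for cluster in clusters:
    print(cluster)
```

With this toy heuristic the three "Gates" mentions end up in one cluster, while "Microsoft" and "the company" stay apart. That is exactly the kind of nominal link that requires the richer features discussed above.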
Metrics
Two common metrics, B³ (B-cubed) and CEAF, are used to measure the performance of coreference resolution systems. B³ computes precision and recall for each individual mention and combines them as a weighted sum over all mentions, from which an F-measure is derived, while CEAF additionally imposes the constraint that one true cluster can be mapped to at most one output cluster and vice versa.
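As a rough illustration of the per-mention scoring, the following Python sketch computes B³ (B-cubed) as it is commonly formulated: precision and recall are computed for each mention from the overlap of its true and output clusters, then averaged. Uniform mention weights and identical mention sets in both clusterings are assumptions made here for simplicity, and the function name `b_cubed` is hypothetical.

```python
def b_cubed(gold, system):
    """Return (precision, recall, F1) for two clusterings.

    gold, system: lists of sets of mention ids. Both are assumed to
    cover exactly the same mentions, with every mention in one cluster.
    """
    gold_of = {m: cluster for cluster in gold for m in cluster}
    sys_of = {m: cluster for cluster in system for m in cluster}
    mentions = list(gold_of)

    # Per-mention scores, averaged with uniform weights (an assumption).
    precision = sum(
        len(gold_of[m] & sys_of[m]) / len(sys_of[m]) for m in mentions
    ) / len(mentions)
    recall = sum(
        len(gold_of[m] & sys_of[m]) / len(gold_of[m]) for m in mentions
    ) / len(mentions)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


gold = [{"a", "b", "c"}, {"d", "e"}]
system = [{"a", "b"}, {"c", "d", "e"}]
p, r, f = b_cubed(gold, system)
print(f"P={p:.3f} R={r:.3f} F1={f:.3f}")
```

CEAF is not covered by this sketch; it would additionally require finding a one-to-one alignment between true and output clusters.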
Applications
Solving the problem of within document coreference is a vital first step towards the more advanced problem of Cross Document Coreference (CDC), which in turn is vital for applications such as entity tracking, person name disambiguation, and alias identification.
Toolkits
Currently, the UIUC LBJ system and Johns Hopkins' BART system represent the state-of-the-art for within document coreference resolution. The UIUC system uses rich features for co-reference resolution (Bengtson & Roth, 2008) and is distributed as a Learning-Based Java (LBJ) co-reference package. The BART system is from the Johns Hopkins University summer workshop on using lexical and encyclopedic knowledge for entity disambiguation (Versley et al., 2008).