Within Document Coreference (WDC)

From Cohen Courses
Revision as of 19:26, 27 September 2010 by PastStudents

This is a technical problem related to one of the term projects in Information Extraction 10-707 in Fall 2010.

Within Document Coreference (WDC), also known as Coreference Resolution, is the task of extracting all the noun phrases in a document and clustering them according to the real-world entity they refer to.

History

This problem was introduced and extensively studied in the MUC (Message Understanding Conference) and ACE (Automatic Content Extraction) series of conferences and evaluations.

Details

The noun phrases in a document can be categorized into three types of mentions:

  • Named Mentions (e.g. Bill Gates)
  • Nominal Mentions (e.g. Chairman of Microsoft)
  • Pronouns (e.g. he)

The task of general noun-phrase WDC involves grouping these mentions into clusters such that each cluster contains all and only those mentions that refer to the same real-world entity. A subset of this task, called Named Mention Coreference, extracts only those clusters that contain at least one named mention and discards nameless clusters (e.g. a cluster whose only mentions are "some company" and pronouns referring to it).
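As a rough illustration of the clustering output and the Named Mention Coreference filter described above, the following is a minimal Python sketch. The `MentionType`, `Mention`, and `named_clusters` names and the toy clusters are hypothetical, introduced here only for illustration; they are not taken from any particular toolkit, and a real system would also have to find the mentions and build the clusters in the first place.

```python
from dataclasses import dataclass
from enum import Enum


class MentionType(Enum):
    """The three mention types listed above."""
    NAMED = "named"
    NOMINAL = "nominal"
    PRONOUN = "pronoun"


@dataclass(frozen=True)
class Mention:
    text: str
    kind: MentionType


def named_clusters(clusters):
    """Keep only the clusters that contain at least one named mention."""
    return [
        cluster for cluster in clusters
        if any(m.kind is MentionType.NAMED for m in cluster)
    ]


# Hypothetical clustering of a toy document: one entity with a name,
# and one "nameless" entity that is only ever described or referred to.
clusters = [
    [
        Mention("Bill Gates", MentionType.NAMED),
        Mention("Chairman of Microsoft", MentionType.NOMINAL),
        Mention("he", MentionType.PRONOUN),
    ],
    [
        Mention("some company", MentionType.NOMINAL),
        Mention("it", MentionType.PRONOUN),
    ],
]

kept = named_clusters(clusters)
print([[m.text for m in cluster] for cluster in kept])
# prints [['Bill Gates', 'Chairman of Microsoft', 'he']]
```

Note that the nominal and pronoun mentions inside a kept cluster are retained; only clusters with no named mention at all are dropped.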

Difficulties

Unlike Named Entity Recognition, where state-of-the-art systems achieve F1 scores above 0.9, the state of the art for coreference resolution is still in the region of 0.5 to 0.6. The gap arises because clustering nominal mentions and pronouns correctly requires in-depth knowledge of the semantics of the document.
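The F1 scores quoted above are harmonic means of precision and recall, but this page does not say which coreference scoring scheme they come from. The sketch below assumes B-cubed scoring, one commonly used scheme, purely as an example of how a predicted clustering might be compared against a gold one. The function name `b_cubed` and the toy mention ids are hypothetical, and the sketch assumes the gold and predicted clusterings cover exactly the same mentions.

```python
def b_cubed(key, response):
    """Return (precision, recall, F1) under B-cubed scoring.

    key:      gold clustering, a list of sets of mention ids
    response: predicted clustering, a list of sets of mention ids
    Assumption: both clusterings cover exactly the same mentions.
    """
    # Map each mention to the cluster that contains it.
    key_of = {m: cluster for cluster in key for m in cluster}
    resp_of = {m: cluster for cluster in response for m in cluster}
    mentions = list(key_of)

    # Per-mention precision: share of the predicted cluster that is correct.
    precision = sum(
        len(key_of[m] & resp_of[m]) / len(resp_of[m]) for m in mentions
    ) / len(mentions)
    # Per-mention recall: share of the gold cluster that was recovered.
    recall = sum(
        len(key_of[m] & resp_of[m]) / len(key_of[m]) for m in mentions
    ) / len(mentions)

    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy example: mention 3 is wrongly attached to the second entity.
gold = [{1, 2, 3}, {4, 5}]
predicted = [{1, 2}, {3, 4, 5}]

p, r, f1 = b_cubed(gold, predicted)
print(round(p, 3), round(r, 3), round(f1, 3))
# prints 0.733 0.733 0.733
```

A single misplaced mention out of five already lowers the score to about 0.73, which gives some intuition for why errors on nominal mentions and pronouns weigh so heavily on the overall result.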

Applications

Solving within-document coreference is a vital first step toward the more advanced problem of cross-document coreference, which in turn underpins applications such as entity tracking, person name disambiguation, and alias identification.

Toolkits

As of 2010, the UIUC LBJ system and Johns Hopkins' BART system represent the state of the art for within-document coreference resolution.

Relevant Papers