Rahman and Ng, ACL 2011
Citation
Rahman, Altaf and Ng, Vincent. 2011. Coreference Resolution with World Knowledge. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics, 814-824.
Online Version
http://www.hlt.utdallas.edu/~vince/papers/acl11.html
Summary
This paper addresses the problem of coreference resolution. The premise is that coreference systems have typically relied on linguistic knowledge encoded into their algorithms, while largely ignoring world knowledge, which the authors argue is important for determining the antecedent of an anaphoric noun phrase. The authors address this gap in the literature by augmenting existing coreference resolution systems with features derived from several sources of world knowledge.
Methods
They use three main sources of world knowledge. The first is knowledge bases, specifically YAGO and FrameNet. From YAGO they extract ISA relationships (such as "Albert Einstein is a physicist") and MEANS relationships (such as "Einstein means Albert Einstein" and "Einstein means Alfred Einstein"). When building a classifier to select among possible antecedents, one of the features they use is whether the pair of noun phrases under consideration appears in one of these two relationships in YAGO. From FrameNet they get information about verbs with similar meanings: if two noun phrases appear as the same argument of related verbs, they are more likely to be coreferent.
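The YAGO feature described above can be sketched as a simple lookup. This is a minimal illustration, not the authors' implementation: the two fact sets are tiny hand-written stand-ins for YAGO, and the names `ISA_FACTS`, `MEANS_FACTS`, `normalize`, and `in_yago_relation` are hypothetical.

```python
# Hypothetical miniature stand-ins for YAGO facts; the real knowledge base
# is far larger and is not queried this way in the paper.
ISA_FACTS = {("albert einstein", "physicist")}
MEANS_FACTS = {
    ("einstein", "albert einstein"),
    ("einstein", "alfred einstein"),
}


def normalize(noun_phrase):
    """Lowercase and collapse whitespace so lookups are consistent."""
    return " ".join(noun_phrase.lower().split())


def in_yago_relation(np1, np2):
    """Return True if the pair appears, in either order, in an ISA or MEANS fact."""
    a, b = normalize(np1), normalize(np2)
    for facts in (ISA_FACTS, MEANS_FACTS):
        if (a, b) in facts or (b, a) in facts:
            return True
    return False


print(in_yago_relation("Albert Einstein", "physicist"))  # True
print(in_yago_relation("Einstein", "physicist"))         # False
```

The boolean would be one feature among many in the antecedent classifier; how the real system matches noun phrases against YAGO entries is not covered by this sketch.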
The second source of world knowledge is annotated data. The assumption is that human annotators used world knowledge to resolve coreference when they annotated the data, so features learned from the annotations should approximate "world knowledge." The features they extract consist of pairs of words seen as coreferent in the training data. For example, if "Barack Obama" was seen as coreferent with "president" in the training data, then those two noun phrases are more likely to be classified as coreferent when they appear in the test data. They also treat unseen words specially, artificially adding unseen words to the training data so that the classifier can handle them better (apparently the test data contained many unseen noun phrases).
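A rough sketch of the noun-pair idea follows. It assumes the training data is available as coreference chains (lists of noun phrase strings), and the function names are hypothetical. It does not model the special treatment of unseen words.

```python
from collections import Counter


def collect_coreferent_pairs(chains):
    """Count unordered noun-phrase pairs that share a coreference chain.

    `chains` is assumed to be a list of chains, each a list of strings.
    """
    pairs = Counter()
    for chain in chains:
        phrases = sorted({phrase.lower() for phrase in chain})
        for i, first in enumerate(phrases):
            for second in phrases[i + 1:]:
                pairs[(first, second)] += 1
    return pairs


def seen_coreferent(pairs, np1, np2):
    """Feature: was this pair annotated as coreferent in the training data?"""
    key = tuple(sorted((np1.lower(), np2.lower())))
    return pairs[key] > 0


training_chains = [
    ["Barack Obama", "president", "Obama"],
    ["the company", "Microsoft"],
]
pairs = collect_coreferent_pairs(training_chains)
print(seen_coreferent(pairs, "president", "Barack Obama"))  # True
print(seen_coreferent(pairs, "president", "Microsoft"))     # False
```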
The last source of world knowledge is large amounts of unlabeled data. They used a database of coreferent noun phrases that were extracted syntactically from a large body of text, mostly through appositives (such as "Barack Obama, the president of the United States, said...").
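To illustrate the appositive pattern, here is a deliberately crude sketch. It uses a regular expression rather than a syntactic parse, so it is a swapped-in simplification and not the extraction method behind the database the authors used. The pattern and the function name `extract_appositive_pairs` are hypothetical.

```python
import re

# Crude pattern: a capitalized name, a comma, a phrase starting with "the",
# and a closing comma. Real appositive extraction would rely on a parser.
APPOSITIVE = re.compile(r"((?:[A-Z][a-z]+ )*[A-Z][a-z]+), (the [^,]+),")


def extract_appositive_pairs(text):
    """Return (name, description) pairs matched by the crude pattern."""
    return APPOSITIVE.findall(text)


sentence = "Barack Obama, the president of the United States, said hello."
print(extract_appositive_pairs(sentence))
# [('Barack Obama', 'the president of the United States')]
```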
They used two baseline algorithms for performing the actual coreference resolution. One was a mention-pair classifier. It starts at the current noun phrase and looks backward through the preceding noun phrases, classifying each one as an antecedent or not, until it either labels a noun phrase as the antecedent or reaches the beginning of the document.
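The backward search can be sketched as follows. The pairwise classifier is passed in as a function; the `same_last_token` rule used in the example is a hypothetical toy stand-in for a learned classifier.

```python
def resolve_mention_pair(mentions, is_coreferent):
    """For each mention, return the index of its chosen antecedent or None.

    Scans backward from each mention and stops at the first preceding
    mention that the pairwise classifier accepts.
    """
    antecedents = []
    for j in range(len(mentions)):
        chosen = None
        for i in range(j - 1, -1, -1):  # nearest preceding mention first
            if is_coreferent(mentions[i], mentions[j]):
                chosen = i
                break
        antecedents.append(chosen)
    return antecedents


def same_last_token(candidate, mention):
    """Toy classifier (an assumption): coreferent if the last words match."""
    return candidate.lower().split()[-1] == mention.lower().split()[-1]


mentions = ["Barack Obama", "the senator", "Obama", "Mr. Obama"]
print(resolve_mention_pair(mentions, same_last_token))
# [None, None, 0, 2]
```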
The second method improves on the simple mention-pair classifier by building clusters of mentions as it goes. Instead of classifying mention pairs individually, it ranks all of the clusters built so far in the document, along with a new cluster containing only the current noun phrase, and assigns the noun phrase to the best-matching cluster according to the classifier.
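A minimal sketch of this ranking loop is shown below. The two scoring functions are hypothetical, hand-written stand-ins for a learned ranker: one scores an existing cluster against the current mention, the other scores the option of starting a new cluster.

```python
def resolve_cluster_ranking(mentions, score_existing, score_new):
    """Assign each mention to the best-scoring existing cluster or a new one."""
    clusters = []
    for mention in mentions:
        # The new singleton cluster is the default candidate to beat.
        best_cluster, best_score = None, score_new(mention)
        for cluster in clusters:
            score = score_existing(cluster, mention)
            if score > best_score:
                best_cluster, best_score = cluster, score
        if best_cluster is None:
            clusters.append([mention])
        else:
            best_cluster.append(mention)
    return clusters


def last_token_overlap(cluster, mention):
    """Toy score (an assumption): share of cluster members with the same last word."""
    last = mention.lower().split()[-1]
    matches = sum(1 for member in cluster if member.lower().split()[-1] == last)
    return matches / len(cluster)


def constant_new_score(mention):
    """Toy score (an assumption) for starting a new cluster."""
    return 0.5


mentions = ["Barack Obama", "the senator", "Obama", "Mr. Obama"]
print(resolve_cluster_ranking(mentions, last_token_overlap, constant_new_score))
# [['Barack Obama', 'Obama', 'Mr. Obama'], ['the senator']]
```

Unlike the mention-pair loop, each decision here can draw on every mention already in a cluster, which is the motivation for ranking clusters rather than individual pairs.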
Data sets
They used two datasets in their experiments: the ACE 2004/2005 dataset and the OntoNotes-2 dataset. Because the two datasets are annotated differently, they used only the documents common to both, which let them evaluate with both sets of annotations on the same data.
Experimental Results
They show that adding features from world knowledge improves the performance of both methods on the coreference resolution problem. The features were additive: using all of them was better than using any single set of features, and every set improved performance at least slightly. The features that helped the most when added individually were the YAGO features and the noun-pair features obtained from labeled coreference data.