Singh et al., ACL 2011
Sameer Signh, Armarnag Subramanya, Fernando Pereira, and Andrew McCallum. Large-Scale Cross-Document Coreference Using Distributed Inference and Hierarchical Models. ACL 2011.
This Paper paper deals with coreference resolution, but instead of only dealing with one document at a time, they want to model coreference relations across an entire corpus (so this is actually dealing with Cross Document Coreference (CDC)).
Coreference resolution amounts to partitioning a set of "mentions" in a document (typically noun phrases, but also possessive and other kinds of adjectives). The number of partitions of the elements of a set is enormously large (the Bell number, which is worse than exponential). When you get to cross-document coreference on large corpora, the number of mentions can be in the millions or billions, and thus an exact search over partitions of them is completely intractable.
To find a good answer in this combinatorial space, the authors start with a good guess at a partition and then use MCMC to sample one mention at a time and find a good approximate solution. That's not very surprising or interesting by itself; the interesting part is the set of tricks they use to make MCMC work better on really large collections of documents (and this is work done while Singh was a Google intern - we're talking about really large sets of documents).
Their model is essentially a really large factor graph over all of the mentions in the corpus; every mention has two factors connecting it to every other mention, one of which falls out of the equations (if the mentions are coreferent, one of them fires, if they aren't, the other one does). All they say about those factors is that it's a cosine similarity function over the mention context pairs. This model would be really hairy, but they use Metropolis-Hastings sampling, and all they have to compute is a likelihood ratio between the current state and the proposed state. In that equation all factors except those between mentions in the old and new entity cancel each other out.
Their proceeds as follows. At every step of the algorithm, pick a mention. Draw a new entity for that mention from a proposal distribution (they start out with this as a uniform distribution). Calculate the likelihood ratio and do the typical Metropolis-Hastings accept or reject (they throw some annealing in, as well). Keep this up for long enough, and you're done.
The problem with this naive system is that the proposal distribution almost never actually gives you a good proposal, so it has to run for a really long time. In order to fix it they introduce two kinds of latent variables. The first are sub-entity variables that collect similar mentions to each other (these could, for instance, collect all of the coreferent mentions in a single document). These variables can then be sampled to move the entire collection from one entity to another at a time, similar (in desired outcome, at least) to Percy Liang's type-based MCMC paper. The second kind of variables are super-entity variables that group similar entities together onto a single machine, so that proposed swappings are more reasonable and more easily calculated. They resample assignments to these latent variables as part of their inference process.
There isn't really a good dataset for this kind of task, so they created one. They crawled the web and found pages that linked to Wikipedia, and they considered all mentions that link to the same Wikipedia article to be coreferent. They did some filtering of the links they founds to try to improve the accuracy of their data a bit, as it was understandably noisy. When they ran their (unsupervised) system on the data, they got a score that was really pretty high. I wonder, though, if that has to do with how they cleaned the links to Wikipedia - the way the guaranteed high precision in their testing labels was to throw away the hard cases, essentially, so of course their performances on those mentions should be relatively high. Still, it's pretty impressive to have such high performance on such a difficult problem on a really large dataset.