Barzilay and Elhadad, 2003
Citation
Regina Barzilay and Noemie Elhadad. Sentence Alignment for Monolingual Comparable Corpora. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing. Online Link
Summary
This paper studies the problem of aligning documents at the sentence level when they cover the same topic or describe the same event but were written independently. This is a common situation in newswire text, for instance, where a variety of sources report on a single story, each writing its account separately. This is not a one-to-one mapping problem, as a topic may be only briefly mentioned in one document but detailed extensively in another.
The approach used here has multiple components: first clustering paragraphs within each corpus, then aligning documents at the paragraph level (essentially marking candidate sentence-sentence pairs), and finally performing sentence-level alignment.
Motivation
Multi-document summarization is the primary application of this work. Knowing that multiple sentences across documents describe the same event is a helpful step for generating an extractive summary.
The approach detailed here is designed to find "topics" within a single corpus. However, the term "topic" here differs from its common usage in NLP and is more closely related to "function". For instance, in a medical corpus, the "topics" that the authors wish to detect would be "symptoms", "treatment", etc.
Algorithms
The algorithm takes as training data two comparable corpora describing the same sets of events, already aligned at the sentence level. It then proceeds in four steps:
- First, paragraphs are clustered within each corpus's training data independently. The output of this step is a cluster assignment for each paragraph in each training document.
- Next, mappings from clusters in corpus 1 to clusters in corpus 2 are computed from training data.
- At test time, paragraphs are passed through the first two steps, so that each paragraph in corpus 1 is assigned a cluster label and mapped to a set of candidate paragraphs in corpus 2.
- Finally, for each candidate paragraph pair, every possible sentence pair is scored independently using similarity metrics.
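The last three steps can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes the clustering step has already assigned a label to each paragraph, and it uses bag-of-words cosine similarity with a fixed threshold as a stand-in for the paper's similarity metrics. The function names, the `min_count` and `threshold` parameters, and the cluster labels are all hypothetical.

```python
from collections import Counter, defaultdict
from math import sqrt


def cosine(a, b):
    """Bag-of-words cosine similarity between two sentences (an assumed metric)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = sqrt(sum(c * c for c in va.values())) * sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0


def learn_cluster_mapping(aligned_cluster_pairs, min_count=1):
    """Map each corpus-1 cluster to the corpus-2 clusters it co-occurs with.

    aligned_cluster_pairs: iterable of (cluster1, cluster2) labels, one per
    pair of paragraphs that contain aligned sentences in the training data.
    min_count is a hypothetical cutoff for how often a pair must be seen.
    """
    counts = Counter(aligned_cluster_pairs)
    mapping = defaultdict(set)
    for (c1, c2), n in counts.items():
        if n >= min_count:
            mapping[c1].add(c2)
    return dict(mapping)


def candidate_sentence_pairs(paras1, paras2, mapping, threshold=0.5):
    """Score every sentence pair within each mapped paragraph pair.

    paras1, paras2: lists of (cluster_label, [sentences]) for one document
    from each corpus. Returns (sentence1, sentence2, score) triples whose
    score reaches the (assumed) threshold.
    """
    results = []
    for label1, sents1 in paras1:
        for label2, sents2 in paras2:
            # Skip paragraph pairs whose clusters were never mapped together.
            if label2 not in mapping.get(label1, ()):
                continue
            for s1 in sents1:
                for s2 in sents2:
                    score = cosine(s1, s2)
                    if score >= threshold:
                        results.append((s1, s2, score))
    return results


# Toy usage with made-up labels and sentences.
mapping = learn_cluster_mapping([("symptoms", "signs"), ("treatment", "therapy")])
doc1 = [("symptoms", ["patients report fever and cough"])]
doc2 = [
    ("signs", ["fever and cough are common"]),
    ("therapy", ["patients report rest helps"]),
]
pairs = candidate_sentence_pairs(doc1, doc2, mapping, threshold=0.3)
```

In this toy run only the "signs" paragraph is considered for the "symptoms" paragraph, so the "therapy" sentence is never scored even though it shares words with the query sentence. Restricting comparisons to mapped paragraph pairs is what keeps the sentence-level search small.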