Barzilay and Elhadad, 2003
Citation
Regina Barzilay and Noémie Elhadad. Sentence Alignment for Monolingual Comparable Corpora. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing (EMNLP 2003). Online Link
Summary
This paper studies the problem of aligning documents at the sentence level when they are on the same topic or describe the same event, but were written independently. This is a common situation in newswire text, for instance, where a variety of sources report on a single story, each writing its account separately. This is not a one-to-one mapping problem, as a topic may be only briefly mentioned in one document but detailed extensively in another.
The approach used here has multiple components: first clustering paragraphs within each corpus, then aligning documents at the paragraph level (essentially marking candidate sentence-sentence pairs), and finally performing sentence-level alignment.
Motivation
Multi-document summarization is the primary application of this work. Knowing that multiple sentences across documents describe the same event is a helpful step for generating an extractive summary.
The approach detailed here is particularly designed for the purpose of finding "topics" in a single corpus. However, the term topic here is distinctly different from its common usage in NLP, and is more closely related to "function". For instance, in a medical corpus, "topics" that the authors wish to detect would be "symptoms", "treatment", etc.
Algorithms
As training data, the algorithm takes two comparable corpora describing the same sets of events; this training data is already aligned at the sentence level. Four steps are then taken:
- First, paragraphs are clustered in each corpus's training data independently. The output of this step is a cluster assignment for each paragraph in each document in the training data.
- Next, mappings from clusters in corpus 1 to clusters in corpus 2 are computed from training data.
- In testing data, paragraphs are passed through the first two steps, resulting in each paragraph in corpus 1 being assigned a cluster label and mapped to a set of candidate paragraphs in corpus 2.
- Finally, for each paragraph-paragraph mapping, all possible sentence alignment pairs are independently tested using similarity metrics.
Vertical Paragraph Clustering (Training)
Within-corpus paragraph clustering is performed by UsesMethod:Cosine similarity. Dates, numbers, and proper nouns (taken to be any token starting with a capital letter) are redacted so that similarity is driven by non-content words rather than topic-specific vocabulary. The number of clusters is tuned on training data.

[[File:Barzilay_clustering.png]]
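A minimal sketch of this step follows. The summary specifies neither the vector weighting nor the clustering algorithm, so tf-idf vectors and complete-link agglomerative clustering under cosine distance are assumptions here, and all names are illustrative:

<syntaxhighlight lang="python">
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_extraction.text import TfidfVectorizer

def redact(paragraph):
    """Replace dates/numbers and capitalized tokens (a rough proxy for
    proper nouns, per the summary) with placeholders, so that similarity
    is driven by the remaining non-content vocabulary."""
    out = []
    for tok in paragraph.split():
        if any(ch.isdigit() for ch in tok):
            out.append("NUM")     # dates and numbers
        elif tok[:1].isupper():
            out.append("NAME")    # anything starting with a capital letter
        else:
            out.append(tok)
    return " ".join(out)

def cluster_paragraphs(paragraphs, n_clusters):
    """Cluster redacted paragraphs under cosine distance; n_clusters is
    the knob tuned on training data."""
    X = TfidfVectorizer().fit_transform([redact(p) for p in paragraphs])
    X = X.toarray()
    X /= np.linalg.norm(X, axis=1, keepdims=True) + 1e-12  # unit length
    distance = 1.0 - X @ X.T  # cosine distance between all paragraph pairs
    model = AgglomerativeClustering(n_clusters=n_clusters,
                                    metric="precomputed",
                                    linkage="complete")
    return model.fit_predict(distance)
</syntaxhighlight>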
Horizontal Paragraph Mapping (Training)
Cluster mapping from corpus 1 to corpus 2 takes as input (paragraph in corpus 1, paragraph in corpus 2) pairs and considers them to be aligned if they contain at least one aligned pair of sentences (recall that the training data is already aligned at the sentence level). The mapping between clusters is then learned from a very small number of features: the lexical similarity between the two paragraphs, and the cluster labels of both paragraphs in the pair. Because the features are few and individually weakly predictive, learning is performed using UsesMethod:Boosting.
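A sketch of this learning setup follows. The summary says only that boosting is used; AdaBoost over decision stumps is assumed here as a stand-in, and pair_features and the toy training arrays are illustrative:

<syntaxhighlight lang="python">
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

def pair_features(lex_sim, cluster1, cluster2, n_clusters):
    """Features for one (corpus-1 paragraph, corpus-2 paragraph) pair:
    lexical similarity plus one-hot cluster labels for both sides."""
    v = np.zeros(1 + 2 * n_clusters)
    v[0] = lex_sim
    v[1 + cluster1] = 1.0
    v[1 + n_clusters + cluster2] = 1.0
    return v

# Label is 1 iff the paragraph pair shares at least one aligned sentence
# in the sentence-aligned training data (toy examples shown).
X = np.array([pair_features(0.42, 3, 1, 8),
              pair_features(0.05, 2, 7, 8)])
y = np.array([1, 0])

# Boosted decision stumps: a standard instantiation of "boosting with
# few, weakly predictive features".
clf = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=1),
                         n_estimators=100)
clf.fit(X, y)
</syntaxhighlight>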
Macro Alignment (Testing)
At testing time, a pair of documents, one from each corpus, is given. Each paragraph in each document is passed through the models trained in the first two steps, resulting in a many-to-many macro-level paragraph alignment.
Sentence Alignment (Testing)
From each paragraph-paragraph pair, all possible sentence-sentence pairs are considered as candidate alignments, and lexical similarity is computed for each pair. A dynamic programming pass is then taken over the whole paragraph-paragraph pair. The motivation for this formulation is to prevent the system from basing its judgments on lexical overlap alone, allowing moderately similar sentences to be aligned if they are adjacent to a very high-similarity pair of sentences. The dynamic programming algorithm chooses, for each sentence-sentence pair, the maximum over the following possible scores.
For a sentence pair <math>(i,j)</math>, representing the <math>i</math>th sentence of paragraph 1 and the <math>j</math>th sentence of paragraph 2, the alignment score <math>s(i,j)</math> is given by the recurrence

<math>
s(i,j) = \max \begin{cases}
s(i, j-1) - \text{skip penalty} \\
s(i-1, j) - \text{skip penalty} \\
s(i-1, j-1) + \operatorname{sim}(i,j) \\
s(i-1, j-2) + \operatorname{sim}(i,j) + \operatorname{sim}(i, j-1) \\
s(i-2, j-1) + \operatorname{sim}(i,j) + \operatorname{sim}(i-1, j) \\
s(i-2, j-2) + \operatorname{sim}(i,j-1) + \operatorname{sim}(i-1, j)
\end{cases}
</math>

where <math>\operatorname{sim}(i,j)</math> is the lexical similarity of the two sentences and the skip penalty discourages leaving a sentence unaligned.
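A minimal sketch of this dynamic program follows, assuming <code>sim</code> is a precomputed sentence-similarity matrix for one paragraph pair and <code>skip_penalty</code> is a tuned constant (both names are illustrative, not from the paper):

<syntaxhighlight lang="python">
def align_score(sim, skip_penalty):
    """Fill the DP table for one paragraph pair. sim[i][j] is the lexical
    similarity between sentence i of paragraph 1 and sentence j of
    paragraph 2 (0-indexed); s[i][j] is the best score over the first
    i and j sentences. Returns the filled table."""
    n, m = len(sim), len(sim[0])
    NEG = float("-inf")
    s = [[NEG] * (m + 1) for _ in range(n + 1)]
    s[0][0] = 0.0

    def prev(i, j):
        # Out-of-range predecessors are unreachable.
        return s[i][j] if i >= 0 and j >= 0 else NEG

    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            cands = [prev(i, j - 1) - skip_penalty,   # skip a sentence in paragraph 2
                     prev(i - 1, j) - skip_penalty]   # skip a sentence in paragraph 1
            if i >= 1 and j >= 1:                     # one-to-one alignment
                cands.append(prev(i - 1, j - 1) + sim[i - 1][j - 1])
            if i >= 1 and j >= 2:                     # sentence i aligns to both j-1 and j
                cands.append(prev(i - 1, j - 2)
                             + sim[i - 1][j - 1] + sim[i - 1][j - 2])
            if i >= 2 and j >= 1:                     # sentences i-1 and i align to j
                cands.append(prev(i - 2, j - 1)
                             + sim[i - 1][j - 1] + sim[i - 2][j - 1])
            if i >= 2 and j >= 2:                     # crossing pairs (i-1,j) and (i,j-1)
                cands.append(prev(i - 2, j - 2)
                             + sim[i - 1][j - 2] + sim[i - 2][j - 1])
            s[i][j] = max(cands)
    return s
</syntaxhighlight>

The entry <code>s[n][m]</code> gives the score of the best alignment of the two paragraphs; backpointers over the same six moves would recover the alignment itself, and only the score table is shown here.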
Evaluation
103 pairs of encyclopedia articles were used in this work; 20 pairs were manually aligned at the sentence level, and 83 pairs were only segmented into paragraphs. 92 document pairs were used for vertical paragraph clustering, of which 9 were from the sentence-aligned set; training of the paragraph-paragraph mapping was performed on these nine pairs. 11 document pairs were held out for testing.

[[File:Barzilay_results.png]]
As is common, a tradeoff exists between precision and recall. The best F-score is obtained with a precision of .769 and a recall of .558. This is a promising result given the very small amount of annotated data (only nine document pairs were annotated at training time!).
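For concreteness, plugging the reported operating point into the balanced F-measure (assuming <math>F_1</math>, which the summary does not state explicitly):

<math>
F_1 = \frac{2PR}{P + R} = \frac{2 \times 0.769 \times 0.558}{0.769 + 0.558} \approx 0.647
</math>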