Barzilay and Elhadad, 2003

From Cohen Courses
Revision as of 07:47, 30 September 2011

Citation

Regina Barzilay and Noemie Elhadad. Sentence Alignment for Monolingual Comparable Corpora. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing.

Summary

This paper studies the problem of aligning documents at the sentence level when they cover the same topic or describe the same event but were written independently. This situation is common in newswire text, for instance, where several sources report on a single story but each account is written separately. The problem is not a one-to-one mapping, since a topic may be mentioned only briefly in one document but described in detail in another.

The approach has multiple components: it first clusters paragraphs within each corpus, then aligns documents at the paragraph level (essentially marking candidate sentence-sentence pairs), and finally performs sentence-level alignment.

Motivation

Multi-document summarization is the primary application of this work. Knowing that multiple sentences across documents describe the same event is a helpful step for generating an extractive summary.

The approach is designed specifically to find "topics" within a single corpus. Here, however, the term "topic" differs from its common usage in NLP and is closer to "function". For instance, in a medical corpus, the "topics" the authors wish to detect would be "symptoms", "treatment", etc.

Algorithms

As training data, the algorithm takes two comparable corpora describing the same sets of events. This training data is already aligned at the sentence level. Four steps are then taken:

  • First, paragraphs are clustered independently within each corpus's training data. The output of this step is a cluster assignment for each paragraph of each training document.
  • Next, mappings from clusters in corpus 1 to clusters in corpus 2 are computed from the training data.
  • At test time, paragraphs are passed through the first two steps, so that each paragraph in corpus 1 is assigned a cluster label and mapped to a set of candidate paragraphs in corpus 2.
  • Finally, for each paragraph-paragraph mapping, all possible sentence pairs are tested independently using similarity metrics.
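
The four steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the greedy threshold clustering, the co-occurrence counting used to map clusters, the bag-of-words cosine similarity, and every function name and threshold value are hypothetical stand-ins chosen only to make the data flow concrete. The sketch also assumes that aligned paragraph pairs have already been derived from the sentence-aligned training data.

```python
import math
import re
from collections import Counter, defaultdict


def bow(text):
    """Lowercased bag-of-words vector (assumed representation)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two Counter vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def split_sentences(paragraph):
    """Naive sentence splitter on periods (illustration only)."""
    return [s.strip() for s in paragraph.split(".") if s.strip()]


def cluster_paragraphs(paragraphs, threshold=0.3):
    """Step 1 (stand-in): greedy clustering within one corpus.

    Returns one cluster label per paragraph plus the cluster centroids.
    """
    centroids, labels = [], []
    for p in paragraphs:
        vec = bow(p)
        scores = [cosine(vec, c) for c in centroids]
        if scores and max(scores) >= threshold:
            k = scores.index(max(scores))
            centroids[k] = centroids[k] + vec  # merge word counts
        else:
            k = len(centroids)
            centroids.append(vec)
        labels.append(k)
    return labels, centroids


def learn_cluster_mapping(labels1, labels2, aligned_pairs, min_count=1):
    """Step 2 (stand-in): map each corpus-1 cluster to the corpus-2
    clusters it was aligned with at least min_count times in training."""
    counts = defaultdict(Counter)
    for i, j in aligned_pairs:
        counts[labels1[i]][labels2[j]] += 1
    return {
        a: {b for b, n in row.items() if n >= min_count}
        for a, row in counts.items()
    }


def assign_cluster(paragraph, centroids):
    """Label an unseen paragraph with its most similar cluster."""
    vec = bow(paragraph)
    return max(range(len(centroids)), key=lambda k: cosine(vec, centroids[k]))


def align_sentences(doc1, doc2, centroids1, centroids2, mapping, threshold=0.5):
    """Steps 3 and 4: select candidate paragraph pairs through the cluster
    mapping, then test every sentence pair inside them independently."""
    labels2 = [assign_cluster(p, centroids2) for p in doc2]
    pairs = []
    for p1 in doc1:
        allowed = mapping.get(assign_cluster(p1, centroids1), set())
        for p2, label2 in zip(doc2, labels2):
            if label2 not in allowed:
                continue
            for s1 in split_sentences(p1):
                for s2 in split_sentences(p2):
                    if cosine(bow(s1), bow(s2)) >= threshold:
                        pairs.append((s1, s2))
    return pairs
```

In use, `cluster_paragraphs` would be called once per training corpus, `learn_cluster_mapping` once on the resulting labels, and `align_sentences` once per test document pair. Restricting sentence comparison to mapped paragraph pairs is what keeps the final step from having to score every sentence in one document against every sentence in the other.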

Vertical Paragraph Clustering (Training)

Horizontal Paragraph Mapping (Training)

Macro Alignment (Testing)

Sentence Alignment (Testing)

Evaluation