Berger et al 1996 a maximum entropy approach to natural language processing

Citation

Adam L. Berger, Stephen A. Della Pietra, and Vincent J. Della Pietra. A Maximum Entropy Approach to Natural Language Processing. Computational Linguistics, 22(1), March 1996.

Online Version

An online version is available at [1].

Summary

This oft-cited paper explains the concept of Maximum Entropy Models and relates them to natural language processing, specifically their application to machine translation.

Explanation and Discussion

Maximum Entropy

The paper gives a fairly detailed explanation of the motivation behind Maximum Entropy Models. It divides the modeling problem into two sub-problems: finding facts about the data, and incorporating those facts into the model. These facts are the "features" of the data: the model is constrained to match the empirical expectation of each feature, and among all distributions satisfying the constraints, the one with maximum entropy is chosen.
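
To make the two sub-problems concrete, here is a minimal sketch of a conditional maximum entropy model, loosely patterned on the paper's running example of translating the English word "in" into French. The candidate translations, feature templates, and toy training data are illustrative assumptions, not taken from the paper; the paper also fits its models with Improved Iterative Scaling rather than the plain gradient ascent used here (both reach the same unique maximum entropy solution).

```python
import math
from collections import defaultdict

# Candidate French renderings of English "in" (as in the paper's example).
CANDIDATES = ["dans", "en", "à", "au cours de", "pendant"]

def features(context, y):
    """Binary feature functions f(x, y): each 'fact' found in the data
    becomes one indicator whose expectation the model must match."""
    feats = [("bias", y)]                   # one always-on feature per output
    if "April" in context:                  # hypothetical contextual fact
        feats.append(("month-follows", y))
    return feats

def predict(weights, context):
    """p(y | x) proportional to exp(sum of weights of active features)."""
    scores = {y: math.exp(sum(weights[f] for f in features(context, y)))
              for y in CANDIDATES}
    z = sum(scores.values())                # normalizing constant Z(x)
    return {y: s / z for y, s in scores.items()}

def train(data, epochs=200, lr=0.5):
    """Gradient ascent on log-likelihood: empirical minus expected counts."""
    weights = defaultdict(float)
    for _ in range(epochs):
        for context, y_true in data:
            p = predict(weights, context)
            for f in features(context, y_true):
                weights[f] += lr            # empirical feature count
            for y in CANDIDATES:
                for f in features(context, y):
                    weights[f] -= lr * p[y] # expected feature count
    return weights

if __name__ == "__main__":
    data = [(["April"], "en"), (["the", "box"], "dans"), (["the", "box"], "dans")]
    weights = train(data)
    print(predict(weights, ["April"]))      # mass shifts toward "en"
```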

Experiments, Method, and Data

The case study introduced in the paper involves machine translation, specifically translating French sentences into English. The goal of the paper is to use the described maximum entropy models to augment a basic translation model. The model introduces the concept of alignments, which yield both a sequence of words and a mapping from the input sequence to the output sequence. They also include experiments using maximum entropy models for two other tasks: segmentation and word reordering.

Translation Model

Model

The model is designed to find

  \hat{E} = \arg\max_E p(E \mid F) = \arg\max_E p(E)\, p(F \mid E)

where \hat{E} is the best English translation for the French sequence of words F. The translation probability p(F \mid E) can be defined as a sum of the probabilities of all possible alignments A of F and E:

  p(F \mid E) = \sum_A p(F, A \mid E)

This is defined as the translation model. Their initial model for computing the probability of an alignment A of F and E is given as:

  p(F, A \mid E) = \prod_i p(n_i \mid e_i) \cdot \prod_j p(f_j \mid e_{a_j}) \cdot p(\text{order}(F) \mid A, E)

The first term is the product of the probabilities that each English word e_i produces n_i French words. The second term is the product of the probabilities that the English word e_{a_j} produces the French word f_j, and the final term is the probability of the ordering of the French words.

The drawback of this model is that it does not use any context: an English word translates the same way regardless of the words around it. Their solution is to train a maximum entropy model p_e(f \mid x) for each English word e, giving the probability that e produces the French word f given some context x. The new model is:

  p(F, A \mid E) = \prod_i p(n_i \mid e_i) \cdot \prod_j p_{e_{a_j}}(f_j \mid x_j) \cdot p(\text{order}(F) \mid A, E)
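
To show how the per-word maximum entropy models slot into the alignment model, here is a minimal sketch of scoring a single alignment. The probability tables, the one-word context window, and the uniform ordering term are stand-in assumptions for illustration, not the paper's actual parameterization.

```python
import math
from collections import Counter

def alignment_prob(english, french, alignment,
                   p_fertility, p_translate, p_ordering):
    """p(F, A | E): fertility term * word-translation term * ordering term.
    alignment[j] gives the index of the English word producing french[j];
    p_translate(e, f, context) plays the role of the per-word maximum
    entropy model p_e(f | x) described above."""
    counts = Counter(alignment)
    prob = 1.0
    for i, e in enumerate(english):                 # fertility term
        prob *= p_fertility(e, counts.get(i, 0))
    for j, f in enumerate(french):                  # translation term
        context = french[max(0, j - 1):j]           # toy context: previous word
        prob *= p_translate(english[alignment[j]], f, context)
    prob *= p_ordering(french, alignment, english)  # ordering (distortion) term
    return prob

# Hypothetical usage with stand-in distributions.
prob = alignment_prob(
    ["the", "house"], ["la", "maison"], alignment=[0, 1],
    p_fertility=lambda e, n: 0.9 if n == 1 else 0.05,
    p_translate=lambda e, f, ctx: 0.7,              # a real system queries p_e(f | x)
    p_ordering=lambda f, a, e: 1.0 / math.factorial(len(f)),
)
print(prob)
```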

Results

Other Models

They describe using a feature-based model for segmenting sentences into phrases. This is intended to determine "appropriate" places in the text at which to run the translation model, by computing the probability of a rift at each position in the sequence of words. They use various features, including part of speech, and train on data from an "expert" French segmenter. They then use dynamic programming to produce the segments, as sketched below. They don't provide much empirical evidence beyond the log-likelihood during training and an example of segmented text.
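
A minimal sketch of the dynamic program, assuming the per-position rift probabilities have already been produced by the maximum entropy model. The max_len cap on segment length is an added assumption to keep the example interesting; with fully independent rift decisions and no constraints, thresholding each position at 0.5 would suffice.

```python
def best_segmentation(rift_prob, max_len=5):
    """Most probable segmentation of an n-word sequence. rift_prob[k] is the
    model's probability of a boundary ("rift") after word k, for k in
    0..n-2. A segmentation's probability is the product of rift_prob at
    chosen boundaries and (1 - rift_prob) at unchosen ones."""
    n = len(rift_prob) + 1                  # number of words
    best = [0.0] * (n + 1)                  # best[j]: best segmentation of words [0, j)
    back = [0] * (n + 1)
    best[0] = 1.0
    for j in range(1, n + 1):
        for i in range(max(0, j - max_len), j):   # last segment is words [i, j)
            p = best[i]
            for k in range(i, j - 1):       # interior positions: no rift
                p *= 1.0 - rift_prob[k]
            if j < n:                       # rift at the segment's right edge
                p *= rift_prob[j - 1]
            if p > best[j]:
                best[j], back[j] = p, i
    cuts, j = [], n                         # recover boundaries via backpointers
    while j > 0:
        cuts.append(j)
        j = back[j]
    return sorted(cuts)

print(best_segmentation([0.9, 0.1, 0.2, 0.8, 0.3]))   # -> [1, 4, 6]
```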

The last model they describe is for reordering words. This is useful because oftentimes, even when the words are aligned properly, the English ordering should differ from the French ordering. They examine one specific case, NOUN de NOUN phrases. They train the model on 10,000 instances and compare the maximum entropy model for swapping against a simple model that never swaps. They tested the results on 71,555 instances:

Example (count)            | Simple Model Accuracy | Maximum Entropy Model Accuracy
Non-interchanged (50,229)  | 100%                  | 93.5%
Interchanged (21,326)      | 0%                    | 49.2%
Total (71,555)             | 70.2%                 | 80.4%
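
For concreteness, the swap decision can be sketched as a binary maximum entropy (logistic) classifier over the two nouns. The feature templates, example phrases, and weights below are illustrative assumptions; the paper learns its model from the 10,000 training instances.

```python
import math

def swap_probability(noun1, noun2, weights):
    """p(interchange | NOUN1 de NOUN2) under a maximum entropy model with
    indicator features on the left and right nouns."""
    score = (weights.get(("left", noun1), 0.0)
             + weights.get(("right", noun2), 0.0)
             + weights.get("bias", 0.0))
    return 1.0 / (1.0 + math.exp(-score))   # two-class maxent = logistic

# Hypothetical learned weights: a negative bias reflects that most
# NOUN de NOUN phrases (50,229 of 71,555) are not interchanged.
weights = {"bias": -1.0, ("left", "taux"): 1.5, ("right", "change"): 1.0}
print(swap_probability("taux", "change", weights))       # > 0.5: swap ("exchange rate")
print(swap_probability("chambre", "commerce", weights))  # < 0.5: keep ("chamber of commerce")
```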

Related Work

As this was one of the earliest works applying maximum entropy models to natural language processing, it is often used as background knowledge for other maximum entropy papers, including MEMMs. A few papers follow:

* Freitag 2000 Maximum Entropy Markov Models for Information Extraction and Segmentation
* Klein 2002 conditional structure versus conditional estimation in nlp models