Berger et al 1996 a maximum entropy approach to natural language processing

From Cohen Courses

Citation

"A Maximum-Entropy Approach to Natural Language Processing", Adam Berger, Stephen Della Pietra, and Vincent Della Pietra; Computational Linguistics, (22-1), March 1996;

Online Version

An online version is located at [1]

Summary

This oft-cited paper explains the concept of Maximum Entropy Models and relates them to natural language processing, specifically as they can be applied to Machine Translation.

Explanation and Discussion

Maximum Entropy

The paper gives a fairly detailed explanation of the motivation behind Maximum Entropy Models. The modeling problem is divided into two sub-problems: finding facts about the data, and incorporating those facts into the model. These facts are the "features" of the data. The paper also discusses the mathematics of the maximum entropy model at length. One piece of justification is that the maximum entropy model can also be shown to be the model that, of all models of the given parametric form, best fits the training data (i.e. maximizes the likelihood of the training data).
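For reference, the parametric form in question (the equations themselves are not reproduced on this page) is the log-linear exponential model, in which each feature $f_i(x, y)$ receives a weight $\lambda_i$:

    $p_\lambda(y \mid x) = \frac{1}{Z_\lambda(x)} \exp\Big( \sum_i \lambda_i f_i(x, y) \Big), \qquad Z_\lambda(x) = \sum_{y'} \exp\Big( \sum_i \lambda_i f_i(x, y') \Big)$

Maximizing entropy subject to the constraint that each feature's expectation under the model equal its empirical expectation yields exactly this form, and the weights satisfying the constraints are also the maximum likelihood weights within this family; that is the duality referred to above.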

Iterative Scaling

They define the Improved Iterative Scaling algorithm for computing the parameters (feature weights) of the maximum entropy model (a code sketch follows the outline below):

Input: Feature functions $f_1, \dots, f_n$; empirical distribution $\tilde{p}(x, y)$

Output: Optimal parameter values $\lambda_1^{*}, \dots, \lambda_n^{*}$; optimal model $p_{\lambda^{*}}$

Algorithm:

  • Initialize all $\lambda_i$ to 0
  • Compute an update $\Delta\lambda_i$ for each $i$ and set $\lambda_i \leftarrow \lambda_i + \Delta\lambda_i$
    • $\Delta\lambda_i$ is the solution to $\sum_{x, y} \tilde{p}(x)\, p_\lambda(y \mid x)\, f_i(x, y)\, \exp\!\big(\Delta\lambda_i f^{\#}(x, y)\big) = \tilde{p}(f_i)$, the empirical expectation of $f_i$
    • $f^{\#}(x, y) = \sum_i f_i(x, y)$ is the sum of the feature values for $x$ and $y$
  • Repeat until convergence
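Below is a minimal Python sketch of this loop (an illustration, not code from the paper). It assumes a small, fully enumerable set of contexts and labels so that model expectations can be computed exactly, assumes non-negative feature values, and solves each per-feature update equation with a few Newton steps:

    import math
    from collections import defaultdict

    def p_lambda(lambdas, features, contexts, labels):
        # Conditional model p_lambda(y | x), normalized separately for each context x.
        probs = {}
        for x in contexts:
            scores = {y: math.exp(sum(l * f(x, y) for l, f in zip(lambdas, features)))
                      for y in labels}
            z = sum(scores.values())
            for y in labels:
                probs[(x, y)] = scores[y] / z
        return probs

    def iis(features, samples, labels, iterations=50):
        # features: list of functions f_i(x, y) -> non-negative float
        # samples:  observed (x, y) pairs defining the empirical distribution p~
        # labels:   the finite set of possible outputs y
        contexts = {x for x, _ in samples}
        n = float(len(samples))
        emp = [sum(f(x, y) for x, y in samples) / n for f in features]  # p~(f_i)
        px = defaultdict(float)                                         # p~(x)
        for x, _ in samples:
            px[x] += 1.0 / n

        lambdas = [0.0] * len(features)          # start with every weight at zero
        for _ in range(iterations):
            model = p_lambda(lambdas, features, contexts, labels)
            for i, f in enumerate(features):
                # Solve sum_{x,y} p~(x) p(y|x) f_i(x,y) exp(delta * f#(x,y)) = p~(f_i)
                # for delta by Newton's method, where f#(x,y) = sum_k f_k(x,y).
                delta = 0.0
                for _ in range(25):
                    g, dg = -emp[i], 0.0
                    for x in contexts:
                        for y in labels:
                            fi = f(x, y)
                            if fi == 0.0:
                                continue
                            fsharp = sum(fk(x, y) for fk in features)
                            term = px[x] * model[(x, y)] * fi * math.exp(delta * fsharp)
                            g += term
                            dg += term * fsharp
                    if dg == 0.0 or abs(g) < 1e-10:
                        break
                    delta -= g / dg
                lambdas[i] += delta
        return lambdas

When every feature is binary and the feature sum $f^{\#}(x, y)$ is a constant $M$, the update equation has the closed-form solution $\Delta\lambda_i = \frac{1}{M} \log \frac{\tilde{p}(f_i)}{E_{p_\lambda}[f_i]}$, which is the Generalized Iterative Scaling special case.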

Feature Selection

The paper defines a basic method for Feature Selection. It takes as input a large set of candidate features and produces both a set of selected features and a model incorporating those features. The method goes as follows:

  • For each candidate feature:
    • Compute a model (using the iterative scaling method) from the current set of selected features plus the candidate feature.
    • Compute the resulting gain in log-likelihood.
  • After doing this for every candidate, add the feature with the largest gain to the set of selected features.
  • Repeat until termination.

The termination condition is not set in stone; the paper recommends using a withheld data set: test each newly selected feature against it, and discard the feature if it does not improve the likelihood of the held-out data.

This basic method is clearly not practical, since a full model must be fit for every candidate feature at every iteration, and the candidate set is typically very large. The paper therefore proposes approximating each candidate's gain, adjusting only the new feature's weight while holding the existing weights fixed, to avoid complete re-computation of the model.
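A sketch of the greedy selection loop, assuming helper functions train_model(features, data) and log_likelihood(model, data) in the spirit of the iterative-scaling sketch above (the names are illustrative, not from the paper). For readability it computes the exact gain by refitting the model for each candidate, rather than the paper's one-parameter approximation, and the max_features cap is just a safeguard, not part of the paper's method:

    def select_features(candidates, train_data, heldout_data, max_features=25):
        # Greedy feature selection: repeatedly add the candidate whose addition
        # yields the largest gain in training log-likelihood, stopping when the
        # chosen feature no longer helps on held-out data.
        selected = []
        model = train_model(selected, train_data)       # hypothetical helper
        base_ll = log_likelihood(model, train_data)     # hypothetical helper
        remaining = list(candidates)

        while remaining and len(selected) < max_features:
            gains = []
            for f in remaining:
                trial = train_model(selected + [f], train_data)
                gains.append((log_likelihood(trial, train_data) - base_ll, f, trial))
            gain, best, best_model = max(gains, key=lambda t: t[0])

            # Termination test on withheld data: keep the feature only if it
            # improves held-out likelihood; otherwise stop.
            if log_likelihood(best_model, heldout_data) <= log_likelihood(model, heldout_data):
                break
            selected.append(best)
            remaining.remove(best)
            model, base_ll = best_model, base_ll + gain
        return selected, model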

Experiments, Method, and Data

The case study introduced in the paper involves Machine Translation, translating French sentences to English. The goal is to use the described maximum entropy model to augment a basic translation model. The model introduces the concept of alignments, which provide, along with the output sequence of words, a mapping from words of the input sequence to words of the output sequence. The paper also includes experiments using maximum entropy models for two other sub-tasks: segmentation and word reordering.

Translation Model

Model

The model is designed to find

    $\hat{E} = \arg\max_{E} p(E \mid F) = \arg\max_{E} p(E)\, p(F \mid E)$

where $\hat{E}$ is the best English translation for the French sequence of words $F$. The probability $p(F \mid E)$ can be defined as a sum over all possible alignments $A$ of $F$ and $E$:

    $p(F \mid E) = \sum_{A} p(F, A \mid E)$

This is defined as the translation model. Their initial model for computing the probability of an alignment $A$ of $F$ and $E$ is given as a product of three terms:

    $p(F, A \mid E) = \prod_{e \in E} n(\phi_e \mid e) \;\; \prod_{j} t(f_j \mid e_{a_j}) \;\; d(A \mid E, F)$

The first term is the product of the probabilities that a given English word $e$ produces $\phi_e$ French words (its fertility). The second term is the product of the probabilities that the aligned English word $e_{a_j}$ produces the French word $f_j$, and the final term is the probability of the ordering (distortion) of the French words.

The drawback of this model is that it uses no context: the probability $t(f \mid e)$ that an English word $e$ produces a French word $f$ ignores the words around the word in question. Their solution is to train a maximum entropy model $p_e(f \mid x)$ for each English word $e$ that predicts the French word $f$ from some context $x$ (the words surrounding the word being translated). The new model replaces $t(f_j \mid e_{a_j})$ in the product above with the context-dependent $p_{e_{a_j}}(f_j \mid x_j)$.
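As an illustration (not the paper's exact features or weights), a context-dependent maximum entropy translation model for a single English word might look like the sketch below, where each feature asks whether a particular trigger word appears in a small window around the word being translated:

    import math

    def window_feature(trigger_word, french_choice):
        # Binary feature: fires when `trigger_word` occurs in the context window
        # AND the model proposes `french_choice` as the translation.
        def f(context, y):
            return 1.0 if y == french_choice and trigger_word in context["window"] else 0.0
        return f

    def translate_prob(lambdas, features, context, choices):
        # p_e(f | x): distribution over French renderings of one English word.
        scores = [math.exp(sum(l * f(context, y) for l, f in zip(lambdas, features)))
                  for y in choices]
        z = sum(scores)
        return {y: s / z for y, s in zip(choices, scores)}

    # Toy example for the English word "in": candidate renderings and two
    # illustrative features (the weights are invented, not taken from the paper).
    choices = ["dans", "en", "à", "au cours de", "pendant"]
    features = [window_feature("months", "au cours de"),
                window_feature("house", "dans")]
    lambdas = [1.2, 0.8]

    print(translate_prob(lambdas, features, {"window": ["the", "months", "ahead"]}, choices))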

Results

The paper shows results for only a very small sample: the translation of in and of run. The training data are not described in detail, but the results are explained. Each model was first trained using the standard word-identity features; the paper then lists the additional context features and the change in log-likelihood each produces.

For in, the added features improve the log-likelihood from -2.9674 to -2.7792.

For run, the added features improve the log-likelihood from -4.8499 to -4.7146.

Other Models

They describe using a feature-based model for segmenting sentences into phrases. This is intended to determine "appropriate" places in the text at which to run the translation model, by computing the probability of a rift at each position in the sequence of words. They use various features, including part of speech, and train on data from an "expert" French segmenter. Dynamic programming is then used to produce the actual segmentation. They do not provide much empirical evidence beyond the log-likelihood reached in training and an example of segmented text.
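As an illustration of the dynamic programming step (a sketch of the idea, not the paper's exact algorithm), one way to turn per-position rift probabilities from the maximum entropy model into a segmentation is to choose the set of cut points that maximizes the product of the per-position probabilities, subject to a maximum segment length:

    import math

    def best_segmentation(rift_probs, max_len=10):
        # rift_probs[i] = model probability of a rift (segment boundary) after
        # word i, for i = 0 .. n-2 (n words, n-1 possible boundary positions).
        # Returns the positions (1-indexed word counts) at which segments end.
        n = len(rift_probs) + 1
        cut = [math.log(p) for p in rift_probs]         # log-prob of cutting here
        keep = [math.log(1.0 - p) for p in rift_probs]  # log-prob of not cutting

        best = [float("-inf")] * (n + 1)   # best[i]: best log-prob for first i words
        back = [0] * (n + 1)               # back[i]: where the last segment starts
        best[0] = 0.0
        for i in range(1, n + 1):
            for j in range(max(0, i - max_len), i):
                # One segment spans words j .. i-1: no rift inside it,
                # and a rift at its end (unless it ends the sentence).
                score = best[j] + sum(keep[j:i - 1])
                if i < n:
                    score += cut[i - 1]
                if score > best[i]:
                    best[i], back[i] = score, j
        cuts, i = [], n
        while i > 0:
            cuts.append(i)
            i = back[i]
        return sorted(cuts)

    # Toy usage: six words, rift probabilities after words 1..5.
    print(best_segmentation([0.1, 0.8, 0.2, 0.1, 0.7], max_len=3))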

The last model they describe is for re-ordering words. This is useful because, even when the words are aligned properly, the correct English ordering often differs from the French ordering. They examine one specific case, NOUN de NOUN phrases. They train the model on 10,000 instances and compare the maximum entropy swapping model against a simple model that never swaps. The results were tested on 71,555 instances.

Example (count)              Simple Model Accuracy    Maximum Entropy Model Accuracy
Non-interchanged (50,229)    100%                     93.5%
Interchanged (21,326)        0%                       49.2%
Total (71,555)               70.2%                    80.4%
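Note that the totals are just instance-weighted averages of the two rows: for the maximum entropy model, (50,229 × 93.5% + 21,326 × 49.2%) / 71,555 ≈ 80.4%, while the simple model's 70.2% is simply the fraction of non-interchanged instances, 50,229 / 71,555, since it never swaps.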

Related Work

As this was one of the earliest works applying maximum entropy models to natural language processing, it is often used as background for other maximum entropy papers, including work on MEMMs. A few such papers follow: