Berger et al 1996 a maximum entropy approach to natural language processing

From Cohen Courses
Revision as of 23:29, 28 September 2011 by Fkeith (talk | contribs)

Being edited by Francis Keith

Citation

Adam Berger, Stephen Della Pietra, and Vincent Della Pietra. A Maximum Entropy Approach to Natural Language Processing. Computational Linguistics, 22(1), March 1996.

Online Version

An online version is located at [1]

Summary

This oft-cited paper explains the concept of Maximum Entropy Models and relates them to natural language processing, specifically as they can be applied to Machine Translation.

Explanation and Discussion

Maximum Entropy

The paper goes into a fairly detailed explanation of the motivation behind Maximum Entropy Models. The authors divide the modeling task into two sub-problems: finding facts about the data, and incorporating those facts into the model. These facts are the "features" of the data.
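The resulting model has the log-linear form <math>p(y \mid x) = \exp\left(\sum_i \lambda_i f_i(x, y)\right) / Z(x)</math>, where each binary feature <math>f_i</math> encodes one "fact" and <math>Z(x)</math> normalizes over all outputs. A minimal sketch of this form follows; the feature functions and weights here are made up for illustration, not taken from the paper.

```python
import math

# A conditional maximum entropy model assigns
#   p(y | x) = exp(sum_i lambda_i * f_i(x, y)) / Z(x),
# where each binary feature f_i encodes one "fact" about the data
# and Z(x) normalizes over all candidate outputs y.

def maxent_prob(x, y, labels, features, weights):
    """p(y | x) under a log-linear (maximum entropy) model."""
    def score(label):
        return math.exp(sum(w * f(x, label) for f, w in zip(features, weights)))
    z = sum(score(label) for label in labels)  # partition function Z(x)
    return score(y) / z

# Toy example with one feature over labels {"a", "b"} (illustrative only):
labels = ["a", "b"]
features = [lambda x, y: 1.0 if y == "a" and x == 1 else 0.0]
weights = [1.0]
p_a = maxent_prob(1, "a", labels, features, weights)  # exp(1) / (exp(1) + 1)
```

Training amounts to choosing the weights so that the expected value of each feature under the model matches its empirical expectation in the data.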

Experiments, Method, and Data

The case study they introduce in the paper involves Machine Translation, translating French sentences to English. The goal of the paper is to use the described maximum entropy model to augment the basic translation model. The model introduces the concept of alignments, each of which yields both a sequence of words and a mapping from the words of the input sequence to the words of the output sequence.

Translation Model

Model

The model is designed to find:

<math>\hat{e} = \arg\max_{e} p(e \mid f)</math>

where <math>\hat{e}</math> is the best English translation for the French sequence of words <math>f</math>. The probability <math>p(f \mid e)</math> can be defined as a sum of the probabilities of all possible alignments <math>a</math> of <math>f</math> and <math>e</math>:

<math>p(f \mid e) = \sum_{a} p(f, a \mid e)</math>

This is defined as the translation model. Their initial model for computing the probability of an alignment <math>a</math> of <math>f</math> and <math>e</math> is given as:

<math>p(f, a \mid e) = \prod_{i=1}^{l} n(\phi_i \mid e_i) \prod_{j=1}^{m} t(f_j \mid e_{a_j}) \prod_{j=1}^{m} d(j \mid a_j, l, m)</math>

The first term is the product of the probabilities <math>n(\phi_i \mid e_i)</math> that a given English word <math>e_i</math> produces <math>\phi_i</math> French words (its fertility). The second term is the product of the probabilities <math>t(f_j \mid e_{a_j})</math> that the given English word <math>e_{a_j}</math> produces the French word <math>f_j</math>, and the final term is the probability of the ordering of the French words (distortion).
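The three-factor decomposition described above (fertility times word translation times distortion) can be sketched in code. This is a toy illustration, assuming the three probability tables are plain dictionaries; the numbers in the usage example are invented, not trained values from the paper.

```python
# Sketch of the alignment probability as a product of fertility,
# word-translation, and distortion terms. alignment[j] gives the index i
# of the English word assumed to produce French word j.

def alignment_prob(f_words, e_words, alignment, n, t, d):
    """p(f, a | e) under the three-factor decomposition (toy tables)."""
    prob = 1.0
    # Fertility: probability each English word e_i produces phi_i French words.
    for i, e in enumerate(e_words):
        phi = sum(1 for a in alignment if a == i)
        prob *= n.get((phi, e), 1e-9)
    # Word translation: probability e_{a_j} produces f_j.
    for j, f in enumerate(f_words):
        prob *= t.get((f, e_words[alignment[j]]), 1e-9)
    # Distortion: probability of French position j given a_j and the lengths.
    m, l = len(f_words), len(e_words)
    for j in range(m):
        prob *= d.get((j, alignment[j], l, m), 1e-9)
    return prob

# Invented toy tables for a two-word pair of sentences:
e_words = ["the", "house"]
f_words = ["la", "maison"]
alignment = [0, 1]
n = {(1, "the"): 0.9, (1, "house"): 0.8}
t = {("la", "the"): 0.5, ("maison", "house"): 0.6}
d = {(0, 0, 2, 2): 0.9, (1, 1, 2, 2): 0.9}
p = alignment_prob(f_words, e_words, alignment, n, t, d)
```

Summing this quantity over all possible alignments would give the overall translation probability of the sentence pair.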

The drawback with this model is that it makes no use of context: the word-translation probability depends only on the English word and the French word themselves. Their solution is to train a maximum entropy model <math>p_e(f \mid x)</math> for each English word <math>e</math>, giving the probability that it produces a French word <math>f</math> based on some context <math>x</math>. The new model substitutes this context-dependent probability for the context-free word-translation term:

<math>p(f, a \mid e) = \prod_{i=1}^{l} n(\phi_i \mid e_i) \prod_{j=1}^{m} p_{e_{a_j}}(f_j \mid x) \prod_{j=1}^{m} d(j \mid a_j, l, m)</math>
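The substitution can be illustrated by making the word-translation probability a per-English-word log-linear model over a context window, echoing the paper's running example of translating the English word "in" into French. The context representation, the single feature (preferring "en" before month names), and its weight are assumptions made for illustration, not the paper's actual feature set.

```python
import math

# Context-dependent translation probability: one maximum entropy model per
# English word e, conditioning on a context x. Features and weights here
# are illustrative stand-ins, not trained values from the paper.

def maxent_translation(e, f, x, models):
    """p_e(f | x): probability that English word e produces French word f
    given context x, under a per-word log-linear model."""
    labels, features, weights = models[e]
    def score(label):
        return math.exp(sum(w * g(x, label) for g, w in zip(features, weights)))
    return score(f) / sum(score(label) for label in labels)

# Toy model for "in": one feature favoring "en" before month names.
models = {
    "in": (
        ["dans", "en"],
        [lambda x, y: 1.0 if y == "en" and x.get("next") in {"April", "May"} else 0.0],
        [2.0],
    )
}

p_en = maxent_translation("in", "en", {"next": "April"}, models)
p_dans = maxent_translation("in", "dans", {"next": "April"}, models)
```

With the feature active, the model shifts probability mass toward "en"; with a different context the two renderings would revert to the uniform baseline.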

Results