Lau et al HLT 1993

Citation

Raymond Lau, Ronald Rosenfeld and Salim Roukos. Adaptive Language Modeling Using the Maximum Entropy Principle. In Proceedings of the ARPA Human Language Technology Workshop, published as Human Language Technology, pages 108–113. Morgan Kaufmann, March 1993.

Online version

ACL WEB

Summary

In this paper the authors focus on the development of Language Models using the Maximum Entropy Principle, in order to combine evidence from multiple sources (for example: trigrams and long distance triggers).

At the time the paper was written, the state-of-the-art language model was a "static" trigram model, in which the probability of the next word is estimated by looking only at the two words that precede it. By definition, once trained, this model always assigns the same probabilities to the next word given the same two preceding words, so it cannot "adapt" to different texts and contexts. In this paper the authors develop an adaptive model, i.e., a model that changes its estimates as a result of "seeing" some of the text. This makes it possible to handle large heterogeneous data sources (different writing styles and/or topics), and the model does not have to be trained on the same domain in which it will be used.

In the model, this domain variation is captured through the concept of "trigger pairs". If a word sequence $A$ is significantly correlated with another word sequence $B$, then $A\rightarrow B$ is considered a trigger pair.

Given the document that was processed so far (h) and a word considered for the next position (w), there are many different estimates P(w|h), derived from the various triggers. How to combine them?

Brief Description of the Method

The Maximum Entropy Principle is a common technique for combining several knowledge sources into a single estimate. The method reformulates the different estimates as constraints on the expectations of various functions, to be satisfied by the combined estimate. Among all the probability distributions that satisfy these constraints, it chooses the one with the highest entropy.

Given a general event space $\{x\}$ , to derive a combined probability function $P(x)$ , each constraint $i$ is associated with a constraint function $f_{i}(x)$ and a desired expectation $c_{i}$ . The constraint is then written as:

$E_{P}f_{i}\ {\overset {\mathrm {def} }{=}}\ \sum _{x}P(x)f_{i}(x)=c_{i}$

Given consistent constraints, a unique ME solution is guaranteed to exist, and to be of the form:

$P(x)=\prod _{i}\mu _{i}^{f_{i}(x)}$

where the $\mu _{i}$ ’s are some unknown constants, to be found.
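As a minimal sketch of the product form above, the following assumes a small finite event space and binary constraint functions, and it normalizes explicitly (the formula in the text leaves normalization implicit, for instance absorbed into a constant feature). The names `me_distribution`, `features`, and `mu`, and the toy values, are illustrative assumptions and do not come from the paper.

```python
from math import prod

def me_distribution(events, features, mu):
    """Return P(x) proportional to prod_i mu[i] ** f_i(x), normalized over events."""
    unnorm = {x: prod(m ** f(x) for f, m in zip(features, mu)) for x in events}
    z = sum(unnorm.values())
    return {x: u / z for x, u in unnorm.items()}

# Toy event space with two binary constraint functions (made-up values).
events = ["a", "b", "c"]
features = [
    lambda x: 1 if x == "a" else 0,
    lambda x: 1 if x in ("a", "b") else 0,
]
mu = [2.0, 3.0]

p = me_distribution(events, features, mu)
# Unnormalized weights are 6, 3 and 1, so P is 0.6, 0.3 and 0.1.
```

Each $\mu _{i}$ only multiplies the weight of the events on which its constraint function fires, which is why the constants can be tuned one constraint at a time.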

To search the exponential family defined by the equation above for the $\mu _{i}$ ’s that make $P(x)$ satisfy all the constraints, the authors used an iterative algorithm, Generalized Iterative Scaling (GIS), which is guaranteed to converge to the solution.
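The iterative search can be sketched on a toy problem. This is a textbook-style version of Generalized Iterative Scaling over a small finite event space, not the authors' implementation, which works with conditional distributions $P(w|h)$ over a large vocabulary. It assumes consistent constraints with strictly positive targets, and it appends a slack feature so that the feature values sum to the same constant `C` for every event. The function name `gis` and the toy targets are assumptions made for illustration.

```python
from math import prod

def gis(events, features, targets, iterations=200):
    """Fit P(x) ~ prod_i mu_i ** f_i(x) so that E_P[f_i] matches targets[i]."""
    # GIS assumes the feature values sum to the same constant C for every
    # event, so a slack feature is appended to make that true.
    fvals = {x: [f(x) for f in features] for x in events}
    C = max(sum(v) for v in fvals.values())
    for v in fvals.values():
        v.append(C - sum(v))
    c = list(targets) + [C - sum(targets)]

    def distribution(mu):
        unnorm = {x: prod(m ** v for m, v in zip(mu, fvals[x])) for x in events}
        z = sum(unnorm.values())
        return {x: u / z for x, u in unnorm.items()}

    mu = [1.0] * len(c)
    for _ in range(iterations):
        p = distribution(mu)
        # Expectation of each feature under the current model.
        expected = [sum(p[x] * fvals[x][i] for x in events) for i in range(len(c))]
        # Multiplicative update: scale mu_i by (target / expectation) ** (1 / C).
        mu = [m * (ci / ei) ** (1.0 / C) for m, ci, ei in zip(mu, c, expected)]
    return distribution(mu)

events = ["a", "b", "c"]
features = [
    lambda x: 1 if x == "a" else 0,
    lambda x: 1 if x in ("a", "b") else 0,
]
p = gis(events, features, targets=[0.5, 0.8])
```

In this toy case the two constraints plus normalization determine the distribution completely, so the iteration should approach P(a) = 0.5, P(b) = 0.3, P(c) = 0.2.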

Each trigger pair $A\rightarrow B$ is formulated as a constraint function $f_{A\rightarrow B}$, whose expectation under the model is constrained to match its empirical expectation:

$f_{A\rightarrow B}(h,w)={\begin{cases}1,&{\mbox{if }}A\in h,w=B\\0,&{\mbox{otherwise}}\end{cases}}$
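The trigger constraint function can be written directly as code. This sketch assumes single-word triggers and a history represented as a sequence of words; the helper name `make_trigger_feature` and the example pair "loan" → "bank" are made up for illustration and are not taken from the paper.

```python
def make_trigger_feature(a, b):
    """Build f_{A->B}(h, w): 1 if A occurred in the history h and w == B, else 0."""
    def feature(history, word):
        return 1 if a in history and word == b else 0
    return feature

# Hypothetical trigger pair, chosen only as an example.
f_loan_bank = make_trigger_feature("loan", "bank")

history = ("the", "loan", "was", "approved", "by", "the")
fires = f_loan_bank(history, "bank")           # trigger seen, word matches
wrong_word = f_loan_bank(history, "river")     # trigger seen, other word
no_trigger = f_loan_bank(("the",), "bank")     # trigger not in the history
```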

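Its associated constraint follows directly from the general constraint equation, with the expectation taken over the empirical distribution of histories ${\hat {P}}(h)$ and the model's conditional distribution $P(w|h)$: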

$\sum _{h}{\hat {P}}(h)\sum _{w}P(w|h)\,f_{i}(h,w)=c_{i}$
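To make the two sides of this constraint concrete, the sketch below computes the left-hand side for a given conditional model and the empirical expectation of the same function on toy training data. It assumes histories are stored as tuples of words and uses a uniform placeholder model; `trigger`, `empirical_expectation`, `model_expectation`, and the data are all hypothetical names and values, not from the paper.

```python
from collections import Counter

def trigger(history, word):
    # Illustrative trigger pair "loan" -> "bank" (made up, not from the paper).
    return 1 if "loan" in history and word == "bank" else 0

def empirical_expectation(pairs, feature):
    """Fraction of observed (history, word) pairs on which the feature fires."""
    return sum(feature(h, w) for h, w in pairs) / len(pairs)

def model_expectation(pairs, vocab, cond_prob, feature):
    """sum_h P_hat(h) * sum_w P(w|h) * f(h, w), with P_hat taken from the data."""
    hist_counts = Counter(h for h, _ in pairs)
    total = sum(hist_counts.values())
    return sum(
        (n / total) * sum(cond_prob(w, h) * feature(h, w) for w in vocab)
        for h, n in hist_counts.items()
    )

pairs = [
    (("the", "loan"), "bank"),
    (("a", "cat"), "the"),
    (("the", "loan"), "the"),
]
vocab = ["bank", "cat", "the"]

def uniform(word, history):
    # Placeholder model: every word is equally likely after any history.
    return 1.0 / len(vocab)

c = empirical_expectation(pairs, trigger)                 # 1/3 on this toy data
lhs = model_expectation(pairs, vocab, uniform, trigger)   # 2/9 under the uniform model
```

Here the uniform model does not satisfy the constraint (2/9 versus 1/3); training adjusts the model until the two values agree.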

To incorporate the previous static model, the authors reformulated its N-gram estimates as constraint functions that fit the ML/ME paradigm. For bigrams:

$f_{w_{1},w_{2}}(h,w)={\begin{cases}1,&{\mbox{if }}h{\mbox{ ends in }}w_{1}{\mbox{ and }}w=w_{2}\\0,&{\mbox{otherwise}}\end{cases}}$
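The bigram constraint function above has the same shape as a trigger function, except that it looks only at the last word of the history. A minimal sketch, assuming the history is a sequence of words; the name `make_bigram_feature` and the example words are illustrative assumptions.

```python
def make_bigram_feature(w1, w2):
    """Build f_{w1,w2}(h, w): 1 if the history h ends in w1 and w == w2, else 0."""
    def feature(history, word):
        return 1 if history and history[-1] == w1 and word == w2 else 0
    return feature

f_wall_street = make_bigram_feature("wall", "street")

ends_in_w1 = f_wall_street(("on", "wall"), "street")   # history ends in "wall"
w1_earlier = f_wall_street(("wall", "on"), "street")   # "wall" is not the last word
empty_hist = f_wall_street((), "street")               # nothing to condition on
```

Because both kinds of function share the signature `feature(history, word)`, static N-gram evidence and trigger evidence can be placed in the same constraint set.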

Similarly, the authors modeled the "bursty" nature of language, i.e., the fact that once an infrequent word occurs in a document, its probability of recurring is significantly elevated. This was done with a self-trigger, a trigger pair of the form $A\rightarrow A$ .

Experimental Results

The models were trained on 5 million words of Wall Street Journal text, using DARPA's 20k-word vocabulary. The authors tested the baseline trigram model against their own models: the ML/ME model with the best 3 triggers for each word, and the ML/ME model with the best 6 triggers for each word. Each of these two configurations was also run with the static trigram model incorporated, as a separate experiment. The results can be seen in the following table:

ME is simple, intuitive and general. It can accommodate new factors simply by reformulating them as constraints, and it can reuse information from previously used models (such as the static one).

Generalized Iterative Scaling allows incremental adaptation, since new constraints can be added at any time, and it is guaranteed to converge to a unique ME solution. However, it is computationally very expensive and requires the constraints to be consistent (which may not hold if the constraints are derived from data other than the training data, or are externally imposed).

The way the authors encoded self-triggers does not take into account the number of times the word has previously occurred, which is significant for the problem. They leave modeling the frequency and distance of previous occurrences as future work.
