Frietag 2000 Maximum Entropy Markov Models for Information Extraction and Segmentation

From Cohen Courses
Jump to navigationJump to search


McCallum, A. and Freitag, D. and Pereira, F. 2000. Maximum Entropy Markov Models for Information Extraction and Segmentation. In Proceedings of the Seventeenth International Conference on Machine Learning. 591--598

Online version

An online version of this paper is available [1].


This paper introduces Maximum Entropy Markov Models as sequential classification model that improves over Hidden Markov Models (HMMs).

Drawbacks with HMMs

The paper points out the following main drawbacks of HMMs

  • They don't allow the addition of arbitrary overlapping features of observation sequences
  • They are trained to maximize the joint likelihood of the state and observation sequences, while for many problems, the conditional likelihood of the states given the observations is more relevant

Mathematical Definition of MeMMs

In Maximum Entropy Markov models (MeMMs), the HMM transition and observation functions are replaced by a single function that provides the probability of the current state given the previous state and the current observation. In contrast to HMMs, in which the current observation only depends on the current state, the current observation in an MEMM may also depend on the previous state. In other words, the observations can be thought of as being associated with state transitions rather than with states. Moreover, for each state, such a transition function is given by a distinct Maxent model.

The paper illustrates a modification of the HMM Viterbi inference algorithm to perform inference for MeMMs. It also provides a Generalized Iterative Scaling (GIS) algorithm to train MeMMs, which is guaranteed to converge to a global maximum as long as there is at most one state per label.

For cases where training label sequences are unknown or ambiguous, the paper also gives a modification of the standard Baum-Welch training algorithm for HMMs.

Variations of MeMMs

The paper produces several variations of the basic MeMM architecture explained above:

  • Factored state representation

To deal with data sparseness problem in standard MeMMs (due to transition parameters), one can avoid having S different transition functions (one for each state), and just maintain one function, which uses information about the previous state as features. This reduces the expressive power of the model but allows sharing of information across states and alleviates sparseness problems.

  • Observations in states instead of transitions

Rather than combining transition and emission parameters into a single function, one could represent the transition probabilities as a standard multinomial, and P(S|O) using a Maxent model. This may also help with sparseness.

  • Environmental model for reinforcement learning

The transition function can also include an action, resulting in a model suitable for representing the environment of a reinforcement agent.


The authors trained 4 different types of models to classify lines from Usenet files into one of 4 categories: head, question, answer and tail. They used a set of 24 boolean features. The types of models they trained were: ME-Stateless (non-sequential Maxent), TokenHMM (a standard 4-state fully connected HMM), FeatureHMM (an HMM where the lines i.e. obsevations were replaced by their corresponding features), and the MeMM model described above. They found that the MeMM outperformed the other approaches.

Related papers

The original CRF (Lafferty 2001 Conditional Random Fields) papers, published just one year after this paper, introduces an even more expressive model, and CRFs are now considered state-of-the-art for sequential labeling tasks. For a background on exponential models and maximum entropy, the Berger et al CL 1996 paper that introduced the maximum entropy approach to NLP is suggested. The Ratnaparkhi, 1996 paper on Maxent POS tagging also introduces a closely related model.