Saul and Pereira, EMNLP 1997

From Cohen Courses
Jump to navigationJump to search

Aggregate and mixed-order Markov modles for statistical language processing

This paper can be found at [1]

Citation

Lawrence Saul and Fernando Pereira. Aggregate and mixed-order Markov models for statistical language processing. In Proceedings of the Second Conference on Empirical Methods in Natural Language Processing, pages 81–89, Providence, Rhode Island, USA, August 1997.

Summary

The authors wanted to train a language model that used fewer parameters than an n-gram model, but still performed well. They developed aggregate and mixed-order markov models. While similar in some ways (mainly their Markovian properties), the new methods give fewer words 0 probability. It did perform worse than the n-gram models, but it used fewer parameters.

Aggregate Markov Models

They set for some number of classes . While this has the Markovian property, it is not a hidden markov model; it doesn't conform to some of the restrictions of an HMM. The idea is that classes will be developed from data that prove useful to calculate probabilities. They train using the EM algorithm: the E-step finds the probability of being in category given words : . The M-step calculates the model's parameters by counting. In general, the more classes, the lower the perplexity on test data. This makes sense. If there are classes (for the number of items in the vocabulary), then it would be a true bigram model. If there were one class, it would be a regular unigram model.

Mixed-order Markov Models

The size of n-grams increase with the square of . As a replacement, the authors devised a "skip-k" transition matrix that predicts the current word based on the word k-words in the past. A mixed-order markov model uses many different skip-k transition probabilities. If we used of the special bigram models, the probability from this model would be: . The weights control how much weight to apportion to each skip-k probability. This allows us to calculate that for some words it will be more useful to see the word right before us, while for others, it'll be more instructive to see the word three words in the past. Again, this is trained using EM. The E-step is to see the probability of each word assuming it came from . The M-step is to update the model by counting.

Experiments

They wanted to see the perplexity of the models generated. Unfortunately, the models still gave zero probabilities to some words so they had to be smoothed or interpolated. Smoothing mixed-order models did better the higher the , which is intuitive. The most interesting experiment was smoothing a regular trigram model using the model. They were able to greatly reduce the number of unseen words there were in the test set. The most interesting is that excluding trigrams that only occured once in the training corpus helped the perplexity in the interpolated case. In normal backoff, not including the seen-once trigrams increased the perplexity from 95.2 to 98.6.

Future Work

Aggregate Markov models can kinda be viewed as approximations of a full bigram model. In this way, the method is similar to SVD decomposition, which might be related. The classic question of generative vs discriminative models comes up. The authors argue that their generative model is fine because they train the mixture of generative models such that they model reality. Theirs models also train in a fraction of the time spent training Rosenfeld's models.

Related Work