Difference between revisions of "Saul and Pereira, EMNLP 1997"
Line 10: | Line 10: | ||
==Aggregate Markov Models== | ==Aggregate Markov Models== | ||
− | They set <math>P(w_2|w_1) = \sum_{i=1}^C P(w_2|c)P(c|w_1)</math> for some number of classes <math>C</math>. | + | They set <math>P(w_2|w_1) = \sum_{i=1}^C P(w_2|c)P(c|w_1)</math> for some number of classes <math>C</math>. While this has the Markovian property, it is not a [[HMM|hidden markov model]]; it doesn't conform to some of the restrictions of an HMM. The idea is that classes will be developed from data that prove useful to calculate probabilities. They train using the [[UsesMethod::EM]] algorithm: the E-step finds the probability of <math>w_1</math> being in category <math>c</math> given words <math>w_1, w_2</math>: <math>P(c|w_1 w_2) = \dfrac{P(w_2|c)P(c|w_1)}{\sum_{c'}P(w_2|c')P(c'|w1)}</math>. The M-step calculates the model's parameters by counting. In general, the more classes, the lower the perplexity on test data. This makes sense. If there are <math>V</math> classes (for the number of items in the vocabulary), then it would be a true bigram model. If there were one class, it would be a regular unigram model. |
+ | |||
+ | ==Mixed-order Markov Models== | ||
+ | |||
+ | |||
==Future Work== | ==Future Work== |
Revision as of 03:23, 1 December 2011
Aggregate and mixed-order Markov modles for statistical language processing
This paper can be found at [1]
Contents
Citation
Lawrence Saul and Fernando Pereira. Aggregate and mixed-order Markov models for statistical language processing. In Proceedings of the Second Conference on Empirical Methods in Natural Language Processing, pages 81–89, Providence, Rhode Island, USA, August 1997.
Summary
The authors wanted to train a language model that used fewer parameters than an n-gram model, but still performed well. They developed aggregate and mixed-order markov models. While similar in some ways (mainly their Markovian properties), the new methods give fewer words 0 probability. It did perform worse than the n-gram models, but it used fewer parameters.
Aggregate Markov Models
They set for some number of classes . While this has the Markovian property, it is not a hidden markov model; it doesn't conform to some of the restrictions of an HMM. The idea is that classes will be developed from data that prove useful to calculate probabilities. They train using the EM algorithm: the E-step finds the probability of being in category given words : . The M-step calculates the model's parameters by counting. In general, the more classes, the lower the perplexity on test data. This makes sense. If there are classes (for the number of items in the vocabulary), then it would be a true bigram model. If there were one class, it would be a regular unigram model.
Mixed-order Markov Models
Future Work
Aggregate Markov models can kinda be viewed as approximations of a full bigram model. In this way, the method is similar to SVD decomposition, which might be related. The classic question of generative vs discriminative models comes up. The authors argue that their generative model is fine because they train the mixture of generative models such that they model reality. Theirs models also train in a fraction of the time spent training Rosenfeld's models.
Related Work
- Rosenfeld, Computer Speech and Language 1996 worked on language models, but from a discriminative stance.
- Other combinations of language models include work done by Huang et al, Computer Speech and Language 1993 and Ney, et al, Computer Speech and Language 1994
- Jelinek et al, Advances in Speech Signal Processing 1992