Rosenfeld and Huang, 1992


Citation

Rosenfeld, R., and Huang, X. D., "Improvements in Stochastic Language Modeling". Proceedings of the DARPA Speech and Natural Language Workshop, Feb. 1992.

Online Version

[1]

Abstract

We describe two attempts to improve our stochastic language models. In the first, we identify a systematic overestimation in the traditional backoff model and use statistical reasoning to correct it. Our modification results in up to a 6% reduction in perplexity on various tasks. Although the improvement is modest, it is achieved with hardly any increase in the complexity of the model. Both analysis and empirical data suggest that the modification is most suitable when training data is sparse.
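The abstract does not spell out the correction itself, so the following is a minimal sketch of the setting it describes: a bigram model with absolute-discounting backoff to unigrams, plus a cap on backed-off estimates in the spirit of the overestimation fix. The `discount` and `cap_factor` values and the cap formula are illustrative assumptions, not the paper's derivation.

```python
from collections import Counter

def train_backoff_bigram(tokens, discount=0.5, cap_factor=0.5):
    """Bigram model with absolute discounting and backoff to unigrams.

    The cap on backed-off estimates below is a hypothetical stand-in for
    the paper's statistical correction: never having seen a bigram is
    itself evidence against a high probability, so frequent words should
    not inherit their full unigram share of the backoff mass.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    total = len(tokens)
    followers = {}                          # words observed after each w1
    for (w1, w2) in bigrams:
        followers.setdefault(w1, set()).add(w2)

    def p_unigram(w):
        return unigrams[w] / total

    def p_bigram(w1, w2):
        c1 = unigrams.get(w1, 0)
        if c1 == 0:
            return p_unigram(w2)            # unknown history: pure unigram
        c12 = bigrams.get((w1, w2), 0)
        if c12 > 0:
            return (c12 - discount) / c1    # discounted ML estimate
        seen = followers.get(w1, set())
        reserved = discount * len(seen) / c1             # mass freed by discounting
        unseen_mass = 1.0 - sum(p_unigram(v) for v in seen)
        if unseen_mass <= 0.0:
            return 0.0
        p_backoff = reserved * p_unigram(w2) / unseen_mass
        # Hypothetical cap: an unseen bigram should not be rated more
        # likely than a fraction of a once-seen bigram would be.
        # (Renormalization after capping is omitted for brevity.)
        return min(p_backoff, cap_factor * (1.0 - discount) / c1)

    return p_bigram
```

For example, `train_backoff_bigram("the cat sat on the mat".split())` returns a function `p_bigram(w1, w2)` that can be evaluated for any word pair; without the cap, the unseen pair `("mat", "the")` would receive nearly the full unigram probability of "the".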

In the second attempt, we propose a new type of adaptive language model. Existing adaptive models use a dynamic cache based on the history of the document seen up to that point. But another source of information in the history, within-document word-sequence correlations, has not yet been tapped. We describe a model that attempts to capture this information, using a framework where one word sequence triggers another, causing its estimated probability to be raised. We discuss various issues in the design of such a model, and describe our first attempt at building one. Our preliminary results include a perplexity reduction of between 10% and 32%, depending on the test set.
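The abstract leaves the trigger mechanism as an open design question, so the sketch below shows one simple instantiation: a static model linearly interpolated with a uniform distribution over words currently triggered by the document history. The interpolation weight `lam`, the `triggers` table, and the uniform trigger distribution are assumptions for illustration, not the paper's trained model.

```python
def trigger_adapted_prob(p_static, w, doc_history, triggers, lam=0.8):
    """P(w) under a simple trigger-based adaptive model.

    Words seen earlier in the document trigger related words, raising
    their estimated probability, as the abstract describes.  All
    parameters here are illustrative assumptions.
    """
    triggered = set()
    for h in doc_history:
        triggered.update(triggers.get(h, ()))
    if not triggered:
        return p_static(w)                 # no evidence yet: static model
    p_trig = 1.0 / len(triggered) if w in triggered else 0.0
    # If p_static is a proper distribution, the mixture still sums to 1.
    return lam * p_static(w) + (1.0 - lam) * p_trig


# Toy demo with hypothetical trigger pairs.
triggers = {"stock": {"market", "shares"}, "bank": {"loan", "interest"}}
p_static = lambda w: 0.001                 # flat stand-in static model
history = ["the", "stock", "price", "fell"]
print(trigger_adapted_prob(p_static, "market", history, triggers))
```

Here seeing "stock" earlier in the document raises the estimate for "market" well above the static 0.001, which is the qualitative behavior the trigger framework is after; the paper's actual model must also decide how to select trigger pairs and combine overlapping triggers.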