Ratnaparkhi EMNLP 1996

Citation

Ratnaparkhi, A. (1996). A Maximum Entropy Model for Part-of-Speech Tagging. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP-96).

Online Version

http://acl.ldc.upenn.edu/W/W96/W96-0213.pdf

Summary

This paper applied maximum entropy models to the task of part-of-speech tagging. Later papers explore the maximum entropy idea within stricter Markov-model frameworks; here its chief benefit is the incorporation of very expressive features and the flexible use of context as input to the model. Key ideas include:

  • Maximum entropy imposes no distributional assumptions beyond requiring each feature's expected value under the model to match its average over the training sample (see the model sketch after this list).
  • Rich, overlapping features (i.e., features need not be independent). Each feature is a joint predicate over the candidate tag and the word's history.
  • Exploits context to improve POS tagging, including the previous tag(s) as input features.
  • Uses beam search for decoding rather than the Viterbi-style dynamic programming typical of HMM taggers (a decoding sketch follows this list).
  • Employs a tag dictionary to filter out tags known to be incorrect for specific common words; this helps speed more than accuracy.
  • Specialized word features target frequently mis-tagged words.
  • Unseen words at test time are modeled like rare words during training, via morphological features (see the feature sketch below).
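
To make the first two points concrete, the model takes the standard conditional log-linear form (the paper writes it as a product of feature weights \alpha_j^{f_j(h,t)}; the exponential parameterization below is equivalent). A history h bundles the surrounding words and the preceding tags, and training with Generalized Iterative Scaling chooses weights so that each feature's model expectation matches its empirical average:

  p(t \mid h) = \frac{1}{Z(h)} \exp\Big(\sum_j \lambda_j f_j(h,t)\Big),
  \qquad
  Z(h) = \sum_{t'} \exp\Big(\sum_j \lambda_j f_j(h,t')\Big)

  \text{subject to}\quad
  \sum_{h,t} \tilde{p}(h)\, p(t \mid h)\, f_j(h,t)
  = \sum_{h,t} \tilde{p}(h,t)\, f_j(h,t) \quad \text{for all } j.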
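
The sketch below illustrates how overlapping tag-and-history features and the rare-word treatment fit together. The template names and bookkeeping are illustrative rather than the paper's verbatim definitions, though the ingredients (word and tag context, prefixes and suffixes up to length four, number/uppercase/hyphen flags for rare words) follow the paper:

  def extract_features(words, tags, i):
      """Feature templates in the spirit of Ratnaparkhi (1996).

      words: the sentence; tags: tags predicted so far (positions 0..i-1);
      i: the current position. Returns a list of string-valued features.
      """
      w = words[i]
      t1 = tags[i - 1] if i > 0 else "BOS"
      t2 = tags[i - 2] if i > 1 else "BOS"
      feats = [
          "w=" + w,                                            # current word
          "t-1=" + t1,                                         # previous tag
          "t-2,t-1=" + t2 + "," + t1,                          # previous two tags
          "w-1=" + (words[i - 1] if i > 0 else "BOS"),         # previous word
          "w+1=" + (words[i + 1] if i + 1 < len(words) else "EOS"),  # next word
      ]
      # Rare (and hence unseen) words also get morphological features:
      # prefixes/suffixes up to length 4 and simple word-shape flags.
      for k in range(1, min(4, len(w)) + 1):
          feats.append("prefix=" + w[:k])
          feats.append("suffix=" + w[-k:])
      if any(c.isdigit() for c in w):
          feats.append("has-number")
      if any(c.isupper() for c in w):
          feats.append("has-upper")
      if "-" in w:
          feats.append("has-hyphen")
      return feats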
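
And a minimal decoding sketch, assuming a trained model exposed through a hypothetical log_prob(features, tag) scorer plus the extract_features helper above (both names are placeholders, not the paper's). At each position only the top-scoring partial tag sequences survive, rather than running Viterbi-style dynamic programming over all tag states:

  def beam_search(words, tagset, log_prob, beam_size=5):
      """Left-to-right beam decoding over tag sequences."""
      beam = [([], 0.0)]  # (partial tag sequence, cumulative log-probability)
      for i in range(len(words)):
          candidates = []
          for tags, score in beam:
              feats = extract_features(words, tags, i)
              for tag in tagset:
                  candidates.append((tags + [tag], score + log_prob(feats, tag)))
          # Prune: keep only the beam_size highest-scoring hypotheses.
          candidates.sort(key=lambda c: c[1], reverse=True)
          beam = candidates[:beam_size]
      return beam[0][0]  # tag sequence of the best surviving hypothesis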

The maxent POS tagger performed comparably to other state-of-the-art taggers. Compared to HMMs, it allowed more diverse features; compared to decision trees, it did not require word classes to avoid data fragmentation; and compared to rule-based systems, it output probabilities that downstream components in an NLP pipeline can use.

Related Papers

  • McCallum et al., ICML 2000 apply maxent models similarly, but their Markov model allows for more flexible transition structure, helping to avoid data sparsity.
  • Brants ANLP 2000 argues that a carefully built HMM tagger performs at least as well as maxent models for POS tagging.