Klein 2002 conditional structure versus conditional estimation in nlp models
Conditional structure versus conditional estimation in NLP models, by D. Klein and C. D. Manning. In Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing, Volume 10, 2002.
This paper uses the example problems of word-sense disambiguation and part-of-speech tagging to present a clear framework in which one can place various sequential and non-sequential mathematical models such as Naive Bayes, Logistic Regression, Hidden Markov Models, Maximum Entropy Markov Models and Conditional Random Fields. It presents a few key dimensions along which one can differentiate between these models:
- Whether the objective function maximized during training is joint likelihood (JL) (e.g. Naive Bayes, HMMs) or conditional likelihood (CL) (e.g. logistic regression, CRFs). They also consider a variant of conditional likelihood called sum of conditional likelihoods (SCL), which adds up the probabilities assigned to the correct labels across training instances, whereas CL multiplies them (equivalently, sums their logs).
- Whether the model has a joint structure (HMMs, CRFs) or a conditional structure (conditional Markov models, or CMMs, of which MEMMs are an instance).
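The three objectives can be made concrete with a small sketch. The toy Naive Bayes parameters, the sense labels and the three training instances below are all hypothetical, chosen only so the numbers are easy to follow; nothing here is taken from the paper's experiments.

```python
import math

# Hypothetical toy Naive Bayes parameters (illustrative only, not from the paper).
# prior[c] = P(c); likelihood[c][w] = P(w | c)
prior = {"finance": 0.5, "river": 0.5}
likelihood = {
    "finance": {"money": 0.6, "water": 0.1, "bank": 0.3},
    "river": {"money": 0.1, "water": 0.6, "bank": 0.3},
}

def joint(label, words):
    """P(label, words) under the Naive Bayes factorization."""
    p = prior[label]
    for w in words:
        p *= likelihood[label][w]
    return p

def conditional(label, words):
    """P(label | words), obtained by normalizing the joint over all labels."""
    z = sum(joint(c, words) for c in prior)
    return joint(label, words) / z

def objectives(data):
    """Return (JL, CL, SCL) for a list of (label, words) training pairs."""
    jl = sum(math.log(joint(y, x)) for y, x in data)        # sum of log joint probabilities
    cl = sum(math.log(conditional(y, x)) for y, x in data)  # sum of log conditional probabilities
    scl = sum(conditional(y, x) for y, x in data)           # sum of conditional probabilities, no log
    return jl, cl, scl

# Hypothetical training set; the third instance is one the model gets wrong.
data = [
    ("finance", ["money", "bank"]),
    ("river", ["water", "bank"]),
    ("river", ["money"]),
]
jl, cl, scl = objectives(data)
print(f"JL={jl:.4f}  CL={cl:.4f}  SCL={scl:.4f}")
```

Each instance contributes at most 1 to SCL, so a single confidently wrong instance can cost SCL no more than 1, whereas it can drive CL (and JL) arbitrarily low through the logarithm.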
The authors conduct various experiments to determine the choices of the above dimensions that maximize performance for NLP applications. They find that:
- For word-sense disambiguation, Naive Bayes (NB) models trained to maximize CL or SCL outperform NB models trained to maximize JL. However, this may not hold when there is very little training data. Note that a Naive Bayes model trained to maximize unconstrained CL is equivalent to a (multiclass) logistic regression model.
- For part-of-speech tagging, there are two noteworthy results:
1) For a fixed structure, models that maximize CL (e.g. CRFs) usually perform better than models that maximize JL (e.g. HMMs).
2) For a fixed objective (JL or CL), models with joint structure outperform those with conditional structure.
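The equivalence between CL-trained Naive Bayes and logistic regression mentioned in the word-sense disambiguation result rests on the fact that the Naive Bayes posterior is already log-linear in the word counts. The sketch below checks this numerically; the parameters and names are hypothetical and purely illustrative.

```python
import math

# Hypothetical toy Naive Bayes parameters (illustrative only).
prior = {"finance": 0.4, "river": 0.6}
likelihood = {
    "finance": {"money": 0.6, "water": 0.1, "bank": 0.3},
    "river": {"money": 0.1, "water": 0.6, "bank": 0.3},
}
VOCAB = ["money", "water", "bank"]

def nb_posterior(words):
    """P(label | words) computed the Naive Bayes way: normalize the joint."""
    joint = {c: prior[c] * math.prod(likelihood[c][w] for w in words) for c in prior}
    z = sum(joint.values())
    return {c: p / z for c, p in joint.items()}

def softmax_posterior(words):
    """The same posterior written as a softmax over linear scores."""
    counts = {w: words.count(w) for w in VOCAB}               # feature vector: word counts
    bias = {c: math.log(prior[c]) for c in prior}             # bias = log prior
    weight = {c: {w: math.log(likelihood[c][w]) for w in VOCAB} for c in prior}
    score = {c: bias[c] + sum(weight[c][w] * counts[w] for w in VOCAB) for c in prior}
    z = sum(math.exp(s) for s in score.values())
    return {c: math.exp(s) / z for c, s in score.items()}

words = ["money", "bank", "money"]
print(nb_posterior(words))
print(softmax_posterior(words))
```

In this form the weights are tied to log-probabilities. Maximizing CL while letting the weights and biases take any real value, rather than requiring them to be logs of normalized distributions, is fitting a multiclass logistic regression over the same features.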
The authors also find that observation bias hurt the performance of their POS taggers more than label bias did.
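To make the joint versus conditional structure distinction concrete, the sketch below scores tag sequences under an HMM-style joint factorization and under a CMM-style factorization with a locally normalized distribution at each position. The tag set, the parameters, and in particular the way the local CMM distributions are derived from the HMM tables are assumptions made for illustration, not the authors' setup.

```python
from itertools import product

TAGS = ["D", "N"]  # hypothetical two-tag set
START = "<s>"

# Hypothetical toy parameters (illustrative only, not from the paper).
trans = {  # trans[prev][tag] = P(tag | prev)
    "<s>": {"D": 0.7, "N": 0.3},
    "D": {"D": 0.1, "N": 0.9},
    "N": {"D": 0.4, "N": 0.6},
}
emit = {  # emit[tag][word] = P(word | tag)
    "D": {"the": 0.9, "dog": 0.1},
    "N": {"the": 0.1, "dog": 0.9},
}

def hmm_joint(tags, words):
    """Joint structure: P(tags, words) = prod_i P(t_i | t_{i-1}) * P(w_i | t_i)."""
    p, prev = 1.0, START
    for t, w in zip(tags, words):
        p *= trans[prev][t] * emit[t][w]
        prev = t
    return p

def hmm_posterior(tags, words):
    """P(tags | words), normalized once over all complete tag sequences."""
    z = sum(hmm_joint(seq, words) for seq in product(TAGS, repeat=len(words)))
    return hmm_joint(tags, words) / z

def cmm_local(tag, prev, word):
    """Locally normalized P(tag | prev, word); sums to 1 over tags at every step."""
    z = sum(trans[prev][t] * emit[t][word] for t in TAGS)
    return trans[prev][tag] * emit[tag][word] / z

def cmm_conditional(tags, words):
    """Conditional structure: P(tags | words) = prod_i P(t_i | t_{i-1}, w_i)."""
    p, prev = 1.0, START
    for t, w in zip(tags, words):
        p *= cmm_local(t, prev, w)
        prev = t
    return p

words = ["the", "dog"]
for seq in product(TAGS, repeat=len(words)):
    print(seq, round(hmm_posterior(seq, words), 4), round(cmm_conditional(seq, words), 4))
```

Both functions define a distribution over tag sequences for the given words, but they generally disagree: the joint structure normalizes once over whole sequences, while the conditional structure normalizes at every position, so each step passes on all of its probability mass regardless of how well the remaining words fit.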
The original MEMM (Freitag 2000 Maximum Entropy Markov Models for Information Extraction and Segmentation) and CRF (Lafferty 2001 Conditional Random Fields) papers are good background reading for fully understanding the concepts in this paper. The Berger et al. 1996 Computational Linguistics paper that introduced the maximum entropy approach to NLP is also useful background.