Sgardine writeup of Klein and Manning 2002
This is a review of Klein_2002_conditional_structure_versus_conditional_estimation_in_nlp_models by user:sgardine.
This paper attempts to disentangle the effects of conditional objective functions on the one hand from those of conditional model structures on the other, specifically in the context of NLP tasks. The ML community has some theory and some empirical results from other domains; this paper examines whether those results carry over to NLP.
For Word Sense Disambiguation, a Naive Bayes model was trained with a joint probability objective function and with various conditional probability objective functions. The best test-set accuracy (which, as the paper notes, is itself a conditional criterion, albeit not one that can be optimized directly) was achieved by maximizing a conditional objective function. This agrees with the prevailing view in the larger ML community, although the conditionally trained model outperformed the jointly trained model at smaller training-set sizes than expected. The authors speculate this is partly due to the use of smoothing (though they do not attempt to quantify this effect) and partly due to the prevalence of novel phenomena in the test corpus.
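To make the distinction concrete, here is a minimal sketch (my own, not the paper's code) of the two objectives over the same Naive Bayes parameterization: the joint objective scores log P(sense, context), while the conditional objective scores log P(sense | context) by renormalizing over the candidate senses for each example. The senses, features, and probabilities are invented toy values.

```python
# Minimal sketch (not the paper's code): the same Naive Bayes parameters
# scored under a joint vs. a conditional log-likelihood objective.
# Senses, features, and probabilities below are invented toy values.
import math

senses = ["plant/factory", "plant/flora"]
prior = {"plant/factory": 0.5, "plant/flora": 0.5}
# P(context word | sense), assumed already smoothed
cond = {
    "plant/factory": {"union": 0.20, "worker": 0.30, "leaf": 0.01},
    "plant/flora":   {"union": 0.02, "worker": 0.03, "leaf": 0.40},
}

def joint_ll(X, y):
    """Sum_i log P(y_i, x_i): what joint (maximum-likelihood) training maximizes."""
    return sum(math.log(prior[yi]) + sum(math.log(cond[yi][f]) for f in xi)
               for xi, yi in zip(X, y))

def conditional_ll(X, y):
    """Sum_i log P(y_i | x_i): the same scores, renormalized over senses per example."""
    total = 0.0
    for xi, yi in zip(X, y):
        scores = {s: math.log(prior[s]) + sum(math.log(cond[s][f]) for f in xi)
                  for s in senses}
        log_z = math.log(sum(math.exp(v) for v in scores.values()))
        total += scores[yi] - log_z
    return total

X = [["union", "worker"], ["leaf"]]
y = ["plant/factory", "plant/flora"]
print(joint_ll(X, y), conditional_ll(X, y))
```

The point of the sketch is that the model family is identical in both cases; only the quantity being maximized over the training data changes, which is exactly the axis the paper isolates from model structure.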
For the POS tagging task, the authors compare HMMs to (conditionally structured) MEMMs. The best results come from optimizing the parameters of an HMM with a conditional objective function. The authors conjecture that the MEMM fails at least in part because of observation bias: when a word's tag is deterministic, the local conditional distribution is a point mass, so the preceding states are effectively ignored. They then show that "unobserving" such unambiguous words, i.e. dropping them from the conditioning context, improves the MEMM's performance. I'd have to think about this more, but unambiguous words seem like just the limiting case of words with a highly likely tag (i.e. probability 1.0), so the problem they describe should persist, albeit less severely, whenever a word's tag distribution is strongly peaked.
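A toy numerical sketch of the effect (mine, with invented probabilities, not the paper's experiment): an unambiguous word contributes the same local factor of 1.0 to every path through the lattice, so it cannot help choose the preceding tag; after unobserving the word, the local factor varies with the previous tag again.

```python
# Toy illustration of observation bias in an MEMM-style local model,
# P(t_i | t_{i-1}, w_i). All probabilities are invented.

p_local = {
    # Local distribution over tags for an ambiguous first word (NN vs. VB reading).
    ("NN", "start"): 0.6,
    ("VB", "start"): 0.4,
    # P(TO | prev_tag, "to"): "to" is unambiguous, so this is a point mass
    # for every previous tag -- the observation-bias case.
    ("TO", "NN"): 1.0,
    ("TO", "VB"): 1.0,
}

# Path scores through the unambiguous word: the 1.0 factors are identical,
# so "to" never changes which previous tag wins.
score_nn = p_local[("NN", "start")] * p_local[("TO", "NN")]
score_vb = p_local[("VB", "start")] * p_local[("TO", "VB")]
print(score_nn, score_vb)  # 0.6 vs 0.4: decided entirely by the first factor

# After "unobserving" the word, the local factor falls back to a
# transition-style distribution P(TO | prev_tag), which does depend on prev_tag.
p_unobserved = {("TO", "NN"): 0.1, ("TO", "VB"): 0.7}  # invented values
score_nn_fix = p_local[("NN", "start")] * p_unobserved[("TO", "NN")]
score_vb_fix = p_local[("VB", "start")] * p_unobserved[("TO", "VB")]
print(score_nn_fix, score_vb_fix)  # 0.06 vs 0.28: the following context now matters
```

The same arithmetic suggests why I suspect the problem only shrinks rather than disappears for merely high-probability tags: a factor of 0.95 for every previous tag still contributes almost no evidence about which previous tag is correct.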