Nlao writeup of Klein 2002
This is a review of Klein_2002_conditional_structure_versus_conditional_estimation_in_nlp_models by user:Nlao.
In this paper, the authors compare pairs of closely related methods: local/joint NB vs. MaxEnt on a disambiguation task, and HMM vs. MEMM on a POS tagging task.
Generative models (e.g. the HMM) are believed to be less accurate than discriminative models, because not all of the modeling power is focused on the tags. However, the HMM is observed to outperform the MEMM. People attribute this to the "label bias" problem, whose definition I found vague in the literature. My understanding is that there are two distinct problems involved here.
The first is a local normalization problem: when scores are normalized locally (as in the MEMM), certain information is lost. For example, even if all possible values of S_t are unlikely conditioned on S_{t-1}=a, the model will not try to avoid S_{t-1}=a, because the scores over S_t are normalized before being combined into the joint score. The HMM should also suffer from this problem (although its normalization happens automatically).
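A tiny numeric sketch of this point (the tags and scores are made up, not from the paper): once each step's scores are renormalized to sum to one, the evidence that every continuation of a given state was weak is thrown away.

```python
import numpy as np

# Hypothetical raw compatibility scores for the two possible next tags,
# conditioned on the previous tag.
raw_scores = {
    "a": np.array([0.01, 0.02]),  # every continuation of "a" is weakly supported
    "b": np.array([5.00, 1.00]),  # "b" has one strongly supported continuation
}

for prev_tag, scores in raw_scores.items():
    local = scores / scores.sum()  # MEMM-style per-step normalization
    print(prev_tag, "raw:", scores, "locally normalized:", local)

# After normalization the best continuation of "a" gets probability ~0.67,
# not far below the best continuation of "b" (~0.83), so a decoder sees
# almost no penalty for passing through "a", even though the raw evidence
# against it was overwhelming.
```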
The second is a bias problem. The HMM structure weights the state-state and the state-observation correlations evenly. In contrast, by combining these two aspects into one local model, the MEMM may bias towards either the state-state or the state-observation correlation, which leads to "label bias" or "observation bias" depending on the data. We can actually help the MEMM overcome this problem by letting it train two local models, one that looks only at the states and one that looks only at the observations, and then crudely combining their scores (see the sketch below).
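A minimal sketch of that two-model fix, assuming the two local distributions at some position have already been trained separately (all numbers here are hypothetical):

```python
import numpy as np

def normalize(v):
    return v / v.sum()

# Hypothetical per-tag scores at one position, from two separately trained
# local models: one conditioned only on the previous tag, one only on the
# current observation.
p_tag_given_prev_tag = normalize(np.array([0.6, 0.3, 0.1]))  # state-state model
p_tag_given_obs      = normalize(np.array([0.1, 0.1, 0.8]))  # state-observation model

# Crude product combination: neither information source can be down-weighted
# during training, because each model was fit on its own.
combined = normalize(p_tag_given_prev_tag * p_tag_given_obs)
print(combined)  # ~[0.35, 0.18, 0.47] -- both sources still contribute
```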
In conclusion, being discriminative and having a good model structure are both important. The local normalization problem (rather than the bias problem, as is usually claimed) is the one solved by CRFs.
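For reference, a standard way to write the contrast (notation is mine, not the paper's): the MEMM renormalizes at every step, while the CRF keeps one global partition function, so a state whose continuations are all poorly supported is penalized in the joint score instead of being normalized away.

```latex
% MEMM: one local partition function per step
P_{\mathrm{MEMM}}(s_{1:T} \mid o_{1:T})
  = \prod_{t=1}^{T}
    \frac{\exp\!\big(w \cdot f(s_t, s_{t-1}, o_t)\big)}
         {\sum_{s'} \exp\!\big(w \cdot f(s', s_{t-1}, o_t)\big)}

% CRF: a single global partition function over whole tag sequences
P_{\mathrm{CRF}}(s_{1:T} \mid o_{1:T})
  = \frac{\exp\!\big(\sum_{t=1}^{T} w \cdot f(s_t, s_{t-1}, o_t)\big)}
         {\sum_{s'_{1:T}} \exp\!\big(\sum_{t=1}^{T} w \cdot f(s'_t, s'_{t-1}, o_t)\big)}
```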
[minor points]
- when a task has two popular solutions, it is worth the time to do a deep comparison between them