Nschneid writeup of Klein 2002

From Cohen Courses

This is Nschneid's review of Klein_2002_conditional_structure_versus_conditional_estimation_in_nlp_models

I found this to be a very nice paper. Its aim is to tease apart two issues: joint vs. conditional parameter estimation (as in a generative HMM vs. a linear-chain CRF) and joint vs. conditional model structure (as in an HMM vs. an MEMM). §2 experiments with different objective functions for training a Naïve Bayes word sense disambiguation (WSD) model, finding that "optimizing for a given objective generally gave the best score for that objective for both the training set and the test set"; i.e., it makes sense to train with a loss function that reflects your evaluation metric. Interestingly, this holds in their experiments even with a small amount of training data, contradicting a previous claim (Ng & Jordan 2002) that "the generative [joint model] should perform better in low-data situations."
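To make the estimation distinction concrete, here is a minimal sketch of my own (not from the paper) that scores one and the same Naïve Bayes model under two objectives: joint likelihood, the sum of log P(sense, context), and conditional likelihood, the sum of log P(sense | context). The toy data, the smoothing constant `alpha`, and all function names are assumptions made up for illustration.

```python
import math
from collections import Counter, defaultdict

# Toy labelled data: (sense, context words). Entirely made up for illustration.
DATA = [
    ("finance", ["money", "loan"]),
    ("finance", ["money", "river"]),
    ("river", ["water", "river"]),
    ("river", ["water", "money"]),
]

def fit_relative_frequency(data, alpha=0.1):
    """Smoothed relative-frequency estimates (the usual joint-likelihood-style fit)."""
    classes = sorted({c for c, _ in data})
    vocab = sorted({w for _, ws in data for w in ws})
    class_counts = Counter(c for c, _ in data)
    word_counts = defaultdict(Counter)
    for c, ws in data:
        word_counts[c].update(ws)
    prior = {c: class_counts[c] / len(data) for c in classes}
    likelihood = {}
    for c in classes:
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        likelihood[c] = {w: (word_counts[c][w] + alpha) / total for w in vocab}
    return prior, likelihood

def log_joint(prior, likelihood, c, words):
    """log P(c, words) under the Naive Bayes factorization."""
    return math.log(prior[c]) + sum(math.log(likelihood[c][w]) for w in words)

def joint_log_likelihood(prior, likelihood, data):
    """JL objective: sum over examples of log P(c, words)."""
    return sum(log_joint(prior, likelihood, c, ws) for c, ws in data)

def conditional_log_likelihood(prior, likelihood, data):
    """CL objective: sum over examples of log P(c | words)."""
    total = 0.0
    for c, ws in data:
        scores = [log_joint(prior, likelihood, k, ws) for k in prior]
        m = max(scores)  # log-sum-exp for numerical stability
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_joint(prior, likelihood, c, ws) - log_z
    return total

prior, likelihood = fit_relative_frequency(DATA)
print("JL:", joint_log_likelihood(prior, likelihood, DATA))
print("CL:", conditional_log_likelihood(prior, likelihood, DATA))
```

The model structure is identical in both cases; only the quantity being scored changes. Training for CL would mean searching for parameters that raise the second number, even at the cost of lowering the first.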

In §3, joint and conditional model structures are compared for POS tagging. The difference here is found to be more substantial than the difference due to the training objective: MEMMs make damaging independence assumptions about the data, in particular assuming "that states are independent of future observations". This leads to biases in which either the labels or the observations have disproportionate influence on the prediction: label bias and observation bias, respectively.
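The independence assumption quoted above can be checked numerically. The following is a rough sketch of my own, not the paper's experimental setup: a two-state, two-step toy model with made-up parameters, where the MEMM-style local distributions are built (by assumption) by renormalizing the HMM's factors at each position. Because each local distribution over the next state sums to one, the second observation cannot change the MEMM's belief about the first state, whereas it does in the HMM.

```python
# Toy parameters, invented for illustration only.
STATES = ["A", "B"]
PI = {"A": 0.5, "B": 0.5}
TRANS = {"A": {"A": 0.9, "B": 0.1}, "B": {"A": 0.2, "B": 0.8}}
EMIT = {"A": {"x": 0.7, "y": 0.3}, "B": {"x": 0.1, "y": 0.9}}

def normalize(scores):
    """Scale non-negative scores so they sum to one."""
    z = sum(scores.values())
    return {k: v / z for k, v in scores.items()}

def hmm_first_state_posterior(o1, o2):
    """P(s1 | o1, o2) in the joint (HMM) structure: o2 is evidence about s1."""
    scores = {
        s1: PI[s1] * EMIT[s1][o1]
        * sum(TRANS[s1][s2] * EMIT[s2][o2] for s2 in STATES)
        for s1 in STATES
    }
    return normalize(scores)

def memm_first_state_posterior(o1, o2):
    """P(s1 | o1, o2) in the conditional (MEMM) structure with local normalization."""
    first = normalize({s1: PI[s1] * EMIT[s1][o1] for s1 in STATES})
    marginal = {}
    for s1 in STATES:
        # Local distribution P(s2 | s1, o2); it sums to one for every s1 and o2.
        local = normalize({s2: TRANS[s1][s2] * EMIT[s2][o2] for s2 in STATES})
        marginal[s1] = sum(first[s1] * local[s2] for s2 in STATES)
    return marginal

for o2 in ["x", "y"]:
    print("o2 =", o2,
          "| HMM P(s1=A):", round(hmm_first_state_posterior("x", o2)["A"], 3),
          "| MEMM P(s1=A):", round(memm_first_state_posterior("x", o2)["A"], 3))
```

In this sketch the HMM's posterior for the first state shifts when the second observation changes, while the MEMM's stays fixed: later evidence has no route back to earlier tagging decisions.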

Technical question: in the third paragraph of §3, the authors state: "If we maximize C[onditional]L[ikelihood], we get (possibly deficient) HMMs which are instances of [CRFs]." As I understand it, deficiency arises when the model assigns some probability mass to invalid or contradictory structures, or fails to assign probability mass to valid structures. How might maximizing CL make the model deficient?