Yandongl's writeup of Sha & Pereira (2003)
In this paper the authors study fast training methods for conditional random fields (CRFs) and apply them to shallow parsing, specifically NP chunking. The paper starts with a comparison of machine learning approaches to sequence labeling: k-order generative probabilistic models such as HMMs on one hand, and discriminative sequence tagging approaches on the other. As the authors point out, generative models rely heavily on (conditional) independence assumptions between variables, and violations of these assumptions often make generative models difficult to train well. In contrast, sequential tagging approaches such as maximum-entropy taggers are trained directly to optimize labeling accuracy and hence tend to work better. However, one problem is that their training often reaches only a local maximum.
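To make the contrast concrete, here are the standard forms in my own notation (not quoted from the paper): a first-order HMM factorizes the joint probability under strong independence assumptions, while a conditional (maximum-entropy style) tagger models the label sequence given the whole observation sequence.

    % Generative HMM: joint over observations x_{1:T} and labels y_{1:T}
    p(x, y) = \prod_{t=1}^{T} p(y_t \mid y_{t-1}) \, p(x_t \mid y_t)

    % Conditional tagger: each label conditioned on its predecessor and on all of x
    p(y \mid x) = \prod_{t=1}^{T} p(y_t \mid y_{t-1}, x)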
CRFs combine the benefits of generative and classification models. Lafferty et al. showed that CRFs beat other models on a POS tagging task, using an iterative scaling algorithm for training that turns out to converge slowly. To address this, several alternative training methods have been proposed, such as conjugate gradient (CG), L-BFGS, and the voted perceptron.
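For reference, the linear-chain CRF and the penalized log-likelihood that these optimizers work on have the standard form below; this is textbook notation consistent with the paper, not a quotation of it.

    p_\lambda(y \mid x) = \frac{1}{Z_\lambda(x)} \exp\Big( \sum_{t} \sum_{k} \lambda_k f_k(y_{t-1}, y_t, x, t) \Big)

    L(\lambda) = \sum_{i} \log p_\lambda(y^{(i)} \mid x^{(i)}) - \sum_{k} \frac{\lambda_k^2}{2 \sigma^2}

    \frac{\partial L}{\partial \lambda_k}
      = \tilde{E}[f_k] - \sum_{i} E_{p_\lambda(\cdot \mid x^{(i)})}[f_k] - \frac{\lambda_k}{\sigma^2}

Here \tilde{E}[f_k] is the empirical feature count and the model expectations are computed with the forward-backward algorithm; GIS, CG, and L-BFGS differ in how they use this gradient to update the weights.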
The authors carry out a series of experiments comparing these training methods. The datasets are the RM data set of Ramshaw and Marcus and a modified version of the CoNLL-2000 data set of Tjong Kim Sang and Buchholz. The chunking CRFs have a second-order Markov dependency between chunk tags. A Gaussian prior on the parameters (a regularization term) is introduced because the F score reaches its maximum while the log-likelihood keeps growing, a sign of overfitting.
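The second-order dependency is encoded by making the CRF labels pairs of consecutive chunk tags; the small helper below is a hypothetical illustration of that relabeling, not the authors' code.

    def to_pair_tags(tags, boundary="O"):
        """Relabel each position with (previous tag, current tag) so a
        first-order CRF over pair labels effectively sees two tags of history."""
        pairs = []
        prev = boundary  # assumed sentence-boundary tag
        for tag in tags:
            pairs.append(prev + "|" + tag)
            prev = tag
        return pairs

    # Chunk tags for "He reckons the current account deficit"
    print(to_pair_tags(["B-NP", "B-VP", "B-NP", "I-NP", "I-NP", "I-NP"]))
    # ['O|B-NP', 'B-NP|B-VP', 'B-VP|B-NP', 'B-NP|I-NP', 'I-NP|I-NP', 'I-NP|I-NP']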
The results show, first, that CRFs beat the other models on NP chunking, and second, that effective training can be obtained with CG or L-BFGS, both of which converge much faster than GIS.
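As a present-day sanity check of this comparison, the third-party sklearn-crfsuite library (not used in the paper) exposes L-BFGS and averaged-perceptron training for linear-chain CRFs (related to, but not the same as, the paper's voted perceptron), though not CG or GIS; a minimal NP-chunking sketch on toy data might look like this.

    import sklearn_crfsuite

    # Toy data: one sentence as a list of word-level feature dicts, plus BIO chunk tags.
    X_train = [[{"word": "He"}, {"word": "reckons"}, {"word": "the"}, {"word": "deficit"}]]
    y_train = [["B-NP", "B-VP", "B-NP", "I-NP"]]

    for algo in ("lbfgs", "ap"):  # L-BFGS vs. averaged perceptron
        crf = sklearn_crfsuite.CRF(algorithm=algo, max_iterations=50)
        crf.fit(X_train, y_train)
        print(algo, crf.predict(X_train))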