Difference between revisions of "Lafferty 2001 Conditional Random Fields"

From Cohen Courses
Jump to navigationJump to search
 
(5 intermediate revisions by the same user not shown)
Line 2: Line 2:
  
 
John Lafferty, Fernando Pereira, and Andrew McCallum. 2001. Conditional random fields: Probabilistic models for segmenting
 
John Lafferty, Fernando Pereira, and Andrew McCallum. 2001. Conditional random fields: Probabilistic models for segmenting
and labeling sequence data. In ICML.
+
and labeling sequence data. In Proceedings of ICML.
  
 
== Online version ==
 
== Online version ==
Line 10: Line 10:
 
== Summary ==
 
== Summary ==
  
This [[Category::paper]] introduces [[UsesMethod::Conditional Random Fields]] as sequential classification model. They improve over HMMs and MeMMs in the follwing ways:
+
This [[Category::paper]] introduces [[UsesMethod::Conditional Random Fields]] as sequential classification model.  
  
* Unlike HMMS, they allow the addition of arbitrary features of the observations. These features can be at any level of the observations, and can also be overlapping.
+
==Drawbacks with HMMs and MeMMs ==
  
* Unlike HMMs, they are optimized according to the conditional likelihood of the state sequence given the observation sequence, instead of according to their joint likelihood, as HMMs are. Consequently, they are able to achieve greater accuracy for many NLP sequential labeling tasks.
+
The paper points out the some of main drawbacks in generative models such as HMMs and in discriminative ones such as Maximum Entropy Markov Models:
 +
* HMMs don't allow the addition of overlapping arbitrary features of the observations.  
  
*Unlike MeMMs, they are normalized globally instead of locally over the input observation sequence. This helps them avoid the label bias problem in MeMMs, where the transitions going out from a state compete only against each other, as opposed to all the other transitions in the model.
+
* MeMMs are normalized locally over each observation, and hence suffer from the [[AddressesProblem::Label Bias problem]], where the transitions going out from a state compete only against each other, as opposed to all the other transitions in the model.
  
The key points of Conditional Random Fields, as introduced by the paper are:
+
==Mathematical Definition of CRFs==
  
* There are no directed arcs from observations to states or vice versa, rather the model is a type of undirected Markov Random Field.
+
The paper then introduces Conditional Random Fields as a model that addresses both of these problems, and defines them formally:
  
* Flow going out of the states need not be normalized and sum to 1.
+
[[File:Crf2.png]]
  
* The structure of the model is generative, as HMMs, but the optimization criterion is conditional likelihood, as in MeMMs.
+
The probability P(Y/X) of a state sequence Y given an observation sequence X is:
  
* Unlike MeMMs, inference is required during training. This makes training substantially slower than for MeMMs.
+
[[File:Crf3.png]]
  
* For testing however, inference can be efficently done by using a modified Viterbi algorithm, just as in HMMs and MeMMs.
+
where y|S is the set of components of y associated with the vertices in subgraph S, and features <math>f_k</math> and <math>g_k</math> are given and fixed
 +
CRFs can be conceptualized as:
  
* Training is performed using Improved Iterative Scaling.
+
[[File:Crf1.png]]
  
CRFs are considered state of the art for sequential labeling tasks in NLP. After this paper, subsequent research has introduced faster ways of training CRFs such as L-BFGS, Stochastic Gradient Descent, etc.
+
==Parameter Estimation for CRFs==
 +
The paper provides two methods to perform parameter estimation (training) for CRFs, both based on improved iterative scaling. One iteration of either of these methods has roughly the same time and space complexity as the Baum-Welch training algorithm for HMMs. Both are guaranteed to converge.
 +
 
 +
==Experiments==
 +
 
 +
The authors performs three different types of experiments:
 +
 
 +
* They generate synthetic data from an HMM, and use this data to train a CRF and an MeMM with the same structures. They find that the CRF error is 4.6% and the MeMM error is 42%, thus proving that the CRF solves the label bias problem encountered by the MeMM.
 +
 
 +
* They generate synthetic data using a mix of first-order and second-order HMMs, and train first-order HMM, MeMM and CRF models on this data, without using any overlapping features. They find that CRFs perform the best, thus showing their robustness to incorrect modeling assumptions. The addition of overlapping features substantially improves the performance of CRFs and MeMMs
 +
 
 +
* Finally, they perform POS tagging on a subset of the Penn Treebank, using an HMM, MeMM and a CRF. They repeat this both without and with orthographic features. Without orthographic features, the HMM outperforms the MeMM and the CRF outperforms the HMM, while with them, the MeMM and CRF both significantly outperform HMMs, and the CRF still remains the best.
 +
 
 +
==Conclusion==
 +
The authors conclude that CRFs offer the following significant advantages: discriminative training, combination of arbitrary, overlapping  features from both the past and future; efficient training and decoding based on dynamic programming; and parameter estimation guaranteed to find the global optimum. Their main disadvantage is the slow convergence of the training algorithm compared to MeMMs and HMMs.
  
 
== Related papers ==
 
== Related papers ==
  
The [[RelatedPaper::Sha 2003 shallow parsing with conditional random fields]] uses a simpler version of CRFs called linear-chain CRFs ,that model the states as being a chain, to perform NP (Noun Phrase) chunking.
+
The [[RelatedPaper::Sha 2003 shallow parsing with conditional random fields]] paper uses a simpler version of CRFs called linear-chain CRFs ,that model the states as being a chain, to perform NP (Noun Phrase) chunking. It also compares different methods of training CRFs, such as CG, L-BFGS, GIS etc.

Latest revision as of 11:57, 6 October 2010

Citation

John Lafferty, Fernando Pereira, and Andrew McCallum. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of ICML.

Online version

An online version of this paper is available [1].

Summary

This paper introduces Conditional Random Fields as sequential classification model.

Drawbacks with HMMs and MeMMs

The paper points out the some of main drawbacks in generative models such as HMMs and in discriminative ones such as Maximum Entropy Markov Models:

  • HMMs don't allow the addition of overlapping arbitrary features of the observations.
  • MeMMs are normalized locally over each observation, and hence suffer from the Label Bias problem, where the transitions going out from a state compete only against each other, as opposed to all the other transitions in the model.

Mathematical Definition of CRFs

The paper then introduces Conditional Random Fields as a model that addresses both of these problems, and defines them formally:

Crf2.png

The probability P(Y/X) of a state sequence Y given an observation sequence X is:

Crf3.png

where y|S is the set of components of y associated with the vertices in subgraph S, and features and are given and fixed CRFs can be conceptualized as:

Crf1.png

Parameter Estimation for CRFs

The paper provides two methods to perform parameter estimation (training) for CRFs, both based on improved iterative scaling. One iteration of either of these methods has roughly the same time and space complexity as the Baum-Welch training algorithm for HMMs. Both are guaranteed to converge.

Experiments

The authors performs three different types of experiments:

  • They generate synthetic data from an HMM, and use this data to train a CRF and an MeMM with the same structures. They find that the CRF error is 4.6% and the MeMM error is 42%, thus proving that the CRF solves the label bias problem encountered by the MeMM.
  • They generate synthetic data using a mix of first-order and second-order HMMs, and train first-order HMM, MeMM and CRF models on this data, without using any overlapping features. They find that CRFs perform the best, thus showing their robustness to incorrect modeling assumptions. The addition of overlapping features substantially improves the performance of CRFs and MeMMs
  • Finally, they perform POS tagging on a subset of the Penn Treebank, using an HMM, MeMM and a CRF. They repeat this both without and with orthographic features. Without orthographic features, the HMM outperforms the MeMM and the CRF outperforms the HMM, while with them, the MeMM and CRF both significantly outperform HMMs, and the CRF still remains the best.

Conclusion

The authors conclude that CRFs offer the following significant advantages: discriminative training, combination of arbitrary, overlapping features from both the past and future; efficient training and decoding based on dynamic programming; and parameter estimation guaranteed to find the global optimum. Their main disadvantage is the slow convergence of the training algorithm compared to MeMMs and HMMs.

Related papers

The Sha 2003 shallow parsing with conditional random fields paper uses a simpler version of CRFs called linear-chain CRFs ,that model the states as being a chain, to perform NP (Noun Phrase) chunking. It also compares different methods of training CRFs, such as CG, L-BFGS, GIS etc.