Practical very large CRFs
From Cohen Courses
Revision as of 01:59, 30 September 2011
Paper writeup in progress by Francis Keith
Citation
Lavergne, T., O. Cappé, and F. Yvon. Practical very large scale CRFs. ACL-2010.
Online Version
An online version of the paper is available here [1]
Summary
The goal of this paper is to show the challenges and results of using Conditional Random Fields with a very large number of features. They detail and compare 3 methods for optimizing the <math>\ell_1</math> penalty, as the sparsity introduced by the penalty makes the model significantly more compact.
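For reference, a sketch of the kind of penalized objective being minimized (the elastic-net-style combination of <math>\ell_1</math> and <math>\ell_2</math> terms with weights <math>\rho_1</math>, <math>\rho_2</math> is assumed here for illustration):

<math>
\ell(\theta) = -\sum_{i} \log p_{\theta}\left(y^{(i)} \mid x^{(i)}\right) + \rho_1 \lVert \theta \rVert_1 + \frac{\rho_2}{2} \lVert \theta \rVert_2^2
</math>

The <math>\ell_1</math> term drives many components of <math>\theta</math> exactly to zero, which is the sparsity that makes the model more compact.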
The methods they compare are:
- Quasi Newton
- Stochastic Gradient Descent
- Block Coordinate Descent
Implementation Issues with Large Scale CRFs
The paper goes into detail about a few issues with implementing CRFs on a large scale:
- As a generally accepted practice when using the Forward-Backward algorithm or any long probabilistic chain, people tend to work in the log-domain to guarantee avoidance of over/underflow. This causes a massive slowdown due to repeated calls to <math>\exp</math> and <math>\log</math>. They work around this by normalizing the forward and backward scores at each position, and vectorizing the remaining <math>\exp</math> calls.
- Computing the gradient often requires the most computation time. They resolve this by doing a few things:
- Using a sparse matrix <math>M</math>, the computation time becomes proportional to the average number of active features (as opposed to being proportional to all possible features)
- Parallelizing Forward-Backward
- In addition, Block Coordinate Descent can truncate the forward and backward probabilities after a certain point, which speeds up training immensely.
- For large scale feature vectors (of size <math>K</math>, which could be in the billions), memory begins to become an issue
- Block Coordinate Descent only requires a single vector of size <math>K</math>
- Stochastic Gradient Descent requires 2 vectors of size <math>K</math>
- Quasi Newton is implementation specific, but the authors' implementation required on the order of 12 vectors of size <math>K</math>
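The normalization trick for avoiding log-domain recursions can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function names and the HMM-style factorization of the potentials into initial, transition, and emission weights are assumptions made for the example.

```python
import numpy as np

def loglik_scaled(init, trans, emis):
    """Forward pass with per-position renormalization.

    Each alpha vector is rescaled to sum to 1 and the log of the
    scaling constants is accumulated, so only one log per position
    is needed instead of per-entry exp/log in every recursion.
    init: (K,) initial weights; trans: (K, K) transition weights;
    emis: (T, K) per-position weights -- all strictly positive.
    Returns the log partition function log Z.
    """
    alpha = init * emis[0]
    log_z = 0.0
    for t in range(1, len(emis)):
        c = alpha.sum()                  # scaling constant for this step
        log_z += np.log(c)
        alpha = (alpha / c) @ trans * emis[t]
    return log_z + np.log(alpha.sum())

def loglik_logdomain(init, trans, emis):
    """Reference: the slow variant that stays in the log domain,
    calling exp/log (via logsumexp) at every recursion step."""
    def logsumexp(v):
        m = v.max()
        return m + np.log(np.exp(v - m).sum())
    la = np.log(init) + np.log(emis[0])
    for t in range(1, len(emis)):
        la = np.array([logsumexp(la + np.log(trans[:, j]))
                       for j in range(trans.shape[1])]) + np.log(emis[t])
    return logsumexp(la)
```

Both functions return the same value of <math>\log Z</math>; the scaled version replaces the per-entry exp/log work at every position with a single log of the scaling constant.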
Experiments
The paper runs experiments on two domains: Part of Speech Tagging and Phonetization.