Practical very large CRFs
Paper writeup in progress by Francis Keith
Citation
Lavergne, T., O. Cappé, and F. Yvon. Practical very large scale CRFs. ACL-2010.
Online Version
An online version of the paper is available here: http://www.aclweb.org/anthology-new/P/P10/P10-1052.pdf
Summary
The goal of this paper is to show the challenges and results of using Conditional Random Fields with a very large number of features. The authors detail and compare three methods for training under the ℓ1 penalty, since the sparsity the ℓ1 penalty introduces makes the resulting model significantly more compact.
The methods they compare are:

- Quasi-Newton
- Stochastic Gradient Descent
- Block Coordinate Descent
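For reference, the training criterion that all three methods minimize can be written (in generic notation, not copied from the paper) as a negative conditional log-likelihood plus an ℓ1 penalty:

<math>
\min_{\theta}\; -\sum_{n=1}^{N} \log p_{\theta}\!\left(y^{(n)} \mid x^{(n)}\right) \;+\; \rho \lVert \theta \rVert_1,
\qquad \lVert \theta \rVert_1 = \sum_{k=1}^{K} \lvert \theta_k \rvert
</math>

Because the penalty is non-differentiable at zero, it drives many of the K feature weights exactly to zero, which is what makes the trained model compact; the three methods above are different ways of minimizing this non-smooth objective.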
Implementation Issues with Large Scale CRFs
The paper goes into detail about a few issues with implementing CRFs on a large scale:
- As a generally accepted practice, when using the Forward-Backward algorithm (or any long probabilistic chain) people tend to work in the log domain to avoid numerical over/underflow. This causes a massive slowdown due to repeated calls to exp and log. The authors work around this by normalizing the forward and backward scores (α and β) at each position and by vectorizing the exp calls; a minimal sketch of this scaling trick is given after this list.
- Computing the gradient often requires the most computation time. They address this in a few ways:
  - Using a sparse representation of the matrix M, the computation time becomes proportional to the average number of active features per position, rather than to the total number of possible features (see the sparse scoring sketch after this list)
  - Parallelizing Forward-Backward
  - In addition, Block Coordinate Descent can truncate the forward and backward recursions after a certain point, which speeds up this computation immensely.
- For large-scale feature vectors (of size K, which could be in the billions), memory begins to become an issue; a back-of-the-envelope memory estimate is given after this list
  - Block Coordinate Descent only requires a single vector of size K
  - Stochastic Gradient Descent requires 2 vectors of size K
  - Quasi-Newton is implementation specific, but the authors' implementation required on the order of 12 vectors of size K
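The scaling trick referenced in the first bullet is sketched below. This is a minimal illustration, not the authors' implementation: it assumes dense (Y × Y) log-potential matrices per position, calls exp once per position in vectorized form, and rescales α at every step so the recursion stays in a safe numeric range while log Z is still recovered exactly.

<pre>
import numpy as np

def scaled_forward(start_log_pot, trans_log_pot):
    """Scaled forward pass for a linear chain (illustrative sketch).

    start_log_pot: (Y,) log potentials for the first position.
    trans_log_pot: (T-1, Y, Y) log potentials; [t, i, j] scores label i
    at position t followed by label j at position t+1.

    Returns log Z without running the whole recursion in the log domain:
    exp is applied once per position (vectorized) and alpha is rescaled
    to sum to 1, with the normalizers accumulated in log space.
    """
    alpha = np.exp(start_log_pot)
    scale = alpha.sum()
    alpha /= scale
    log_z = np.log(scale)
    for M_log in trans_log_pot:
        M = np.exp(M_log)            # one vectorized exp per position
        alpha = alpha @ M            # alpha_t(j) = sum_i alpha_{t-1}(i) * M_t(i, j)
        scale = alpha.sum()
        alpha /= scale               # keep alpha well-scaled
        log_z += np.log(scale)       # log Z is the sum of the log normalizers
    return log_z
</pre>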
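The sparse-matrix point can be illustrated with a toy scoring routine. The encoding below (a list of (feature index, value) pairs per position, and a dense weight table indexed by feature and label) is hypothetical, not the paper's data structure; the point is only that the cost scales with the number of features that actually fire at a position, not with the full feature count K.

<pre>
import numpy as np

def position_scores(active_feats, weights, n_labels):
    """Per-label scores for one position, computed from active features only.

    active_feats: list of (feature_index, value) pairs firing at this position.
    weights: dense array of shape (K, n_labels).

    Cost is O(#active features * n_labels) per position instead of
    O(K * n_labels), which is what makes very large K practical.
    """
    scores = np.zeros(n_labels)
    for f, v in active_feats:
        scores += v * weights[f]     # touch only the rows of firing features
    return scores
</pre>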
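To make the memory point concrete, here is a rough estimate assuming double-precision weights and, hypothetically, K = 10^9 features; the counts of one, two, and roughly twelve K-sized vectors come from the list above.

<pre>
# Rough memory needed for parameter-sized vectors of 8-byte floats,
# assuming (hypothetically) K = 1e9 features.
K = 10**9
gib_per_vector = 8 * K / 2**30       # about 7.5 GiB per vector

print(f"Block Coordinate Descent    (1 vector):    {1 * gib_per_vector:5.1f} GiB")
print(f"Stochastic Gradient Descent (2 vectors):   {2 * gib_per_vector:5.1f} GiB")
print(f"Quasi-Newton, authors' impl. (~12 vectors): {12 * gib_per_vector:5.1f} GiB")
</pre>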