Practical very large CRFs
Paper writeup in progress by Francis Keith
Citation
Lavergne, T., O. Cappé, and F. Yvon. Practical very large scale CRFs. ACL-2010.
Online Version
An online version of the paper is available here [1].
Summary
The goal of this paper is to show the challenges and results of training Conditional Random Fields with a very large number of features. The authors detail and compare three methods for training with an ℓ1 penalty, since the sparsity induced by the penalty makes the model significantly more compact.
The methods they compare are:
- Quasi-Newton
- Stochastic Gradient Descent
- Block Coordinate Descent
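As a point of reference, the objective being minimized is the negative conditional log-likelihood plus a sparsity-inducing ℓ1 term, possibly combined with an ℓ2 term (the elastic-net form). The notation below is the standard formulation rather than a copy from the paper; θ denotes the parameter vector and ρ1, ρ2 are assumed regularization weights:

\ell(\theta) = -\sum_{i} \log p_\theta\big(y^{(i)} \mid x^{(i)}\big) + \rho_1 \lVert \theta \rVert_1 + \frac{\rho_2}{2} \lVert \theta \rVert_2^2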
Implementation Issues with Large Scale CRFs
The paper goes into detail about a few issues with implementing CRFs on a large scale:
- As a generally accepted practice when using the Forward-Backward algorithm, or any long probabilistic chain, people tend to work in the log-domain to guarantee avoidance of over/underflow. This causes a massive slowdown due to repeated calls to exp and log. They work around this by normalizing the forward (α) and backward (β) recursions at each position instead, and by vectorizing the exp calls (see the first sketch after this list).
- Computing the gradient often requires the most computation time. They resolve this by doing a few things:
- Using a sparse representation of the matrix M, the computation time becomes proportional to the average number of active features, as opposed to the total number of possible features (see the second sketch after this list)
- Parallelizing Forward-Backward
- In addition, Block Coordinate Descent has the property that the forward and backward recursions can be truncated after a certain point, which speeds up training considerably
- For large scale feature vectors of size K (the number of features, which can reach into the billions), memory begins to become an issue
- Block Coordinate Descent only requires a single vector of size K
- Stochastic Gradient Descent requires 2 vectors of size K
- Quasi Newton is implementation specific, but the authors' implementation required on the order of 12 vectors of size K
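To make the normalization trick concrete, here is a minimal Python sketch (not the paper's C implementation) of scaled forward-backward for a simple linear chain with one unary score table and one transition matrix shared across positions; the function name and array layout are illustrative assumptions.

import numpy as np

def forward_backward_scaled(unary, trans):
    # unary[t, y]  : log score of label y at position t
    # trans[y, y2] : log score of the transition y -> y2
    # Rather than running the recursions in the log domain (repeated exp/log
    # calls), alpha and beta are renormalized at every position and exp() is
    # applied once, vectorized, to the whole score arrays.
    T, Y = unary.shape
    e_unary = np.exp(unary - unary.max(axis=1, keepdims=True))  # one vectorized exp
    e_trans = np.exp(trans - trans.max())

    alpha = np.empty((T, Y))
    beta = np.empty((T, Y))
    scale = np.empty(T)

    # Forward pass: rescale alpha to sum to 1 at every position.
    alpha[0] = e_unary[0]
    scale[0] = alpha[0].sum()
    alpha[0] /= scale[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ e_trans) * e_unary[t]
        scale[t] = alpha[t].sum()
        alpha[t] /= scale[t]

    # Backward pass reuses the same scaling factors.
    beta[T - 1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = e_trans @ (e_unary[t + 1] * beta[t + 1])
        beta[t] /= scale[t + 1]

    # Posterior marginals p(y_t | x); each row sums to 1, with no log calls.
    return alpha * beta, scale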
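The second sketch shows why gradient accumulation can scale with the number of active features: only features that actually fire at a position are touched. The data layout (a per-position dict from label pairs to lists of feature index/value pairs) is an assumption made for illustration, not the paper's data structure.

import numpy as np

def sparse_gradient(active, gold, marginals, K):
    # active[t]  : dict mapping a label pair (y_prev, y) to a list of
    #              (feature_index, value) pairs that fire at position t
    # gold[t]    : observed label pair (y_prev, y) at position t
    # marginals  : array of shape (T, Y, Y) with p(y_{t-1}, y_t | x)
    # K          : total feature count (a dense vector is used here only
    #              for brevity; a real system would store it more compactly)
    # The inner loops visit only features that actually fire, so the cost is
    # proportional to the average number of active features, not to K.
    grad = np.zeros(K)
    for t, feats in enumerate(active):
        for pair, kvs in feats.items():
            p = marginals[t, pair[0], pair[1]]
            for k, v in kvs:
                grad[k] -= p * v          # expected feature count (model side)
        for k, v in feats.get(gold[t], []):
            grad[k] += v                  # observed feature count (data side)
    return grad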
Experiments
The paper runs experiments on two domains: Part of Speech Tagging and Phonetization. Since runtime performance is also important, all of the processes are run on the same machine and use the same frameworks, to allow for a fair comparison.
Phonetization
They use Nettalk, which contains 20k English words along with their pronunciations and some prosodic information.