Practical very large CRFs
Contents
Citation
Lavergne, T., O. Cappé, and F. Yvon. Practical very large scale CRFs. ACL-2010.
Online Version
An online version of the paper is available here [1]
Summary
The goal of this paper is to show challenges and results when trying to use Conditional Random Fields with a very large number of features. They detail and compare the 3 methods for computing the penalty, as the sparsity introduced by the penalty makes the model significantly more compact.
The methods they compare are:
The conclusions they reach are:
- Quasi Newton is probably the best when dealing with small/moderate sized applications
- Block Coordinate Descent is efficient with very large feature sets, as long as there is a limited observation alphabet
- Stochastic Gradient Descent seems to be the best choice for large scale applications, as long as it is fine-tuned
- Large-scale sparse models can be efficiently trained, and perform better than small models.
Implementation Issues with Large Scale CRFs
The paper goes into detail about a few issues with implementing CRFs on a large scale, many of which are useful for implementation issues:
- As a generally accepted practice when using the Forward-Backward algorithm or any large probabilistic chain, people tend to work in the log-domain to guarantee avoidance of over/underflow. This causes a massive slowdown due to repeated calls to and . They work around this by normalizing and calls, and vectorizing calls.
- Computing the gradient often requires the most computation time. They resolve this by doing a few things:
- Using a sparse matrix M, the computation time becomes proportional to the average number of active features (as opposed to being proportional to all possible features)
- Parallelizing Forward-Backward
- In addition, Block Coordinate Descent also has the property of being able to truncate forward and backward probabilities after a certain point, which speeds up the running of this immensely.
- For large scale feature vectors (of size , which could be billions), memory begins to become an issue
- Block Coordinate Descent only requires a single vector of size
- Stochastic Gradient Descent requires 2 vectors of size
- Quasi Newton is implementation specific, but the authors' implementation required on the order of 12 vectors of size
Experiments
The paper runs experiments on two domains: Part of Speech Tagging and Phonetization. Since runtime performance is also important, they run all of the processes on the same machine, and they are all using the same frameworks to allow for a fair competition.
Phonetization
They use Nettalk, which contains ~20k English words, and their pronunciations and some prosodic information. They attempt to produce phonetizations for the test data.
Results
Reg. is the regularization function used. Method is the method used (which n-gram features). # Feat. is the number of features selected, not the total possible feature space.
This shows the accuracy without prosodic features.
Reg. | Method | Iter. | # Feat. | Error | Time |
---|---|---|---|---|---|
OWL-QN | 1-gram | 63.4 | 4684 | 17.79% | 11min |
OWL-QN | 7-gram | 140.2 | 38214 | 8.12% | 1h02min |
OWL-QN | 5-gram+ | 141.0 | 43429 | 7.89% | 1h37min |
SGD | 1-gram | 21.4 | 3540 | 18.21% | 9min |
SGD | 5-gram+ | 28.5 | 34319 | 8.01% | 45min |
BCD | 1-gram | 28.2 | 5017 | 18.27% | 27min |
BCD | 7-gram | 9.2 | 3692 | 8.21 | 1h22min |
BCD | 5-gram+ | 8.7 | 47675 | 7.91% | 2h18min |
Comparative results on Nettalk using CRFs are difficult to find, but the two examples they give (88.4% accuracy and 91.7% accuracy) seem to make the results pretty decent.
When running with prosodic features, they could not produce a model for Quasi Newton, and they don't really have anything to compare the results to.
Reg. | Method | Error | Time |
---|---|---|---|
SGD | 5-gram | 14.71%/8.11% | 55min |
SGD | 5-gram+ | 13.91%/7.52% | 2h45min |
BCD | 5-gram | 14.57%/8.06% | 2h46min |
BCD | 7-gram | 14.12%/7.86% | 3h02min |
BCD | 5-gram+ | 13.85%/7.47% | 7h14min |
BCD | 5-gram++ | 13.69%/7.36% | 16h03min |
Part of Speech Tagging
They use the Penn TreeBank for training, and Wall Street Journal for testing. They don't really give an in-depth analysis of the results, only saying that BCD is impractical to train, leading them to the conclusion that the method should only be used when the number of "blocks" is small. They also describe using an incremental training task, that uses begins only using features on the current word, then after a couple iterations adds in bigrams, then after a couple iterations add trigrams, etc. It produces an error rate of 2.63%/2.75% for POS Tagging, though they don't report the run speed.
Related Work
This paper is very important to the work being done on CRFs. They are defined in this paper:
Another "efficiency" paper, this time modeling the problem of semantic role labeling as a tree labeling problem to allow for an efficient algorithm:
Comments
This is a pretty handy paper when implementing a CRF.
Random note on the re-normalizing alternative to log-domain arithmetic. for forward-backwards, supposedly appears way back in the Rabiner 1989 tutorial on HMM's: http://www.cs.ubc.ca/~murphyk/Bayes/rabiner.pdf ... though I haven't confirmed this. Anyway, i used the re-normalizing trick once for implementing CRF forward-backwards, it works very nicely. So doubles underflow not until you go past 1e+-300 or so... so you underflow only if there's a factor of 1e300 change to a probability between t and (t+1)... very difficult to get that big a magnitude so quick. (Built up over a 20-long sequence, though, it's quite easy to underflow, only requires a 1e15 change each timestep.) --Brendan 21:03, 13 October 2011 (UTC)