Practical very large CRFs


Paper writeup in progress by Francis Keith

== Citation ==

Lavergne, T., Cappé, O., and Yvon, F. Practical Very Large Scale CRFs. In ''Proceedings of ACL 2010''.

== Online Version ==

An online version of the paper is available here: [1]

== Summary ==

The goal of this paper is to present the challenges and results of training Conditional Random Fields with a very large number of features. They detail and compare three optimization methods for handling the sparsity-inducing <math>\ell_1</math> penalty, as the sparsity introduced by the penalty makes the model significantly more compact.

The methods they compare are:

* [[UsesMethod::Quasi Newton|OWL-QN]] (orthant-wise limited-memory quasi-Newton)
* [[UsesMethod::Stochastic Gradient Descent|SGD]] (stochastic gradient descent)
* [[UsesMethod::Block Coordinate Descent|BCD]] (block coordinate descent)
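
To make the role of the penalty concrete, here is a minimal sketch of the kind of penalized objective involved (the notation and the exact form of the penalty are assumptions for illustration; the paper's criterion may also include an <math>\ell_2</math> term):

:<math>
\mathcal{L}(\theta) = -\sum_{i=1}^{N} \log p_{\theta}(y^{(i)} \mid x^{(i)}) + \rho_1 \|\theta\|_1
</math>

The <math>\ell_1</math> term drives many feature weights to exactly zero, so only the surviving features need to be stored and evaluated, which is where the compactness comes from.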

== Implementation Issues with Large Scale CRFs ==

The paper goes into detail about several practical issues that arise when implementing CRFs on a large scale:

* As a generally accepted practice when using the [[UsesMethod::Forward-Backward|Forward-Backward algorithm]] or any large probabilistic chain, people tend to work in the log-domain to guarantee avoidance of over/underflow. This causes a massive slowdown due to repeated calls to <math>\operatorname{exp}</math> and <math>\operatorname{log}</math>. They work around this by normalizing the <math>\alpha</math> and <math>\beta</math> recursions and vectorizing the <math>\operatorname{exp}</math> calls (a small sketch of this scaling appears after this list).
* Computing the gradient often requires the most computation time. They resolve this by doing a few things:
** Using a sparse matrix <math>M</math>, the computation time becomes proportional to the average number of active features (as opposed to being proportional to the total number of possible features).
** Parallelizing Forward-Backward.
** In addition, [[UsesMethod::Block Coordinate Descent|Block Coordinate Descent]] has the property of being able to truncate the forward and backward probabilities after a certain point, which speeds this up immensely.
* For large-scale feature vectors (which can run to billions of features), memory begins to become an issue.
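
To make the first bullet concrete, below is a minimal sketch of a scaled forward pass for a linear-chain model (a NumPy-based illustration under assumed conventions, not the authors' implementation): each <math>\alpha_t</math> is renormalized and the log of the scale factor is accumulated, so <math>\operatorname{exp}</math> is applied once per position over a whole array instead of doing entry-by-entry log-domain arithmetic.

<pre>
import numpy as np

def scaled_forward(start_scores, trans_scores):
    """Scaled forward pass for a linear-chain model (illustrative sketch only).

    start_scores: shape (Y,)        -- log-scores for the label at position 0.
    trans_scores: shape (T-1, Y, Y) -- log-scores where entry [t, i, j] scores
                  label i at position t followed by label j at position t+1
                  (emission scores assumed to be folded in).

    Returns the normalized forward vectors alpha (T, Y) and log Z.
    """
    n_trans, n_labels, _ = trans_scores.shape
    alpha = np.zeros((n_trans + 1, n_labels))
    log_z = 0.0

    # Position 0: one vectorized exp, then normalize and remember the scale.
    shift = start_scores.max()
    a = np.exp(start_scores - shift)
    scale = a.sum()
    alpha[0] = a / scale
    log_z += np.log(scale) + shift

    for t in range(n_trans):
        shift = trans_scores[t].max()
        m = np.exp(trans_scores[t] - shift)   # (Y, Y): one exp per position
        a = alpha[t] @ m                      # unnormalized alpha_{t+1}
        scale = a.sum()
        alpha[t + 1] = a / scale              # keep alphas well-scaled
        log_z += np.log(scale) + shift        # accumulate log Z piecewise

    return alpha, log_z

# Tiny usage example with random scores: 6 positions, 5 labels.
rng = np.random.default_rng(0)
alphas, log_z = scaled_forward(rng.normal(size=5), rng.normal(size=(5, 5, 5)))
</pre>

The backward recursion can be scaled with the same per-position factors, and the accumulated scales recover <math>\log Z</math> without ever leaving the normalized domain.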

== Experiments ==

The paper runs experiments on two domains: Part of Speech Tagging and Phonetization. Since runtime performance is also important, they run all of the processes on the same machine, using the same frameworks, to allow for a fair comparison.

=== Phonetization ===

They use [[UsesDataset::Nettalk]], which contains ~20k English words along with their pronunciations and some prosodic information.

=== Part of Speech Tagging ===

They use the [[UsesDataset::Penn TreeBank]] for training and the [[UsesDataset::WSJ|Wall Street Journal]] for testing, using various ''n''-gram features, occasionally augmented with windows (shown by the ''n''-gram+ and ''n''-gram++ cases).
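
To give a flavor of what such feature templates can look like, here is a hypothetical sketch (the function <code>ngram_features</code>, the window encoding, and the feature strings are assumptions for illustration, not the paper's exact templates):

<pre>
def ngram_features(words, t, window=0):
    """Hypothetical n-gram/window feature templates for position t.

    A 1-gram template looks only at the current word; adding a window
    (in the spirit of the "+"/"++" variants) conjoins neighboring words
    with it, which grows the feature space very quickly.
    """
    feats = [f"w0={words[t]}"]                         # current word (1-gram)
    for k in range(1, window + 1):
        prev_w = words[t - k] if t - k >= 0 else "<s>"
        next_w = words[t + k] if t + k < len(words) else "</s>"
        feats.append(f"w-{k}={prev_w}")                # left context word
        feats.append(f"w+{k}={next_w}")                # right context word
        feats.append(f"w-{k}^w0={prev_w}^{words[t]}")  # conjunction feature
    return feats

# Example: templates for "saw" with a one-word window.
print(ngram_features(["the", "cat", "saw", "a", "dog"], 2, window=1))
</pre>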

==== Results ====

Reg. is the regularized training method used. Method is the feature set used (which ''n''-gram features are included). # Feat. is the number of features actually selected, not the size of the total possible feature space.

{| class="wikitable" border="1"
! Reg. !! Method !! Iter. !! # Feat. !! Error !! Time
|-
| [[UsesMethod::Quasi Newton|OWL-QN]] || 1-gram || 63.4 || 4684 || 17.79% || 11min
|-
| [[UsesMethod::Quasi Newton|OWL-QN]] || 7-gram || 140.2 || 38214 || 8.12% || 1h02min
|-
| [[UsesMethod::Quasi Newton|OWL-QN]] || 5-gram+ || 141.0 || 43429 || 7.89% || 1h37min
|-
| [[UsesMethod::Stochastic Gradient Descent|SGD]] || 1-gram || 21.4 || 3540 || 18.21% || 9min
|-
| [[UsesMethod::Stochastic Gradient Descent|SGD]] || 5-gram+ || 28.5 || 34319 || 8.01% || 45min
|-
| [[UsesMethod::Block Coordinate Descent|BCD]] || 1-gram || 28.2 || 5017 || 18.27% || 27min
|-
| [[UsesMethod::Block Coordinate Descent|BCD]] || 7-gram || 9.2 || 3692 || 8.21% || 1h22min
|-
| [[UsesMethod::Block Coordinate Descent|BCD]] || 5-gram+ || 8.7 || 47675 || 7.91% || 2h18min
|}