== Citation ==
 
{{MyCitejournal | coauthors = G. Hao, A. Ashenfelter| date = 2008| first = T. G.| journal = Journal of Machine Learning Research| last = Dietterich| pages = 2113-2139| title = Gradient Tree Boosting for Training Conditional Random Fields| url = http://jmlr.csail.mit.edu/papers/volume9/dietterich08a/dietterich08a.pdf| volume = 9 }}
 
== Online Version ==
  
 
This [[Category::Paper]] is available online [http://jmlr.csail.mit.edu/papers/volume9/dietterich08a/dietterich08a.pdf].
 
== Summary ==

The paper addresses the combinatorial explosion in the number of parameters of [[UsesMethod::Conditional Random Fields | CRFs]] when new features are introduced. It represents the potential functions as sums of regression trees. The authors argue that adding a regression tree amounts to taking a large step in feature space, which reduces the number of training iterations required and leads to a significant improvement in training speed.
== Method ==

In traditional gradient ascent, the potential function <math>\Psi</math> is represented as a linear function of a fixed set of parameters. The parameter vector after <math>m</math> steps is then

<math> \Theta_m = \Theta_0 + \delta_1 + \cdots + \delta_m </math>.
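
Here each <math>\delta_m</math> is a step in the direction of the gradient of the expected log-likelihood with respect to the parameters. Written out (with <math>\eta_m</math> as a step size, a notation introduced here rather than taken from the paper), it makes the parallel with the functional version below explicit:

<math>\delta_m = \eta_m \, E_{\vec{x}, y} \left[ \nabla_\Theta \log P(y | \vec{x}; \Theta) \vert_{\Theta_{m-1}} \right]</math>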

Instead of parameterizing <math>\Psi</math> directly, functional gradient ascent assumes that <math>\Psi</math> is a weighted sum of functions:

<math>\Psi_m = \Psi_0 + \Delta_1 + \cdots + \Delta_m</math>, where

<math>\Delta_m = E_{\vec{x}, y} \left[ \nabla_\Psi \log P(y | \vec{x}; \Psi) \vert_{\Psi_{m-1}} \right]</math>

Since the joint distribution of <math>(\vec{x}, y)</math> is not known, <math>\Delta_m</math> is estimated empirically from the training samples (which are assumed to be drawn from that joint distribution). A regression tree <math>h_m</math> is then trained to fit this empirical functional gradient by minimizing the squared-error loss. Overfitting is prevented by a shrinkage parameter rather than by bounding the number of leaf nodes of the trees.
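
As an illustration of the boosting step, here is a minimal sketch in Python of functional gradient tree boosting, simplified to a per-position multiclass model with no transition potentials and no forward-backward computation (so it is not the paper's full TreeCRF algorithm). It uses scikit-learn's <code>DecisionTreeRegressor</code>, and all function and parameter names are invented for this example:

<pre>
# Sketch: functional gradient tree boosting for per-position label potentials.
# Simplification of the paper's setting: labels are treated as independent given
# their feature vectors, so P(y = k | x) is just a softmax over the class potentials.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_boosted_potentials(X, y, n_classes, n_iters=50, shrinkage=0.5, max_depth=4):
    """X: (n, d) feature matrix; y: (n,) integer labels in [0, n_classes)."""
    n = X.shape[0]
    trees = [[] for _ in range(n_classes)]   # Psi^k represented as a sum of trees
    F = np.zeros((n, n_classes))             # current potential values on the training set

    for _ in range(n_iters):
        # Current model probabilities P(y = k | x_i) from the potentials.
        P = np.exp(F - F.max(axis=1, keepdims=True))
        P /= P.sum(axis=1, keepdims=True)

        for k in range(n_classes):
            # Pointwise functional gradient of the log-likelihood w.r.t. Psi^k(x_i):
            # observed indicator minus current model probability.
            residual = (y == k).astype(float) - P[:, k]
            tree = DecisionTreeRegressor(max_depth=max_depth)
            tree.fit(X, residual)                    # least-squares fit to the gradient
            trees[k].append(tree)
            F[:, k] += shrinkage * tree.predict(X)   # shrunken functional gradient step
    return trees

def predict(trees, X):
    """Predict by taking the argmax over the summed tree outputs for each class."""
    scores = np.stack([sum(t.predict(X) for t in per_class) for per_class in trees], axis=1)
    return scores.argmax(axis=1)
</pre>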

The authors then define the "desirability" of a particular label <math>y_t</math>, given <math>y_{t-1}</math> and <math>X</math>, as

<math>F^{y_t}(y_{t-1}, X) = \Psi(y_t, X) + \Psi(y_t, y_{t-1}, X),</math>

and derive the functional gradient of the log-likelihood with respect to <math>F^{y_t}(y_{t-1}, X)</math>; this gradient is what the regression trees are fit to in the full sequence model.
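
As a sketch of what this gradient looks like, applying the standard observed-minus-expected form of the CRF gradient pointwise (the notation here is a paraphrase rather than the paper's exact statement), the functional gradient of the log-likelihood with respect to the desirability of a transition from label <math>u</math> to label <math>v</math> at position <math>t</math> is

<math>I(y_{t-1} = u, y_t = v) - P(y_{t-1} = u, y_t = v | \vec{x}),</math>

i.e., the indicator of whether that transition actually occurs in the training sequence minus the model's current pairwise marginal, the latter computable with the forward-backward algorithm.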
== Evaluation ==

The authors evaluated their trained CRFs against CRFs trained with Mallet, comparing both prediction accuracy and CPU time.

=== Datasets ===

* [[UsesDataset::Protein Secondary Structure Benchmark (Qian and Sejnowski, 1988)]]
* [[UsesDataset::NETtalk | NETtalk Data Set]]
* [[UsesDataset::Hyphenation Data Set]]
* [[UsesDataset::Usenet FAQs Data Sets]]
=== Summarized Results ===

The difference in prediction accuracy between the proposed method and Mallet is not statistically significant on any dataset except the FAQ ai-general data set, where the proposed method is more accurate, and the NETtalk data set, where Mallet is more accurate.

In terms of speed, the authors compared how CPU time scales with the number of features and with sequence length. The method scales poorly for long sequences, performs about as well as Mallet for short sequences with few features, and outperforms Mallet when the number of features is large.
  
 
== Reviews of this paper ==
 
{{#ask: [[reviewed paper::dietterich_2008_gradient_tree_boosting_for_training_conditional_random_fields]] | ?reviewer}}
 