Dietterich 2008 gradient tree boosting for training conditional random fields
Citation
Gradient Tree Boosting for Training Conditional Random Fields. By T. G. Dietterich, G. Hao, and A. Ashenfelter. In Journal of Machine Learning Research, vol. 9, 2008.
Online Version
This paper is available online [1].
Summary
The paper addresses the combinatorial explosion in the number of CRF parameters that occurs when new features are introduced. It represents the potential functions as sums of regression trees. The authors claim that adding a regression tree amounts to taking a large step in feature space, which reduces the number of training iterations needed and leads to a significant performance improvement.
Method
In a traditional gradient descent method, the function <math>\Psi</math> is represented as a linear function of parameters. Thus, the parameter values after <math>m</math> steps would be <math>\Theta_m = \Theta_0 + \delta_1 + \cdots + \delta_m</math>. Instead of parameterizing <math>\Psi</math>, functional gradient ascent assumes that <math>\Psi</math> is a weighted sum of functions, <math>\Psi_m = \Psi_0 + \Delta_1 + \cdots + \Delta_m</math>, where

<math>\Delta_m = E_{\vec{x}, y} \left[ \nabla_\Psi \log P(y \mid \vec{x}; \Psi) \big\vert_{\Psi_{m-1}} \right].</math>
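Concretely, with <math>N</math> training pairs <math>(\vec{x}_i, y_i)</math> (notation introduced here for illustration, not taken from the paper excerpt), this expectation is approximated by the sample average described in the next paragraph:

<math>\hat{\Delta}_m = \frac{1}{N} \sum_{i=1}^{N} \nabla_\Psi \log P(y_i \mid \vec{x}_i; \Psi) \Big\vert_{\Psi_{m-1}}.</math>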
Since the joint distribution of <math>\vec{x}</math> and <math>y</math> is not known, <math>\Delta_m</math> is estimated empirically in this way from the training samples (which are assumed to be drawn from that joint distribution). A regression tree <math>h_m</math> is then trained to minimize the squared-error loss against the values of the estimated gradient. Overfitting is prevented by using a shrinkage parameter instead of bounding the number of leaf nodes of the tree.
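The following is a minimal sketch of this loop in Python, using a binary logistic log-likelihood as a simpler stand-in for the CRF potential functions treated in the paper; the function names, defaults, and the use of scikit-learn's DecisionTreeRegressor are illustrative assumptions, not the paper's TreeCRF implementation.

<pre>
# Minimal sketch of functional gradient ascent with regression trees.
# A binary logistic log-likelihood stands in for the CRF potentials of
# the paper; all names and defaults here are illustrative assumptions.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_boosted_score(X, y, n_stages=50, shrinkage=0.1, max_depth=3):
    """Learn Psi(x) = sum_m shrinkage * h_m(x) by functional gradient ascent
    on the log-likelihood sum_i [ y_i * Psi(x_i) - log(1 + exp(Psi(x_i))) ]."""
    y = np.asarray(y, dtype=float)
    trees = []
    F = np.zeros(len(y))                       # Psi_0 = 0 on the training set
    for _ in range(n_stages):
        # Pointwise functional gradient of the log-likelihood at Psi_{m-1}:
        # d/dPsi(x_i) log P(y_i | x_i; Psi) = y_i - sigmoid(Psi(x_i)).
        residuals = y - 1.0 / (1.0 + np.exp(-F))
        # Delta_m is approximated by a regression tree fit with squared error.
        h = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)
        trees.append(h)
        # Small step in function space; the shrinkage factor does the
        # regularizing (max_depth is only a convenience in this sketch).
        F += shrinkage * h.predict(X)
    return trees

def predict_score(trees, X, shrinkage=0.1):
    """Evaluate the learned sum of regression trees on new inputs."""
    return shrinkage * sum(h.predict(X) for h in trees)
</pre>

Here the shrinkage factor plays the role of the step size: each added tree moves <math>\Psi</math> a small amount in the direction of the estimated functional gradient.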