Gradient Boosted Decision Tree

From Cohen Courses

GBDT is an additive regression algorithm consisting of an ensemble of trees, each fitted to the current residuals, i.e. the gradients of the loss function, in a forward stage-wise manner. It iteratively fits an additive model as

<math>f_{t}(x)=f_{t-1}(x)+\lambda\beta_{t}T_{t}(x;\Theta_{t}),</math> which after <math>T</math> iterations gives <math>f_{T}(x)=\lambda\overset{T}{\underset{t=1}{\sum}}\beta_{t}T_{t}(x;\Theta_{t})</math>

such that a certain loss function <math>L(y_{i}, f_{T}(x_{i}))</math> is minimized, where <math>T_{t}(x;\Theta_{t})</math> is a tree at iteration <math>t</math>, weighted by the parameter <math>\beta_{t}</math> and having a finite number of parameters <math>\Theta_{t}</math>, and <math>\lambda</math> is the learning rate. At iteration <math>t</math>, the tree <math>T_{t}(x;\Theta_{t})</math> is induced to fit the negative gradient by least squares, that is,

<math>\hat{\Theta}_{t}:=\operatorname{argmin}_{\Theta}\overset{N}{\underset{i}{\sum}}\left(-G_{it}-\beta_{t}T_{t}(x_{i};\Theta)\right)^{2}</math>

where <math>G_{it}=\left[\frac{\partial L(y_{i},f(x_{i}))}{\partial f(x_{i})}\right]_{f=f_{t-1}}</math> is the gradient of the loss with respect to the current prediction function.
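
To make the tree-induction step concrete, the following is a minimal sketch of fitting <math>T_{t}</math> to the negative gradient by least squares, assuming squared-error loss (for which <math>-G_{it}</math> is simply the residual <math>y_{i}-f_{t-1}(x_{i})</math>) and using scikit-learn's DecisionTreeRegressor as an illustrative tree learner; the function name and parameters below are choices made for the example, not part of the source.

<pre>
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_tree_to_negative_gradient(X, y, f_prev, max_depth=3):
    """Induce a regression tree T_t on the negative gradient -G_it by least squares."""
    # For squared-error loss L(y, f) = (y - f)^2 / 2, the negative gradient
    # evaluated at f = f_{t-1} is just the residual y - f_{t-1}(x).
    neg_gradient = y - f_prev
    tree = DecisionTreeRegressor(max_depth=max_depth)
    tree.fit(X, neg_gradient)  # least-squares fit to -G_it
    return tree
</pre>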

The optimal weights of the trees, <math>\beta_{t}</math>, are determined by

<math>\beta_{t}=\operatorname{argmin}_{\beta}\overset{N}{\underset{i}{\sum}}L(y_{i},f_{t-1}(x_{i})+\beta T_{t}(x_{i};\Theta_{t}))</math>
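
As an illustration of this line search, the sketch below finds <math>\beta_{t}</math> numerically for an arbitrary pointwise loss. The use of scipy's minimize_scalar and the helper's name are assumptions made for the example, not taken from the source.

<pre>
import numpy as np
from scipy.optimize import minimize_scalar

def optimal_tree_weight(loss, y, f_prev, tree_pred):
    """Line search for beta_t = argmin_beta sum_i L(y_i, f_{t-1}(x_i) + beta * T_t(x_i))."""
    def objective(beta):
        return np.sum(loss(y, f_prev + beta * tree_pred))
    return minimize_scalar(objective).x

# Example: squared-error loss (for which the minimiser also has a closed form).
def squared_loss(y, f):
    return 0.5 * (y - f) ** 2
</pre>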

''Source: [[Dong et al WWW 2010]]''
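
Tying the steps above together, here is a minimal end-to-end training sketch assuming squared-error loss, with the learning rate <math>\lambda</math> applied as shrinkage on each new tree. It illustrates the update rules described on this page and is not the implementation used by [[Dong et al WWW 2010]].

<pre>
import numpy as np
from scipy.optimize import minimize_scalar
from sklearn.tree import DecisionTreeRegressor

def gbdt_fit(X, y, n_trees=100, learning_rate=0.1, max_depth=3):
    """Minimal GBDT sketch for squared-error loss L(y, f) = (y - f)^2 / 2."""
    f = np.full(len(y), y.mean())              # f_0: constant initial prediction
    trees, betas = [], []
    for t in range(n_trees):
        neg_grad = y - f                       # -G_it for squared-error loss
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, neg_grad)
        pred = tree.predict(X)
        # Line search for the tree weight beta_t.
        beta = minimize_scalar(lambda b: np.sum(0.5 * (y - (f + b * pred)) ** 2)).x
        f = f + learning_rate * beta * pred    # f_t = f_{t-1} + lambda * beta_t * T_t
        trees.append(tree)
        betas.append(beta)
    return y.mean(), trees, betas

def gbdt_predict(X, init, trees, betas, learning_rate=0.1):
    """Evaluate f_T(x) by summing the shrunken, weighted tree predictions."""
    f = np.full(X.shape[0], init)
    for tree, beta in zip(trees, betas):
        f = f + learning_rate * beta * tree.predict(X)
    return f
</pre>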