Difference between revisions of "Gradient Boosted Decision Tree"
From Cohen Courses
Jump to navigationJump to search(One intermediate revision by the same user not shown) | |||
Line 3: | Line 3: | ||
<math>f_{t}(x)=T_{t}(x;\Theta)+\lambda\overset{T}{\underset{t=1}{\sum}}\beta_{t}T_{t}(x;\Theta_{t})</math> | <math>f_{t}(x)=T_{t}(x;\Theta)+\lambda\overset{T}{\underset{t=1}{\sum}}\beta_{t}T_{t}(x;\Theta_{t})</math> | ||
− | such that a certain loss function <math>L(y_{i}, f_{T} (x+i))</math> is minimized, where <math>T_{t}(x;\Theta_{t})</math> is a tree at iteration <math>t</math>, weighted by parameter <math>\ | + | such that a certain loss function <math>L(y_{i}, f_{T} (x+i))</math> is minimized, where <math>T_{t}(x;\Theta_{t})</math> is a tree at iteration <math>t</math>, weighted by parameter <math>\beta_{t}</math>, with a finite number of parameters, <math>\Theta_{t}</math> and <math>\lambda</math> is the learning rate. At iteration <math>t</math>, tree <math>T_{t}(x;\sigma)</math> is induced to fit the negative gradient by least squares. That is |
<math>\hat{\Theta}:=\operatorname{argmin}_{\beta}\overset{N}{\underset{i}{\sum}}(-G_{it}-\beta_{t}T_{t}(x_{i});\Theta)^{2}</math> | <math>\hat{\Theta}:=\operatorname{argmin}_{\beta}\overset{N}{\underset{i}{\sum}}(-G_{it}-\beta_{t}T_{t}(x_{i});\Theta)^{2}</math> | ||
Line 15: | Line 15: | ||
<math>\beta_{t}=\operatorname{argmin}_{\beta}\overset{N}{\underset{i}{\sum}}L(y_{i},f_{t-1}(x_{i})+\beta T(x_{i},\theta))</math> | <math>\beta_{t}=\operatorname{argmin}_{\beta}\overset{N}{\underset{i}{\sum}}L(y_{i},f_{t-1}(x_{i})+\beta T(x_{i},\theta))</math> | ||
− | + | ''Source: [[Dong et al WWW 2010]]'' |
Latest revision as of 12:01, 29 March 2011
GBDT is an additive regression algorithm consisting of an ensemble of trees, fitted to current residuals, gradients of the loss function, in a forward step-wise manner. It iteratively fits an additive model as
such that a certain loss function is minimized, where is a tree at iteration , weighted by parameter , with a finite number of parameters, and is the learning rate. At iteration , tree is induced to fit the negative gradient by least squares. That is
where is the gradient over current prediction function
The optimal weights of trees are determined by
Source: Dong et al WWW 2010