L-BFGS
=== Citation ===
{{MyCiteconference | booktitle = Journal of Optimization Theory and Applications | coauthors = | date = 1985 | first = D.F. | last = Shanno | title = On the Broyden-Fletcher-Goldfarb-Shanno Method}}

=== Summary ===

The L-BFGS algorithm is an optimization [[Category::Method | method]] in the family of techniques known as "quasi-Newton" methods. The "L" stands for "Limited" (in the limited-memory sense, not that the method is in any way limited), and BFGS are the initials of the people who devised the original (non-limited-memory) algorithm: Broyden, Fletcher, Goldfarb, and Shanno. Here is a picture of all of them:

[[File:bfgscharacters.png]]
In general, whereas bundle methods such as the cutting-plane algorithm lower-bound the objective function using linear cuts built from gradients, quasi-Newton algorithms use the gradients to estimate the Hessian (the square matrix of second-order partial derivatives; think of it as the multivariate generalization of the second derivative). With the gradient and an approximate Hessian, we can build a quadratic approximation of the objective function.
=== Details ===

Standard BFGS approximates the objective function with a local quadratic model:
<math> m_t(w) = J(w_t) + \langle \nabla J(w_t), w-w_t \rangle + \frac{1}{2} (w-w_t)^T H_t (w-w_t) </math>
where <math>J(w_t)</math> is the objective function, <math>\nabla J(w_t)</math> is the gradient, and <math>H_t</math> is the Hessian.
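A minimal numerical sketch of this quadratic model (not from the original page), assuming NumPy and the illustrative objective <math>J(w) = \|w\|^2</math>, whose Hessian is the constant matrix <math>2I</math>, so the model is exact at every point:

```python
import numpy as np

def quadratic_model(w, w_t, J_t, grad_t, H_t):
    """Evaluate the local quadratic model m_t(w) around w_t."""
    d = w - w_t
    return J_t + grad_t @ d + 0.5 * d @ H_t @ d

# Illustrative objective: J(w) = ||w||^2, gradient 2w, Hessian 2I.
w_t = np.array([1.0, 2.0])
J_t = w_t @ w_t
grad_t = 2 * w_t
H_t = 2 * np.eye(2)

# Because J itself is quadratic, the model reproduces J exactly.
w = np.array([0.5, -1.0])
print(quadratic_model(w, w_t, J_t, grad_t, H_t))  # 1.25, which equals ||w||^2
```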
We update the parameters to the weight that minimizes the quadratic approximation:
<math>w_{t+1} = \arg \min_w J(w_t) + \langle \nabla J(w_t), w-w_t \rangle + \frac{1}{2} (w-w_t)^T H_t (w-w_t) </math>

In quasi-Newton methods we use a Newton update; in other words, the Hessian and the gradient together determine the direction of the update:
<math>w_{t+1} = w_t -\eta_t H_t^{-1} \nabla J(w_t)</math>, where <math>\eta_t</math> is the step size, usually found by a line search. The line search must satisfy the Wolfe conditions, which intuitively ensure a sufficient decrease in the objective function while also respecting a curvature condition.
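A hedged sketch of such a line search using SciPy's <code>scipy.optimize.line_search</code>, which enforces the Wolfe conditions. The objective here is an illustrative convex quadratic (not from the page), and the steepest-descent direction stands in for the quasi-Newton direction <math>-H_t^{-1} \nabla J(w_t)</math>:

```python
import numpy as np
from scipy.optimize import line_search  # enforces the Wolfe conditions

A = np.diag([1.0, 10.0])   # illustrative positive-definite quadratic

def J(w):
    return 0.5 * w @ A @ w

def grad_J(w):
    return A @ w

w_t = np.array([1.0, 1.0])
direction = -grad_J(w_t)   # descent direction (steepest descent here)
eta, *_ = line_search(J, grad_J, w_t, direction)
w_next = w_t + eta * direction
print(eta, J(w_next) < J(w_t))  # a Wolfe step size; the objective decreases
```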
The exact Hessian is very expensive to compute, and the full exact Hessian is even more expensive to invert. Instead, we build an iterative approximation to the inverse Hessian; call it <math>B</math>. Then
<math> B_{t+1} = \arg \min_B || B - B_t||_W \ \textrm{ s.t. } \ s_t = By_t</math>, where <math>y_t = \nabla J(w_{t+1}) - \nabla J (w_t) </math> is the gradient difference between iterations and <math>s_t = w_{t+1} - w_t</math> is the parameter difference between iterations (the constraint is the secant equation). After a bit of algebra, this yields the following update (<math>I</math> is the identity matrix):
<math>B_{t+1} = \left(I - \frac{s_t y_t^T}{\langle s_t, y_t \rangle}\right)B_t \left(I - \frac{y_t s_t^T}{\langle s_t, y_t \rangle}\right) + \frac{s_t s_t^T}{\langle s_t, y_t \rangle}</math>
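This update is direct to write down in NumPy (the matrices and vectors below are illustrative); the check at the end confirms that the updated matrix satisfies the secant equation <math>B_{t+1} y_t = s_t</math>:

```python
import numpy as np

def bfgs_inverse_update(B, s, y):
    """One BFGS update of the inverse-Hessian approximation B,
    using (I - rho s y^T) B (I - rho y s^T) + rho s s^T."""
    rho = 1.0 / (s @ y)
    V = np.eye(len(s)) - rho * np.outer(s, y)
    return V @ B @ V.T + rho * np.outer(s, s)

rng = np.random.default_rng(0)
B = np.eye(3)
s = rng.standard_normal(3)
y = s + 0.1 * rng.standard_normal(3)   # keep <s, y> > 0, as Wolfe steps guarantee
B_new = bfgs_inverse_update(B, s, y)
print(np.allclose(B_new @ y, s))       # secant equation holds: B_{t+1} y_t = s_t
```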
The above describes the BFGS method. ''' L-BFGS is essentially the same, except that we never store <math>B</math> explicitly: we keep only the last few <math>(s_t, y_t)</math> pairs and use them as a low-rank, limited-memory representation of <math>B</math>. '''
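The standard way to apply this limited-memory representation is the two-loop recursion, which computes <math>B_t \nabla J(w_t)</math> from the stored pairs without ever forming a matrix. A sketch (function names are my own; the <math>\gamma I</math> initial scaling is a common heuristic, an assumption rather than part of the page):

```python
import numpy as np

def lbfgs_direction(grad, s_hist, y_hist):
    """Two-loop recursion: computes B_t @ grad implicitly from the
    m most recent (s, y) pairs, without forming B_t."""
    q = grad.copy()
    alphas = []                                   # stored newest-first
    for s, y in reversed(list(zip(s_hist, y_hist))):
        a = (s @ q) / (s @ y)
        alphas.append(a)
        q -= a * y
    s, y = s_hist[-1], y_hist[-1]
    q *= (s @ y) / (y @ y)                        # initial scaling B_0 = gamma I
    for (s, y), a in zip(zip(s_hist, y_hist), reversed(alphas)):
        b = (y @ q) / (s @ y)
        q += (a - b) * s
    return q                                      # the step direction is -q

rng = np.random.default_rng(1)
dim, m = 4, 3
s_hist = [rng.standard_normal(dim) for _ in range(m)]
y_hist = [s + 0.1 * rng.standard_normal(dim) for s in s_hist]
g = rng.standard_normal(dim)
d = lbfgs_direction(g, s_hist, y_hist)
print(d.shape)   # (4,) — a direction, computed from m pairs, no matrix stored
```
With a single stored pair, the recursion reproduces the explicit BFGS update of <math>\gamma I</math> exactly, which is a convenient correctness check.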
For non-smooth objective functions, a few modifications are needed. First, the local quadratic approximation is no longer well defined. Second, since not every subgradient is a descent direction, the descent direction (which we formulated as <math> -B_t \nabla J(w_t)</math>) is also not well defined. We adjust the quadratic approximation to take a supremum over the subdifferential:
<math> m_t(w) = \sup_{s \in \partial J (w_t)} \{ J(w_t) + \langle s, w-w_t \rangle + \frac{1}{2} (w-w_t)^T H_t (w-w_t)\}</math>. We handle the supremum by moving a term from the objective into a constraint, yielding a constrained optimization problem for the weight update:

<math> w_{t+1} = \arg \min_w \frac{1}{2}(w-w_t)^T H_t (w-w_t) + \xi</math> such that <math>J(w_t) + \langle s_i, w-w_t \rangle \leq \xi \textrm{ for } s_1, \dots, s_k \in \partial J(w_t) </math>. Note that we sample ''k'' subgradients from the subdifferential (which is, incidentally, a convex set).
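A small sketch of this constrained problem for the toy non-smooth objective <math>J(w) = |w|</math> at <math>w_t = 0</math>, where the subdifferential is the interval <math>[-1, 1]</math>. The choice of sampled subgradients and <math>H_t = I</math> are illustrative assumptions, and the solver (SciPy's SLSQP) is simply a generic constrained optimizer standing in for a dedicated QP solver:

```python
import numpy as np
from scipy.optimize import minimize

# Non-smooth toy objective J(w) = |w|, linearized at w_t = 0,
# where the subdifferential is [-1, 1].
w_t = 0.0
subgrads = [-1.0, 0.0, 1.0]      # k = 3 sampled subgradients

def objective(z):                # z = (w, xi); H_t taken to be the identity
    w, xi = z
    return 0.5 * (w - w_t) ** 2 + xi

# Constraints J(w_t) + s_i (w - w_t) <= xi, with J(w_t) = 0;
# SciPy's "ineq" convention means fun(z) >= 0.
constraints = [
    {"type": "ineq", "fun": (lambda z, s=s: z[1] - s * (z[0] - w_t))}
    for s in subgrads
]

res = minimize(objective, x0=[0.5, 1.0], constraints=constraints)
print(res.x)   # close to (0, 0): the update stays at the minimizer of |w|
```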
=== Related Work ===

Many papers use this method. Some examples:
* [[A_Discriminative_Latent_Variable_Model_for_SMT | Blunsom et al., ACL 2008: A Discriminative Latent Variable Model for SMT ]]
* [[Sha_2003_Shallow_Parsing_with_Conditional_Random_Fields | Sha & Pereira, NAACL 2003: Shallow Parsing with Conditional Random Fields ]]
* [[Smith_and_Eisner_2005:Contrastive_Estimation:_Training_Log-Linear_Models_on_Unlabeled_Data | Smith & Eisner, ACL 2005: Contrastive Estimation: Training Log-Linear Models on Unlabeled Data ]]
Latest revision as of 16:55, 31 October 2011