Difference between revisions of "Margin Infused Relaxed Algorithm"
(98 intermediate revisions by the same user not shown) | |||
Line 1: | Line 1: | ||
− | This [[Category::method]] is used by [[RelatedPaper::Watanabe_et_al.,_EMNLP_2007._Online_Large-Margin_Training_for_Statistical_Machine_Translation|Watanabe et al., EMNLP 2007]] to train an MT system a with a very large number of features of the order of millions. The training step was performed using a specific algorithm called the Margin Infused Relaxed Algorithm (MIRA) proposed by [[RelatedPaper::Koby Crammer, Ofer Dekel, Joseph Keshet, Shai Shalev-Shwartz, Yoram Singer. Online Passive-Aggressive Algorithms. JMLR 7(Mar):551--585, 2006. | Crammer et al., 2006]] | + | This [[Category::method]] is used by [[RelatedPaper::Watanabe_et_al.,_EMNLP_2007._Online_Large-Margin_Training_for_Statistical_Machine_Translation | Watanabe et al., EMNLP 2007]] to train an MT system a with a very large number of features of the order of millions. The training step was performed using a specific algorithm called the Margin Infused Relaxed Algorithm (MIRA) proposed by [[RelatedPaper::Koby Crammer, Ofer Dekel, Joseph Keshet, Shai Shalev-Shwartz, Yoram Singer. Online Passive-Aggressive Algorithms. JMLR 7(Mar):551--585, 2006. | Crammer et al., 2006]]. |
== Summary == | == Summary == | ||
− | MIRA is an online large-margin training algorithm which updates the weight vector according to certain margin constraints and loss function | + | MIRA is an online large-margin training algorithm which updates the weight vector according to certain margin constraints and loss function. |
− | == Definition == | + | It is used to learn the weights of features after processing each training instance similar to the structured perceptron algorithm with an additional loss function and margin constraint in its update rule. |
+ | |||
+ | == General Definition == | ||
+ | A general definition of online training algorithms can be written down as follows: | ||
+ | |||
+ | [[File:Online-training.jpg|center|400px]] | ||
+ | |||
+ | * <math>\forall t, \quad (f^{t}, \mathbf{e}^t) \in \mathcal{T} </math> and a list of <math> m </math> -best oracles <math> \mathcal{O}^t </math>. A <math>k</math> -best list of candidates is generated by <math> best_k(\cdot) </math> using the current weight vector <math>\mathbf{w}_i</math>. Each training instance <math>f_{t}</math> can have a multiple number of correct outputs or references, <math>\mathbf{e}^t</math>, in this case, target translations. | ||
+ | |||
+ | * Using the <math>k</math>-best list, <math>m</math>-best oracle translations <math>\mathcal{O}^t</math> is updated by <math>oracle_{m}(\cdot)</math> in each iteration. | ||
+ | |||
+ | * New weight vector <math>\mathbf{w}^{i+1}</math> is computed using the <math>k</math>-list <math>\mathcal{C}^{t}</math> with respect to the oracle <math>\mathcal{O}^t</math>. | ||
+ | |||
+ | * After <math>N</math> iterations, the algorithm returns an averaged weight vector over <math>T</math> training instances to avoid overfitting. | ||
+ | |||
+ | == MIRA == | ||
+ | The difference in MIRA lies in the weight update rule which differs from one algorithm to the other. It has been widely used in structured classification tasks such as dependency parsing (McDonald et al., 2005) and joint-labeling/chunking (Shimizu and Haas, 2006). | ||
+ | |||
+ | * The basic idea is to keep the norm of the updates to the weight vector as small as possible, | ||
+ | * Considering a margin at least as large as the loss of the incorrect classification. | ||
+ | * The update rule in MIRA is given by: | ||
+ | |||
+ | <math> | ||
+ | \mathbf{\hat{w}}^{i+1} = \underset{w}{\operatorname{argmin}} \quad ||\mathbf{w}^{i+1} - \mathbf{w}^{i}||^2 + C\sum_{\hat{e}, e^\prime}\xi(\hat{e}, e^\prime) | ||
+ | </math> | ||
+ | |||
+ | subject to | ||
+ | |||
+ | <math> | ||
+ | s^{i+1}(f^t, \hat{e}) - s^{i+1}(f^t, e^\prime) + \xi(\hat{e}, e^\prime) \geq L(\hat{e}, e^\prime; \mathbf{e}^t) | ||
+ | </math> | ||
+ | |||
+ | <math> | ||
+ | \xi(\hat{e}, e^\prime) \geq 0 | ||
+ | </math> | ||
+ | |||
+ | <math> | ||
+ | \forall \hat{e} \in \mathcal{O}^t, \quad \forall \hat{e} \in \mathcal{C}^t | ||
+ | </math> | ||
+ | |||
+ | |||
+ | where, | ||
+ | <math> s^i(f^t, e) = \left\{ \mathbf{w}^i \right\}^\mathrm{T} \cdot \mathbf{h}(f^t, e)</math>. | ||
+ | |||
+ | <math>\xi(\cdot) </math> is a nonnegative slack variable and <math> C \geq 0 </math> is a constant to control the influence to the objective function. A larger C implies larger updates to the weight vector. | ||
+ | |||
+ | <math> L(\cdot) </math> is a loss function, that measures the "difference" between <math>\hat{e}</math> and <math>e^\prime</math> according to the set of references <math>\mathbf{e}^t</math>. | ||
+ | |||
+ | |||
+ | A larger error means a larger distance between the scores of the predicted correct and incorrect outputs. | ||
+ | |||
+ | == Dual Form == | ||
+ | The Langrage dual form for MIRA's objective function is given by: | ||
+ | |||
+ | <math> | ||
+ | max_{\alpha(\cdot) \geq 0} \quad -\frac{1}{2}||\sum_{\hat{e}, e^\prime}\alpha(\hat{e}, e^\prime) \left( \mathbf{h}(f^t, \hat{e}) - \mathbf{h}(f^t, e^\prime) \right)||^2 \quad + \quad \sum_{\hat{e}, e^\prime}\alpha(\hat{e}, e^\prime)L(\hat{e}, e^\prime; \mathbf{e}^t) \quad - \quad \sum_{\hat{e}, e^\prime}\alpha(\hat{e}, e^\prime) \left( s^i(f^t, \hat{e}) - s^i(f^t, e^\prime) \right) | ||
+ | </math> | ||
+ | |||
+ | subject to | ||
+ | |||
+ | <math> | ||
+ | \sum_{\hat{e}, e^\prime}\alpha(\hat{e}, e^\prime) \leq C | ||
+ | </math> | ||
+ | |||
+ | with the weight vector update, | ||
+ | |||
+ | <math> | ||
+ | \mathbf{w}^{i+1} = \mathbf{w}^i \; + \; \sum_{\hat{e}, e^\prime}\alpha(\hat{e}, e^\prime) \left( \mathbf{h}(f^t, \hat{e}) - \mathbf{h}(f^t, e^\prime) \right) | ||
+ | </math> | ||
+ | |||
+ | The dual form is solved using a QP-solver, such as a coordinate ascent algorithm, by heuristically selecting <math>(\hat{e}, e^\prime)</math> and by updating <math>\alpha(\cdot)</math> iteratively: | ||
+ | |||
+ | |||
+ | <math> | ||
+ | \alpha(\hat{e}, e^\prime) = max \left\{0, \; (\alpha(\hat{e}, e^\prime) + \delta(\hat{e}, e^\prime)) \right\} | ||
+ | </math> | ||
+ | |||
+ | |||
+ | <math> | ||
+ | \delta(\hat{e}, e^\prime) = \frac{L(\hat{e}, e^\prime; \mathbf{e}^t) - \left( s^i(f^t, \hat{e}) - s^i(f^t, e^\prime) \right)} {||\mathbf{h} (f^t, \hat{e}) - \mathbf{h} (f^t, e^\prime)||^2} | ||
+ | </math> | ||
+ | |||
+ | |||
+ | <math>C</math> is used to clip the amount of updates. | ||
+ | |||
+ | == Related Papers == | ||
+ | [1] [[RelatedPaper::Crammer et al., JMLR, 2006 | Koby Crammer, Ofer Dekel, Joseph Keshet, Shai Shalev-Shwartz, Yoram Singer. Online Passive-Aggressive Algorithms. ''Journal of Machine Learning Research'', 7(Mar):551--585, 2006.]] | ||
+ | |||
+ | [2] [[RelatedPaper::Watanabe_et_al.,_EMNLP_2007._Online_Large-Margin_Training_for_Statistical_Machine_Translation | Taro Watanabe, Jun Suzuki, Hajime Tsukada, Hideki Isozaki. 2007. Online large-margin training for statistical machine translation. In ''Proceedings of EMNLP-CoNLL'', pp 764–773.]] | ||
+ | |||
+ | [3] [[RelatedPaper::Chiang et al., EMNLP 2008 | David Chiang, Yuval Marton and Philip Resnik. 2008. Online Large-Margin Training of Syntactic and Structural Translation Features. In ''Proceedings of EMNLP-2008''.]] |
Latest revision as of 16:44, 29 November 2011
This method is used by Watanabe et al., EMNLP 2007 to train an MT system a with a very large number of features of the order of millions. The training step was performed using a specific algorithm called the Margin Infused Relaxed Algorithm (MIRA) proposed by Crammer et al., 2006.
Summary
MIRA is an online large-margin training algorithm which updates the weight vector according to certain margin constraints and loss function.
It is used to learn the weights of features after processing each training instance similar to the structured perceptron algorithm with an additional loss function and margin constraint in its update rule.
General Definition
A general definition of online training algorithms can be written down as follows:
- and a list of -best oracles . A -best list of candidates is generated by using the current weight vector . Each training instance can have a multiple number of correct outputs or references, , in this case, target translations.
- Using the -best list, -best oracle translations is updated by in each iteration.
- New weight vector is computed using the -list with respect to the oracle .
- After iterations, the algorithm returns an averaged weight vector over training instances to avoid overfitting.
MIRA
The difference in MIRA lies in the weight update rule which differs from one algorithm to the other. It has been widely used in structured classification tasks such as dependency parsing (McDonald et al., 2005) and joint-labeling/chunking (Shimizu and Haas, 2006).
- The basic idea is to keep the norm of the updates to the weight vector as small as possible,
- Considering a margin at least as large as the loss of the incorrect classification.
- The update rule in MIRA is given by:
subject to
where,
.
is a nonnegative slack variable and is a constant to control the influence to the objective function. A larger C implies larger updates to the weight vector.
is a loss function, that measures the "difference" between and according to the set of references .
A larger error means a larger distance between the scores of the predicted correct and incorrect outputs.
Dual Form
The Langrage dual form for MIRA's objective function is given by:
subject to
with the weight vector update,
The dual form is solved using a QP-solver, such as a coordinate ascent algorithm, by heuristically selecting and by updating iteratively:
is used to clip the amount of updates.