{{MyCiteconference | booktitle = Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics | coauthors = T. Zhang | date = 2005| first = R. K.| last = Ando | pages = 1-9| title = A high-performance semi-supervised learning method for text chunking | url = http://acl.ldc.upenn.edu/P/P05/P05-1001.pdf }}

This [[Category::Paper]] is available online [http://acl.ldc.upenn.edu/P/P05/P05-1001.pdf].

== Summary ==

This paper describes an exponentiated gradient (EG) algorithm for training conditional log-linear models. Conditional log-linear models are used for several key structured prediction tasks such as [[AddressesProblem::Named Entity Recognition | NER]], [[AddressesProblem::POS Tagging | POS tagging]], and [[AddressesProblem::Parsing | parsing]].

The authors propose a fast and efficient algorithm for optimizing log-linear models such as [[RelatedPaper::Lafferty_2001_Conditional_Random_Fields | CRFs]].

The conditional log-likelihood of a CRF is commonly optimized with [[conjugate-gradient]] or [[L-BFGS]] algorithms ([[RelatedPaper::Sha_2003_shallow_parsing_with_conditional_random_fields | Sha & Pereira, 2003]]), which typically require at least one pass through the entire dataset before updating the weight vector. The authors' approach is instead an online algorithm based on exponentiated gradient updates ([[RelatedPaper::Kivinen & Warmuth JCSS 1997 | Kivinen & Warmuth, 1997]]).

== Brief description of the method ==

Consider a supervised learning setting with objects <math>x\in\mathcal{X}</math> and corresponding labels <math>y\in\mathcal{Y}</math>, which may be trees, sequences, or other high-dimensional structures. Also assume we are given a function <math>\phi(x,y)</math> that maps <math>(x,y)</math> pairs to feature vectors in <math>\mathbb{R}^d</math>; for example, in POS tagging a single coordinate of <math>\phi(x,y)</math> might count how often a particular word receives a particular tag in <math>y</math>. Given a parameter vector <math>\mathbf{w}\in\mathbb{R}^d</math>, a conditional log-linear model defines a distribution over labels as:

<math>p(y|x;\mathbf{w})=\frac{1}{Z_x}\exp\left(\mathbf{w}\cdot\phi(x,y)\right)</math>

where <math>Z_x</math> is the partition function.

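Written out explicitly, the partition function sums over all candidate labels for <math>x</math>:

<math>Z_x=\sum_{y\in\mathcal{Y}}\exp\left(\mathbf{w}\cdot\phi(x,y)\right)</math>
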
The problem of learning <math>\mathbf{w}</math> from the training data is thus that of finding the <math>\mathbf{w}</math> which maximizes the regularized log-likelihood:

<math>\mathbf{w}^{*}=\arg\max_{\mathbf{w}}\sum_i\log p(y_i|x_i;\mathbf{w})-\frac{C}{2}\lVert\mathbf{w}\rVert^2</math>

where <math>C</math> is the regularization parameter. The above problem has a convex dual, which is derived in [[RelatedPaper::Lebanon and Lafferty NIPS 2001]]. With dual variables <math>\alpha_{i,y}</math>, and <math>\mathbf{\alpha}=[\mathbf{\alpha}_1, \mathbf{\alpha}_2, \cdots, \mathbf{\alpha}_n]</math>, we define:

<math>Q(\mathbf{\alpha})=\sum_i\sum_y \alpha_{i,y}\log\alpha_{i,y}+\frac{1}{2C}\lVert\mathbf{w}(\alpha)\rVert^2</math>

where

<math>\mathbf{w}(\alpha)=\sum_i\sum_y\alpha_{i,y}\left(\phi(x_i,y_i)-\phi(x_i,y)\right)</math>

The dual problem is thus

<math>\alpha^*=\arg\min_{\alpha\in\Delta^n} Q(\alpha)</math>

where <math>\Delta^n</math> denotes the set of dual variables <math>\alpha</math> in which each <math>\alpha_i</math> is a probability distribution over the candidate labels of the <math>i</math>-th training example.

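Solving the dual also recovers the primal solution: at the optimum the dual variables coincide with the model's conditional distributions, <math>\alpha^*_{i,y}=p(y|x_i;\mathbf{w}^*)</math>, and (under the scaling used above) the optimal weights are obtained from the mapping <math>\mathbf{w}(\cdot)</math> as

<math>\mathbf{w}^{*}=\frac{1}{C}\mathbf{w}(\alpha^{*})</math>
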
== EG Algorithm ==

Given a set of distributions <math>\alpha\in\Delta^n</math>, the update equations are

<math>\alpha'_{i,y}=\frac{1}{Z_i}\alpha_{i,y}\exp(-\eta\nabla_{i,y})</math>

where <math>\eta>0</math> is a learning rate,

<math>Z_i=\sum_{\hat{y}}\alpha_{i,\hat{y}}\exp(-\eta\nabla_{i,\hat{y}})</math>

and

<math>\nabla_{i,y}=\frac{\partial Q(\alpha)}{\partial\alpha_{i,y}}=1+\log\alpha_{i,y}+\frac{1}{C}\mathbf{w}(\alpha)\cdot\left(\phi(x_i,y_i)-\phi(x_i,y)\right)</math>

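For concreteness, the per-example update can be written in a few lines of NumPy. This is only a sketch for a flat multiclass setting in which the label set is small enough to enumerate explicitly (the structured case instead requires factored representations of <math>\alpha</math>), and the function and variable names are illustrative rather than taken from the paper:

<pre>
import numpy as np

def eg_update_example(alpha_i, psi_i, w, C, eta):
    """One EG update for training example i (sketch).

    alpha_i : (k,) current dual distribution over the k candidate labels of example i
    psi_i   : (k, d) difference features, psi_i[y] = phi(x_i, y_i) - phi(x_i, y)
    w       : (d,) current w(alpha) = sum over i, y of alpha[i, y] * psi[i, y]
    """
    grad = 1.0 + np.log(alpha_i) + (psi_i @ w) / C   # gradient of Q w.r.t. alpha_i
    unnorm = alpha_i * np.exp(-eta * grad)           # multiplicative (EG) update
    return unnorm / unnorm.sum()                     # normalizing by Z_i
</pre>
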
=== Batch learning ===

At each iteration, the update is applied simultaneously to all (or a subset of) the training instances, with every gradient <math>\nabla_{i,y}</math> computed from the current <math>\alpha</math> before any <math>\alpha_i</math> is changed.

=== Online learning ===

At each iteration, a single training instance <math>i</math> is chosen and only its dual distribution <math>\alpha_i</math> is updated.

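Continuing the sketch above, an online pass cycles through the training set, updating one <math>\alpha_i</math> at a time while keeping <math>\mathbf{w}(\alpha)</math> consistent with each change (again, a hypothetical illustration rather than the authors' implementation):

<pre>
def eg_online_epoch(alpha, psi, C, eta):
    """One online EG pass; alpha is (n, k), psi is (n, k, d) as above (sketch)."""
    n = alpha.shape[0]
    w = np.tensordot(alpha, psi, axes=([0, 1], [0, 1]))   # w(alpha), shape (d,)
    for i in np.random.permutation(n):
        new_alpha_i = eg_update_example(alpha[i], psi[i], w, C, eta)
        w += (new_alpha_i - alpha[i]) @ psi[i]             # incremental update of w(alpha)
        alpha[i] = new_alpha_i
    return alpha, w
</pre>
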
=== Convergence rate of the batch algorithm ===

To get within <math>\epsilon</math> of the optimum, the batch algorithm needs <math>O(\frac{1}{\eta\epsilon})</math> iterations.

== Experimental Results ==

The authors compared the performance of the EG algorithm to conjugate-gradient and L-BFGS methods.

=== Multiclass classification ===

The authors used a subset of the MNIST handwritten digit classification dataset.

[[File:multiclass.png]]

It can be seen that the EG algorithm converges considerably faster than the other methods.

=== Structured learning (dependency parsing) ===

The authors used the Slovene data from the [[UsesDataset::CoNLL-X]] shared task on multilingual dependency parsing.

[[File:depparse.png]]

It can be seen that the EG algorithm converges faster, both in terms of the objective function and in terms of accuracy.

== Related Papers ==

The approach here is also similar to the use of EG algorithms for large-margin structured classification in [[RelatedPaper::Bartlett et al NIPS 2004]].