{{MyCiteconference | booktitle = Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics | coauthors = T. Zhang | date = 2005| first = R. K.| last = Ando | pages = 1-9| title = A high-performance semi-supervised learning method for text chunking | url = http://acl.ldc.upenn.edu/P/P05/P05-1001.pdf }}
 
 
 
This [[Category::Paper]] is available online [http://acl.ldc.upenn.edu/P/P05/P05-1001.pdf].
 
 
 
== Summary ==
 
 
 
This paper describes an exponentiated gradient (EG) algorithm for training conditional log-linear models. Conditional log-linear models are used for several key structured prediction tasks such as [[AddressesProblem::Named Entity Recognition | NER]], [[AddressesProblem::POS Tagging | POS tagging]], and [[AddressesProblem::Parsing | parsing]].
 
 
 
In this paper, the authors propose a fast and efficient algorithm for optimizing log-linear models such as  [[RelatedPaper::Lafferty_2001_Conditional_Random_Fields | CRFs]].
 
 
 
The conditional log-likelihood of a CRF is commonly optimized with [[conjugate-gradient]] or [[L-BFGS]] algorithms ([[RelatedPaper::Sha_2003_shallow_parsing_with_conditional_random_fields | Sha & Pereira, 2003]]), which typically require at least one pass through the entire dataset before updating the weight vector. The authors' approach here is an online algorithm based on exponentiated gradient updates ([[RelatedPaper::Kivinen & Warmuth JCSS 1997 | Kivinen & Warmuth, 1997]]).
 
 
 
== Brief description of the method ==
 
 
 
Consider a supervised learning setting with objects <math>x\in\mathcal{X}</math> and corresponding labels <math>y\in\mathcal{Y}</math>, which may be trees, sequences, or other high-dimensional structures. Also, assume we are given a function <math>\phi(x,y)</math> that maps <math>(x,y)</math> pairs to feature vectors in <math>\mathcal{R}^d</math>. Given a parameter vector <math>\mathbf{w}\in\mathcal{R}^d</math>, a conditional log-linear model defines a distribution over labels as:
 
 
 
<math>p(y|x;\mathbf{w})=\frac{1}{Z_x}\exp\left(\mathbf{w}\cdot\phi(x,y)\right)</math>
 
 
 
where <math>Z_x</math> is the partition function.
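
The following is a minimal sketch (not from the paper) of how such a model can be evaluated when the label set is small enough to enumerate explicitly; the feature map and toy example below are made up for illustration.

<syntaxhighlight lang="python">
# A minimal sketch of a conditional log-linear model over an explicitly
# enumerable label set; the feature map phi and the toy data are hypothetical.
import numpy as np

def log_linear_probs(w, phi, x, labels):
    """Return p(y|x; w) for every y in `labels`."""
    scores = np.array([w @ phi(x, y) for y in labels])
    scores -= scores.max()            # shift scores for numerical stability
    expd = np.exp(scores)
    Z_x = expd.sum()                  # the partition function Z_x
    return expd / Z_x

# Toy example: 3 labels, d = 3 features.
labels, d = [0, 1, 2], 3
phi = lambda x, y: np.eye(len(labels), d)[y] * x   # hypothetical feature map
w = np.array([0.5, -0.2, 0.1])
print(log_linear_probs(w, phi, 2.0, labels))
</syntaxhighlight>

For structured outputs such as sequences or trees, the label set is exponentially large, so <math>Z_x</math> is computed with dynamic programming rather than by explicit enumeration as above.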
 
 
 
The problem of learning <math>\mathbf{w}</math> from the training data is thus to find the <math>\mathbf{w}</math> that maximizes the regularized log-likelihood:
 
 
 
<math>\mathbf{w}^{*}=\arg\max_w\sum_i\log p(y_i|x_i;\mathbf{w})-\frac{C}{2}\lVert\mathbf{w}\rVert^2</math>
 
 
 
where <math>C</math> is the regularization parameter. The above problem has a convex dual, which is derived in [[RelatedPaper::Lebanon and Lafferty NIPS 2001]]. With dual variables <math>\alpha_{i,y}</math> and <math>\mathbf{\alpha}=[\mathbf{\alpha}_1, \mathbf{\alpha}_2, \cdots, \mathbf{\alpha}_n]</math>, we define:
 
 
 
<math>Q(\mathbf{\alpha})=\sum_i\sum_y \alpha_{i,y}\log\alpha_{i,y}+\frac{1}{2C}\lVert\mathbf{w}(\alpha)\rVert^2</math>
 
 
 
where

<math>\mathbf{w}(\alpha)=\sum_i\sum_y\alpha_{i,y}\left(\phi(x_i,y_i)-\phi(x_i,y)\right)</math>
 
 
 
The dual problem is thus
 
 
 
<math>\alpha^*=\arg\min_{\alpha\in\Delta^n} Q(\alpha)</math>
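A rough sketch of these two quantities, under the same enumerable-label-set assumption as before (here <math>\alpha</math> is an <math>n\times|\mathcal{Y}|</math> array whose rows are distributions over labels; the data and feature map are placeholders, not the paper's setup):

<syntaxhighlight lang="python">
# A sketch of w(alpha) and the dual objective Q(alpha); data and phi are
# hypothetical placeholders. Assumes every alpha[i, y] > 0.
import numpy as np

def w_of_alpha(alpha, phi, data, labels):
    """w(alpha) = sum_i sum_y alpha[i,y] * (phi(x_i, y_i) - phi(x_i, y))."""
    d = len(phi(data[0][0], labels[0]))
    w = np.zeros(d)
    for i, (x, y_gold) in enumerate(data):
        for j, y in enumerate(labels):
            w += alpha[i, j] * (phi(x, y_gold) - phi(x, y))
    return w

def Q(alpha, phi, data, labels, C):
    """Q(alpha) = sum_i sum_y alpha[i,y] log alpha[i,y] + ||w(alpha)||^2 / (2C)."""
    entropy_term = float(np.sum(alpha * np.log(alpha)))
    w = w_of_alpha(alpha, phi, data, labels)
    return entropy_term + (w @ w) / (2.0 * C)
</syntaxhighlight>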
 
 
 
== EG Algorithm ==
 
 
 
Given a set of distributions <math>\alpha\in\Delta^n</math>, the update equations are
 
 
 
<math>\alpha'_{i,y}=\frac{1}{Z_i}\alpha_{i,y}\exp(-\eta\nabla_{i,y})</math>
 
 
 
where
 
 
 
<math>Z_i=\sum_{\hat{y}}\alpha_{i,\hat{y}}\exp(-\eta\nabla_{i,\hat{y}})</math>
 
 
 
and
 
 
 
<math>\nabla_{i,y}=\frac{\partial Q(\alpha)}{\partial\alpha_{i,y}}=1+\log\alpha_{i,y}+\frac{1}{C}\mathbf{w}(\alpha)\cdot\left(\phi(x_i,y_i)-\phi(x_i,y)\right)</math>
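Concretely, a single EG step on the distribution <math>\alpha_i</math> (one block of dual variables) can be sketched as follows; this mirrors the update and gradient formulas above, and assumes <math>\mathbf{w}(\alpha)</math> has already been computed as in the previous sketch:

<syntaxhighlight lang="python">
# A sketch of one exponentiated-gradient step on alpha_i; w is assumed to be
# the current w(alpha), and phi/labels are the same hypothetical placeholders.
import numpy as np

def gradient_row(alpha_i, w, phi, x_i, y_gold, labels, C):
    """grad_{i,y} = 1 + log alpha_{i,y} + (1/C) w . (phi(x_i,y_i) - phi(x_i,y))."""
    return np.array([1.0 + np.log(alpha_i[j])
                     + (w @ (phi(x_i, y_gold) - phi(x_i, y))) / C
                     for j, y in enumerate(labels)])

def eg_step(alpha_i, grad_i, eta):
    """alpha'_{i,y} proportional to alpha_{i,y} * exp(-eta * grad_{i,y})."""
    unnorm = alpha_i * np.exp(-eta * grad_i)
    return unnorm / unnorm.sum()      # dividing by Z_i keeps alpha_i a distribution
</syntaxhighlight>

Because the update multiplies the current <math>\alpha_{i,y}</math> by a positive factor and then renormalizes, each <math>\alpha_i</math> automatically remains a valid distribution in the simplex.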
 
 
 
=== Batch learning ===
 
 
 
At each iteration, <math>\alpha'</math> is computed simultaneously for all (or a subset of) the available training instances; a combined sketch of the batch and online variants appears below.
 
 
 
=== Online learning ===
 
 
 
At each iteration, we choose a single training instance and update <math>\alpha'</math> for that instance only.
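
A self-contained toy sketch contrasting the two variants; everything below (data, feature map, <math>\eta</math>, <math>C</math>, iteration counts) is made up for illustration and is not the paper's experimental setup.

<syntaxhighlight lang="python">
# A toy end-to-end sketch of batch vs. online EG training on the dual.
import numpy as np

labels, d = [0, 1, 2], 3
phi = lambda x, y: np.eye(len(labels), d)[y] * x        # toy feature map
data = [(1.0, 0), (2.0, 1), (0.5, 2)]                   # (x_i, y_i) pairs
C, eta, n = 1.0, 0.5, len(data)

def w_of_alpha(alpha):
    w = np.zeros(d)
    for i, (x, y_gold) in enumerate(data):
        for j, y in enumerate(labels):
            w += alpha[i, j] * (phi(x, y_gold) - phi(x, y))
    return w

def grad_row(alpha, w, i):
    x, y_gold = data[i]
    return np.array([1.0 + np.log(alpha[i, j])
                     + (w @ (phi(x, y_gold) - phi(x, y))) / C
                     for j, y in enumerate(labels)])

def eg_step(alpha_row, grad):
    unnorm = alpha_row * np.exp(-eta * grad)
    return unnorm / unnorm.sum()

alpha = np.full((n, len(labels)), 1.0 / len(labels))    # start from uniform rows

# Batch variant: every row of alpha is updated from the same w(alpha).
for _ in range(50):
    w = w_of_alpha(alpha)
    alpha = np.vstack([eg_step(alpha[i], grad_row(alpha, w, i)) for i in range(n)])

# Online variant: one row at a time, recomputing w(alpha) after each update.
for _ in range(50):
    for i in range(n):
        alpha[i] = eg_step(alpha[i], grad_row(alpha, w_of_alpha(alpha), i))

print("w(alpha) after training:", w_of_alpha(alpha))
</syntaxhighlight>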
 
 
 
=== Convergence rate of batch algorithm ===
 
 
 
To get within <math>\epsilon</math> of the optimum, the batch algorithm needs <math>O(\frac{1}{\eta\epsilon})</math> iterations.
 
 
 
== Experimental Result ==
 
 
 
The authors compared the performance of the EG algorithm to conjugate-gradient and L-BFGS methods.
 
 
 
=== Multiclass classification ===
 
 
 
The authors used a subset of the MNIST handwritten digit dataset for multiclass classification.
 
 
 
[[File:multiclass.png]]
 
 
 
It can be seen that the EG algorithm converges considerably faster than the other methods.
 
 
 
=== Structured learning (dependency parsing) ===
 
 
 
The authors used the Slovene data from the [[UsesDataset::CoNLL-X]] Shared Task on multilingual dependency parsing.
 
 
 
[[File:depparse.png]]
 
 
 
It can be seen that the EG algorithm converges faster in terms of both the objective function and accuracy.
 
 
 
== Related Papers ==
 
 
 
The approach here is also similar to the use of EG algorithms for large margin structured classification in [[RelatedPaper::Bartlett et al NIPS 2004]].
 
