{{MyCiteconference | booktitle = Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics | coauthors = T. Zhang | date = 2005| first = R. K.| last = Ando | pages = 1-9| title = A high-performance semi-supervised learning method for text chunking | url = http://acl.ldc.upenn.edu/P/P05/P05-1001.pdf }}

This [[Category::Paper]] is available online [http://acl.ldc.upenn.edu/P/P05/P05-1001.pdf].

== Summary ==

This paper describes an exponentiated gradient (EG) algorithm for training conditional log-linear models. Conditional log-linear models are used for several key structured prediction tasks such as [[AddressesProblem::Named Entity Recognition | NER]], [[AddressesProblem::POS Tagging | POS tagging]], and [[AddressesProblem::Parsing | parsing]].

The authors propose a fast and efficient algorithm for optimizing log-linear models such as [[RelatedPaper::Lafferty_2001_Conditional_Random_Fields | CRFs]].

The conditional log-likelihood of a CRF is commonly optimized with [[conjugate-gradient]] or [[L-BFGS]] algorithms ([[RelatedPaper::Sha_2003_shallow_parsing_with_conditional_random_fields | Sha & Pereira, 2003]]), which typically require at least one pass through the entire dataset before updating the weight vector. The authors' approach is instead an online algorithm based on exponentiated gradient updates ([[RelatedPaper::Kivinen & Warmuth JCSS 1997 | Kivinen & Warmuth, 1997]]).

== Brief description of the method ==

Consider a supervised learning setting with objects <math>x\in\mathcal{X}</math> and corresponding labels <math>y\in\mathcal{Y}</math>, which may be trees, sequences, or other high-dimensional structures. Also assume we are given a function <math>\phi(x,y)</math> that maps <math>(x,y)</math> pairs to feature vectors in <math>\mathbb{R}^d</math>; for example, in POS tagging a single coordinate of <math>\phi(x,y)</math> might count how often a particular word receives a particular tag in <math>y</math>. Given a parameter vector <math>\mathbf{w}\in\mathbb{R}^d</math>, a conditional log-linear model defines a distribution over labels as:

<math>p(y|x;\mathbf{w})=\frac{1}{Z_x}\exp\left(\mathbf{w}\cdot\phi(x,y)\right)</math>

where <math>Z_x</math> is the partition function.

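Written out explicitly, the partition function sums over all candidate labels for <math>x</math>:

<math>Z_x=\sum_{y\in\mathcal{Y}}\exp\left(\mathbf{w}\cdot\phi(x,y)\right)</math>
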
The problem of learning <math>\mathbf{w}</math> from the training data is thus that of finding the <math>\mathbf{w}</math> which maximizes the regularized log-likelihood:

<math>\mathbf{w}^{*}=\arg\max_{\mathbf{w}}\sum_i\log p(y_i|x_i;\mathbf{w})-\frac{C}{2}\lVert\mathbf{w}\rVert^2</math>

where <math>C</math> is the regularization parameter. The above problem has a convex dual, which is derived in [[RelatedPaper::Lebanon and Lafferty NIPS 2001]]. With dual variables <math>\alpha_{i,y}</math>, and <math>\mathbf{\alpha}=[\mathbf{\alpha}_1, \mathbf{\alpha}_2, \cdots, \mathbf{\alpha}_n]</math>, we define:

<math>Q(\mathbf{\alpha})=\sum_i\sum_y \alpha_{i,y}\log\alpha_{i,y}+\frac{1}{2C}\lVert\mathbf{w}(\alpha)\rVert^2</math>

where

<math>\mathbf{w}(\alpha)=\sum_i\sum_y\alpha_{i,y}\left(\phi(x_i,y_i)-\phi(x_i,y)\right)</math>

The dual problem is thus

<math>\alpha^*=\arg\min_{\alpha\in\Delta^n} Q(\alpha)</math>

where <math>\Delta^n</math> denotes the set of dual variables <math>\alpha</math> in which each <math>\alpha_i</math> is a probability distribution over the candidate labels of the <math>i</math>-th training example.

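Solving the dual also recovers the primal solution: at the optimum the dual variables coincide with the model's conditional distributions, <math>\alpha^*_{i,y}=p(y|x_i;\mathbf{w}^*)</math>, and (under the scaling used above) the optimal weights are obtained from the mapping <math>\mathbf{w}(\cdot)</math> as

<math>\mathbf{w}^{*}=\frac{1}{C}\mathbf{w}(\alpha^{*})</math>
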
== EG Algorithm ==

Given a set of distributions <math>\alpha\in\Delta^n</math>, the update equations are

<math>\alpha'_{i,y}=\frac{1}{Z_i}\alpha_{i,y}\exp(-\eta\nabla_{i,y})</math>

where <math>\eta>0</math> is a learning rate,

<math>Z_i=\sum_{\hat{y}}\alpha_{i,\hat{y}}\exp(-\eta\nabla_{i,\hat{y}})</math>

and

<math>\nabla_{i,y}=\frac{\partial Q(\alpha)}{\partial\alpha_{i,y}}=1+\log\alpha_{i,y}+\frac{1}{C}\mathbf{w}(\alpha)\cdot\left(\phi(x_i,y_i)-\phi(x_i,y)\right)</math>

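For concreteness, the per-example update can be written in a few lines of NumPy. This is only a sketch for a flat multiclass setting in which the label set is small enough to enumerate explicitly (the structured case instead requires factored representations of <math>\alpha</math>), and the function and variable names are illustrative rather than taken from the paper:

<pre>
import numpy as np

def eg_update_example(alpha_i, psi_i, w, C, eta):
    """One EG update for training example i (sketch).

    alpha_i : (k,) current dual distribution over the k candidate labels of example i
    psi_i   : (k, d) difference features, psi_i[y] = phi(x_i, y_i) - phi(x_i, y)
    w       : (d,) current w(alpha) = sum over i, y of alpha[i, y] * psi[i, y]
    """
    grad = 1.0 + np.log(alpha_i) + (psi_i @ w) / C   # gradient of Q w.r.t. alpha_i
    unnorm = alpha_i * np.exp(-eta * grad)           # multiplicative (EG) update
    return unnorm / unnorm.sum()                     # normalizing by Z_i
</pre>
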
=== Batch learning ===

At each iteration, the update is applied simultaneously to all (or a subset of) the training instances, with every gradient <math>\nabla_{i,y}</math> computed from the current <math>\alpha</math> before any <math>\alpha_i</math> is changed.

=== Online learning ===

At each iteration, a single training instance <math>i</math> is chosen and only its dual distribution <math>\alpha_i</math> is updated.

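Continuing the sketch above, an online pass cycles through the training set, updating one <math>\alpha_i</math> at a time while keeping <math>\mathbf{w}(\alpha)</math> consistent with each change (again, a hypothetical illustration rather than the authors' implementation):

<pre>
def eg_online_epoch(alpha, psi, C, eta):
    """One online EG pass; alpha is (n, k), psi is (n, k, d) as above (sketch)."""
    n = alpha.shape[0]
    w = np.tensordot(alpha, psi, axes=([0, 1], [0, 1]))   # w(alpha), shape (d,)
    for i in np.random.permutation(n):
        new_alpha_i = eg_update_example(alpha[i], psi[i], w, C, eta)
        w += (new_alpha_i - alpha[i]) @ psi[i]             # incremental update of w(alpha)
        alpha[i] = new_alpha_i
    return alpha, w
</pre>
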
=== Convergence rate of the batch algorithm ===

To get within <math>\epsilon</math> of the optimum, the batch algorithm needs <math>O(\frac{1}{\eta\epsilon})</math> iterations.

== Experimental Results ==

The authors compared the performance of the EG algorithm to conjugate-gradient and L-BFGS methods.

=== Multiclass classification ===

The authors used a subset of the MNIST handwritten digit classification dataset.

[[File:multiclass.png]]

It can be seen that the EG algorithm converges considerably faster than the other methods.

=== Structured learning (dependency parsing) ===

The authors used the Slovene data from the [[UsesDataset::CoNLL-X]] shared task on multilingual dependency parsing.

[[File:depparse.png]]

It can be seen that the EG algorithm converges faster, both in terms of the objective function and in terms of accuracy.

== Related Papers ==

The approach here is also similar to the use of EG algorithms for large-margin structured classification in [[RelatedPaper::Bartlett et al NIPS 2004]].