# Taskar et al. 2004. Max-margin Parsing

Max-margin parsing, by Taskar, B., Klein, D., Collins, M., Koller, D. and Manning, C. In Proc. EMNLP, 2004.

This paper is available online [1].

## Summary

This paper presents a novel approach to parsing that maximizes separating margins, in the style of support vector machines. The authors show how parsing can be reformulated as a discriminative task, which allows an arbitrary number of features to be used. This formulation also lets them incorporate a loss function that directly penalizes incorrect parse trees.

## Brief description of the method

Instead of a probabilistic interpretation of parse trees, we seek a weight vector ${\displaystyle \mathbf {w} }$ such that:

${\displaystyle y_{i}=\arg \max _{y\in \mathbf {G} (x_{i})}\langle \mathbf {w} ,\Phi (x_{i},y)\rangle }$

for all sentences ${\displaystyle x_{i}}$ in the training data, where ${\displaystyle y_{i}}$ is the gold parse tree and ${\displaystyle \mathbf {G} (x_{i})}$ is the set of possible parses for ${\displaystyle x_{i}}$.

Formulating this as a margin-maximization problem,

${\displaystyle \max _{\gamma ,\mathbf {w} }\;\gamma \quad {\text{s.t. }}\;\langle \mathbf {w} ,\Phi _{i,y_{i}}-\Phi _{i,y}\rangle \geq \gamma L_{i,y}\;\;\forall i,\forall y\in \mathbf {G} (x_{i});\quad \|\mathbf {w} \|^{2}\leq 1}$

Following the standard SVM derivation, the dual of the above program is

${\displaystyle \max _{\alpha }\;C\sum _{i,y}\alpha _{i,y}L_{i,y}-{\dfrac {1}{2}}{\Big \Vert }C\sum _{i,y}(I_{i,y}-\alpha _{i,y})\Phi _{i,y}{\Big \Vert }^{2}}$

s.t. ${\displaystyle \sum _{y}\alpha _{i,y}=1\;\forall i;\quad \alpha _{i,y}\geq 0\;\forall i,y}$

where ${\displaystyle I_{i,y}}$ indicates whether ${\displaystyle y}$ is the true parse of sentence ${\displaystyle i}$.

Naively, each sentence requires enumerating all possible parse trees, a set that is exponential in size. However, we can exploit local substructure, as in the chart-parsing dynamic programming algorithm, to factor trees into parts such as ${\displaystyle \langle A,s,e,i\rangle }$ (a constituent) and ${\displaystyle \langle A\rightarrow BC,s,m,e,i\rangle }$ (a rule application), where ${\displaystyle s,m,e,i}$ denote the start, split, and end points and the sentence number, respectively.
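This factorization can be illustrated with a toy routine that walks a binarized tree and emits its constituent and rule parts (the nested-tuple tree encoding here is an illustrative assumption, not the paper's data structure; the sentence index is omitted):

```python
# Sketch: factor a binarized parse tree into parts <A, s, e> and <A->BC, s, m, e>.
# Trees are toy nested tuples: (label, left, right) or (label, "word").

def tree_parts(tree, s=0):
    """Return (list of parts, end index) for the subtree starting at position s."""
    label = tree[0]
    if len(tree) == 2 and isinstance(tree[1], str):  # preterminal over one word
        return [(label, s, s + 1)], s + 1
    left_parts, m = tree_parts(tree[1], s)           # left child spans [s, m)
    right_parts, e = tree_parts(tree[2], m)          # right child spans [m, e)
    rule = (f"{label}->{tree[1][0]} {tree[2][0]}", s, m, e)
    return [(label, s, e), rule] + left_parts + right_parts, e

# "the dog barks" as (S (NP (DT the) (NN dog)) (VB barks))
tree = ("S", ("NP", ("DT", "the"), ("NN", "dog")), ("VB", "barks"))
parts, n = tree_parts(tree)
print(n)      # 3 words
print(parts)  # includes ("S", 0, 3) and ("NP->DT NN", 0, 1, 2)
```

The key point is that a tree is fully determined by its set of parts, so features and losses defined over parts cover the whole tree.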

Therefore,

${\displaystyle \Phi (x,y)=\sum _{r\in R(x,y)}\phi (x,r)}$

where ${\displaystyle R(x,y)}$ is the set of parts making up the parse ${\displaystyle y}$. ${\displaystyle \phi }$ can be any function that maps a part to a feature vector representation. The loss function can likewise be decomposed into a sum over parts; in the paper, the loss used is the number of constituent errors made in a parse.

By working with parts, the factored dual objective can be expressed in a polynomial number of variables, in fact cubic in the length of the sentence.
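The cubic count is easy to verify: a rule part ${\displaystyle \langle A\rightarrow BC,s,m,e\rangle }$ is determined (per grammar rule) by three positions ${\displaystyle 0\leq s<m<e\leq n}$, giving ${\displaystyle {\binom {n+1}{3}}=O(n^{3})}$ choices. A quick check:

```python
# Sketch: count the (s, m, e) position triples with 0 <= s < m < e <= n.
# This equals C(n+1, 3), i.e. O(n^3) rule parts per grammar rule.

def num_span_triples(n):
    return sum(1 for s in range(n)
                 for m in range(s + 1, n)
                 for e in range(m + 1, n + 1))

print(num_span_triples(10))  # C(11, 3) = 165
```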

## Results

Experiments on the Penn Treebank with lexical features achieved an F1 improvement of 0.43 over the Collins (1999) parser.

## Related Papers

- McDonald et al., ACL 2005, "Non-projective dependency parsing using spanning tree algorithms": margin-based learning for dependency parsing.
- Tsochantaridis, Joachims et al., 2004, "Support vector machine learning for interdependent and structured output spaces": using SVMs for structured output spaces.