# McDonald et al, ACL 2005: Non-Projective Dependency Parsing Using Spanning Tree Algorithms

## Citation

R. McDonald, F. Pereira, K. Ribarov, J. Hajič. Non-projective dependency parsing using spanning tree algorithms, Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, pp. 523-530, Vancouver, October 2005.

## Summary

This paper addresses the problem of non-projective dependency parsing.

Given a sentence ${\displaystyle \mathbf {x} =(x_{1},\ldots ,x_{n})}$, we can construct a directed graph ${\displaystyle \mathbf {G_{x}} =(V,E)}$. The vertex set ${\displaystyle V}$ contains one vertex ${\displaystyle v_{i}}$ for each word ${\displaystyle x_{i}}$ in the sentence, plus a dummy vertex ${\displaystyle v_{0}}$ for the "root". The edge set ${\displaystyle E}$ contains all directed edges of the form ${\displaystyle (v_{i},v_{j})}$, where ${\displaystyle 0\leq i\leq n}$, ${\displaystyle 1\leq j\leq n}$, and ${\displaystyle i\neq j}$. Each edge has a score of the form ${\displaystyle s(i,j)=\mathbf {w} \cdot \mathbf {f} (i,j)}$, where ${\displaystyle \mathbf {f} (i,j)}$ is a feature vector depending on the words ${\displaystyle x_{i}}$ and ${\displaystyle x_{j}}$, and ${\displaystyle \mathbf {w} }$ is a weight factor.

A dependency parse tree ${\displaystyle \mathbf {y} }$ is a subgraph of ${\displaystyle \mathbf {G_{x}} }$ which covers all the vertices in ${\displaystyle V}$, and in which each vertex has exactly one predecessor (except for the root vertex which has no predecessor). A projective dependency parse tree has the additional constraint that each of its subtrees covers a contiguous region of the sentence. In either case, the score of a dependency tree ${\displaystyle \mathbf {y} }$ is factored as the sum of the scores of its edges:

${\displaystyle s(\mathbf {x} ,\mathbf {y} )=\sum _{(i,j)\in \mathbf {y} }\mathbf {w} \cdot \mathbf {f} (i,j)}$

The (projective) decoding problem is to find the max-scoring (projective) dependency parse tree ${\displaystyle \mathbf {y} }$ given a sentence ${\displaystyle \mathbf {x} }$, assuming that the weight vector ${\displaystyle \mathbf {w} }$ is known. The learning problem is to find an optimal weight vector ${\displaystyle \mathbf {w} }$, such that the sum of the scores of the dependency parse trees in a training corpus is maximized.

### The Decoding Problem

The non-projective decoding problem is equivalent to finding the maximum spanning tree in a directed graph (also called the maximum arborescence). This can be solved using the Chu-Liu-Edmonds algorithm [1].

A naive implementation of the Chu-Liu-Edmonds algorithm has a time complexity of ${\displaystyle O(|V|^{3})}$. In 1977, Robert Tarjan implemented the algorithm with ${\displaystyle O(|E|\log |V|)}$ complexity for sparse graphs and ${\displaystyle O(|V|^{2})}$ complexity for dense graphs [2], the latter of which is used by this paper. In 1986, Gabow, Galil, Spencer, and Tarjan made an even faster implementation with a complexity of ${\displaystyle O(|E|+|V|\log |V|)}$ [3].

A major advantage of the maximum spanning tree solution over previous solutions is its uniformity and simplicity. Previous algorithms for non-projective dependency parsing were modifications to the Eisner algorithm (a dynamic programming algorithm of complexity ${\displaystyle O(|V|^{3})}$), and often involve approximation. In contrast, the maximum spanning tree solution searches the entire space of dependency parse trees, and it reveals the fact that non-projective dependency parsing is actually easier than projective dependency parsing.

### The Learning Problem

An online large-margin learning algorithm, called MIRA, is used to train the weight vector ${\displaystyle \mathbf {w} }$. The algorithm passes through the training corpus multiple times, and for each training example ${\displaystyle \mathbf {x} _{t},\mathbf {y} _{t}}$, it updates the weight vector so that the scores of ${\displaystyle \mathbf {y} _{t}}$ and any other parse tree ${\displaystyle \mathbf {y} '}$ are separated at least by a loss function ${\displaystyle L(\mathbf {y} _{t},\mathbf {y} ')}$. The loss function is defined as the number of vertices that have different parents in the two trees. The weight vectors after each update are averaged to yield the final weight vector.

Step 4 in the MIRA algorithm is an optimization problem with exponentially many constraints. In order to make the optimization tractable, the paper proposes two solutions. The first is called single-best MIRA, where only one constraint corresponding to the max-scoring parse tree is considered:

The second is called factored MIRA, where the constraints are factored down to the edges:

resulting in ${\displaystyle O(|V|^{2})}$ constraints.

Both variations are easy to implement. The factored MIRA is more restrictive than the original problem, and may therefore rule out the optimal parse tree.

## Experiments

### Dataset

The experiments use two datasets: Prague Dependency Treebank and Penn Treebank. The former is a corpus of dependency parse trees of sentences in Czech, a language more non-projective than English. The entire corpus is referred to as Czech-A, and its subset of sentences with non-projective parse trees (23% of all sentences) is referred to as Czech-B. The Penn Treebank is used to evaluate the performance of the non-projective dependency parsing algorithm on a dominantly projective language (English).

### Criteria

• Accuracy: Percentage of words with their parents correctly identified.
• Complete: Percentage of dependency parse trees correctly reconstructed.

### Involved Systems

Previous systems:

• COLL1999: The projective lexicalized phrase-structure parse of Collins et al. (1999).
• N&N2005: The pseudo-projective parse of Nivre and Nilsson (2005).
• McD2005: The projective parser of McDonald et al. (2005) that uses the Eisner algorithm for both training and testing. This system uses 5-best MIRA.

Proposed systems:

• Single-best MIRA
• Factored MIRA

### Results

The proposed systems perform better for a highly non-projective language (Czech), but less well for a dominantly projective language (English).