McDonald et al, ACL 2005: Non-Projective Dependency Parsing Using Spanning Tree Algorithms

== Citation ==

R. McDonald, F. Pereira, K. Ribarov, J. Hajič. Non-projective dependency parsing using spanning tree algorithms. In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, pp. 523-530, Vancouver, October 2005.

== Online Version ==

PDF version

== Summary ==

This paper addresses the problem of non-projective dependency parsing.
 
Given a sentence <math>\mathbf{x} = (x_1, \ldots, x_n)</math>, we can construct a directed graph <math>\mathbf{G_x} = (V,E)</math>. The vertex set <math>V</math> contains one vertex <math>v_i</math> for each word <math>x_i</math> in the sentence, plus a dummy vertex <math>v_0</math> for the "root". The edge set <math>E</math> contains all directed edges of the form <math>(v_i, v_j)</math>, where <math>0 \le i \le n</math>, <math>1 \le j \le n</math>, and <math>i \ne j</math>. Each edge has a score of the form <math>s(i,j) = \mathbf{w} \cdot \mathbf{f}(i,j)</math>, where <math>\mathbf{f}(i,j)</math> is a feature vector depending on the words <math>x_i</math> and <math>x_j</math>, and <math>\mathbf{w}</math> is a weight vector.
  
A '''dependency parse tree''' <math>\mathbf{y}</math> is a subgraph of <math>\mathbf{G_x}</math> which covers all the vertices in <math>V</math>, and in which each vertex has exactly one predecessor (except for the root vertex which has no predecessor). A '''projective dependency parse tree''' has the additional constraint that each of its subtrees covers a contiguous region of the sentence. In either case, the score of a dependency tree <math>\mathbf{y}</math> is factored as the sum of the scores of its edges:
 
 
<math>s(\mathbf{x}, \mathbf{y}) = \sum_{(i,j) \in \mathbf{y}} \mathbf{w} \cdot \mathbf{f}(i,j)</math>
 
The '''decoding problem''' is to find the max-scoring (projective or non-projective) dependency parse tree <math>\mathbf{y}</math> for a given sentence <math>\mathbf{x}</math>, assuming that the weight vector <math>\mathbf{w}</math> is known. The '''learning problem''' is to find a weight vector <math>\mathbf{w}</math> under which the correct dependency parse trees in a training corpus score higher than the alternative trees.
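For concreteness, the edge and tree scoring defined above can be sketched as follows. This is an illustration only: <code>features(sentence, i, j)</code> is a hypothetical stand-in for the feature vector <math>\mathbf{f}(i,j)</math>, and the weight vector <math>\mathbf{w}</math> is represented as a dictionary from feature names to weights.

<pre>
# Sketch only: `features(sentence, i, j)` is a hypothetical helper standing in
# for the feature vector f(i, j); `w` maps feature names to weights.

def edge_score(w, sentence, i, j, features):
    """s(i, j) = w . f(i, j): dot product of the weight and edge feature vectors."""
    return sum(w.get(name, 0.0) * value
               for name, value in features(sentence, i, j).items())

def tree_score(w, sentence, tree, features):
    """Score of a dependency tree: the sum of the scores of its edges.

    `tree` is a collection of (head, dependent) index pairs, with index 0
    standing for the artificial root.
    """
    return sum(edge_score(w, sentence, i, j, features) for i, j in tree)
</pre>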
  
 
=== The Decoding Problem ===
 
The non-projective decoding problem is equivalent to finding the '''maximum spanning tree''' in a directed graph (also called the '''maximum arborescence'''). This can be solved using the [[UsesMethod::Chu-Liu-Edmonds algorithm]] ([http://en.wikipedia.org/wiki/Edmonds'_algorithm Wikipedia]).

[[File:Chu-Liu-Edmonds algorithm.png]]

A naive implementation of the Chu-Liu-Edmonds algorithm has a time complexity of <math>O(|V|^3)</math>. In 1977, Robert Tarjan implemented the algorithm with <math>O(|E| \log |V|)</math> complexity for sparse graphs and <math>O(|V|^2)</math> complexity for dense graphs; the latter is the implementation used in this paper. In 1986, Gabow, Galil, Spencer, and Tarjan gave an even faster implementation with a complexity of <math>O(|E| + |V| \log |V|)</math>.
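As an illustration only, the following is a minimal recursive sketch of the naive <math>O(|V|^3)</math> Chu-Liu-Edmonds algorithm over a dense score matrix, where <code>score[i][j]</code> plays the role of <math>s(i,j)</math> and node 0 is the root; it is not the <math>O(|V|^2)</math> implementation used in the paper.

<pre>
def chu_liu_edmonds(score):
    """Maximum spanning arborescence of a dense score matrix.

    score[i][j] is the score of the edge from head i to dependent j;
    node 0 is the artificial root.  Returns a list with the head index
    of every node (the entry for node 0 is a dummy 0).
    """
    n = len(score)
    # Greedily pick the highest-scoring incoming edge for every non-root node.
    head = [0] * n
    for j in range(1, n):
        head[j] = max((i for i in range(n) if i != j), key=lambda i: score[i][j])

    cycle = _find_cycle(head)
    if cycle is None:                      # the greedy choice is already a tree
        return head

    # Contract the cycle into a single node and recurse.
    cyc = set(cycle)
    outside = [0] + [v for v in range(1, n) if v not in cyc]
    c = len(outside)                       # index of the contracted node
    INF = float("-inf")
    new_score = [[INF] * (c + 1) for _ in range(c + 1)]
    enter_at, leave_at = {}, {}            # original endpoints of contracted edges
    for a, u in enumerate(outside):
        for b, v in enumerate(outside):
            if a != b:
                new_score[a][b] = score[u][v]
        # Edge u -> cycle: gain of the best entering edge over the cycle edge
        # it would replace (dropping a constant does not change the argmax).
        best_in = max(cyc, key=lambda v: score[u][v] - score[head[v]][v])
        new_score[a][c] = score[u][best_in] - score[head[best_in]][best_in]
        enter_at[a] = best_in
        # Edge cycle -> u: best outgoing edge from any cycle node.
        best_out = max(cyc, key=lambda w: score[w][u])
        new_score[c][a] = score[best_out][u]
        leave_at[a] = best_out

    sub = chu_liu_edmonds(new_score)

    # Expand the contracted node back into the original indices.
    final = [0] * n
    for b, v in enumerate(outside):
        if v != 0:
            final[v] = leave_at[b] if sub[b] == c else outside[sub[b]]
    for v in cyc:                          # cycle nodes keep their cycle heads ...
        final[v] = head[v]
    entry = enter_at[sub[c]]               # ... except where the external edge enters
    final[entry] = outside[sub[c]]
    return final


def _find_cycle(head):
    """Return one cycle among the head[] edges as a list of nodes, or None."""
    n = len(head)
    color = [0] * n                        # 0 = unseen, 1 = on current path, 2 = done
    for start in range(1, n):
        v, path = start, []
        while color[v] == 0:
            color[v] = 1
            path.append(v)
            v = head[v]
            if v == 0:
                break
        if v != 0 and color[v] == 1:       # ran back into the current path: a cycle
            return path[path.index(v):]
        for u in path:
            color[u] = 2
    return None


if __name__ == "__main__":
    # Toy graph: root (0) plus three words; s[i][j] is the score of edge i -> j.
    NEG = float("-inf")
    s = [[NEG, 5, 1, 2],
         [NEG, NEG, 11, 4],
         [NEG, 10, NEG, 5],
         [NEG, 9, 8, NEG]]
    print(chu_liu_edmonds(s))              # -> [0, 3, 1, 0]: edges 0->3, 3->1, 1->2
</pre>

On the toy matrix, the greedy step first creates a cycle between nodes 1 and 2; the contraction step breaks it by attaching node 1 to node 3 instead.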
A major advantage of the maximum spanning tree formulation over previous approaches is its uniformity and simplicity. Previous algorithms for non-projective dependency parsing were modifications of the [[Eisner algorithm]] (a dynamic programming algorithm with <math>O(|V|^3)</math> complexity) and often involved approximations. In contrast, the maximum spanning tree formulation searches the entire space of dependency parse trees exactly, and it shows that non-projective dependency parsing is in fact easier than projective dependency parsing.
  
 
=== The Learning Problem ===
 
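The paper itself learns <math>\mathbf{w}</math> with MIRA, an online large-margin algorithm. As a rough illustration of the general decode-and-update loop only (not the paper's training procedure), a structured-perceptron-style sketch is shown below; it reuses the hypothetical <code>features</code> helper above and assumes a <code>decode(w, sentence)</code> function that runs the maximum spanning tree decoder and returns one head index per word.

<pre>
from collections import defaultdict

# Illustration only: a structured-perceptron-style update, not the paper's
# MIRA training.  `features(sentence, i, j)` and `decode(w, sentence)` are
# assumed helpers (see the sketches above).

def tree_feature_vector(sentence, heads, features):
    """Sum the edge feature vectors of a tree given as head indices."""
    total = defaultdict(float)
    for dep, head in enumerate(heads):
        if dep == 0:                      # index 0 is the artificial root
            continue
        for name, value in features(sentence, head, dep).items():
            total[name] += value
    return total

def perceptron_train(corpus, features, decode, epochs=10):
    """corpus: a list of (sentence, gold_heads) pairs."""
    w = defaultdict(float)
    for _ in range(epochs):
        for sentence, gold in corpus:
            pred = decode(w, sentence)
            if pred != gold:
                # Move w toward the gold tree's features and away from the
                # predicted tree's features.
                for name, value in tree_feature_vector(sentence, gold, features).items():
                    w[name] += value
                for name, value in tree_feature_vector(sentence, pred, features).items():
                    w[name] -= value
    return w
</pre>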
  
  
 
== Experiments ==
 
