Smith and Eisner 2008: Dependency parsing by belief propagation

Citation

Smith, David A. and Jason Eisner (2008). Dependency parsing by belief propagation. Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 145-156, Honolulu, October.

Online version

Smith and Eisner 2008

Summary

This is a crucial paper that presents a loopy Belief Propagation (BP) method for Dependency Parsing. The same approach can also be applied to related problems such as Named Entity Recognition, Word Alignment, Shallow Parsing, and Constituent Parsing. The paper formulates dependency parsing as a learning and decoding problem on a graphical model with global constraints. The authors show that BP needs only $O(n^{3})$ time to perform approximate inference on this graphical model, even when second-order features and latent variables are incorporated.

Brief Description of the Method

The paper first formulates dependency parsing as training and decoding on Markov random fields, then discusses the use of Belief Propagation to lower the asymptotic runtime of training and decoding. In this section, we first summarize how the problem is formulated, then briefly describe how BP is used for this task. For the details of BP on general probabilistic graphical models, please refer to the method page: Belief Propagation.

Graphical Models for Dependency Parsing

The input to the graphical model is a fully observed word sequence $\mathbf {W} =\{W_{0},W_{1},W_{2},...,W_{n}\}$ . The corresponding part-of-speech tags can be written as $\mathbf {T} =\{T_{0},T_{1},T_{2},...,T_{n}\}$ . A dependency arc between words $i$ and $j$ is denoted by $L_{ij}=true$ , where $i$ is the parent node and $j$ is the child.
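To make the link variables concrete, the following is a minimal Python sketch, not the authors' implementation. It stores each $L_{ij}$ as a boolean and checks a hard global constraint of the kind the page later calls TREE. It assumes that $W_{0}$ acts as an artificial root and that a valid parse gives every other word exactly one parent with no cycles; the name `is_tree` and the dictionary layout are hypothetical, chosen only for illustration.

```python
from typing import Dict, Tuple


def is_tree(links: Dict[Tuple[int, int], bool], n: int) -> bool:
    """Check whether the true link variables form a directed tree rooted at word 0.

    links[(i, j)] is True when word i is the parent of word j (L_ij = true).
    Words are indexed 0..n; word 0 is assumed to be an artificial root.
    """
    parent: Dict[int, int] = {}
    for (i, j), on in links.items():
        if not on:
            continue
        # Reject out-of-range indices, a parent for the root, or a second parent.
        if not (0 <= i <= n and 1 <= j <= n) or j in parent:
            return False
        parent[j] = i

    # Every non-root word needs exactly one parent.
    if len(parent) != n:
        return False

    # Following parent pointers from any word must reach the root without a cycle.
    for start in range(1, n + 1):
        seen = set()
        node = start
        while node != 0:
            if node in seen:
                return False
            seen.add(node)
            node = parent[node]
    return True


# A three-word chain 0 -> 1 -> 2, and a configuration containing a cycle.
chain = {(0, 1): True, (1, 2): True, (0, 2): False}
cycle = {(1, 2): True, (2, 1): True}
print(is_tree(chain, 2), is_tree(cycle, 2))
```

In a BP setting such a constraint would act as a factor over all link variables rather than a post-hoc check; the function above only illustrates which assignments that factor would allow.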

Training and Decoding using BP

Dataset

The authors used three languages from the 2007 CoNLL Dependency Parsing Shared Task. The English data were converted from the Penn Treebank, with around 1% of links crossing other links. The Danish data contained slightly more crossing links (3% in total). Of the three languages, Dutch was the most non-projective (11%).
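The crossing-link percentages above can be illustrated with a short Python sketch. It assumes a sentence is given as a list of head indices, with word 0 as an artificial root, and it counts a link as crossed if it crosses at least one other link. The exact counting convention behind the reported figures is not stated on this page, so this is only one plausible reading, and `crossing_fraction` is a hypothetical helper name.

```python
from typing import List, Tuple


def crossing_fraction(heads: List[int]) -> float:
    """Return the fraction of links that cross at least one other link.

    heads[j] is the parent of word j for j >= 1; heads[0] is ignored
    because word 0 is assumed to be the artificial root.
    """
    # Represent each link by its (left, right) endpoints in sentence order.
    arcs: List[Tuple[int, int]] = [
        (min(h, j), max(h, j)) for j, h in enumerate(heads) if j > 0
    ]

    def crosses(a: Tuple[int, int], b: Tuple[int, int]) -> bool:
        # Two links cross when exactly one endpoint of one lies strictly inside the other.
        (l1, r1), (l2, r2) = a, b
        return l1 < l2 < r1 < r2 or l2 < l1 < r2 < r1

    if not arcs:
        return 0.0
    crossed = sum(any(crosses(a, b) for b in arcs) for a in arcs)
    return crossed / len(arcs)


# Projective chain 0 -> 1 -> 2 -> 3 versus a sentence with crossing links.
print(crossing_fraction([-1, 0, 1, 2]))
print(crossing_fraction([-1, 0, 0, 1, 2]))
```

A sentence whose fraction is zero is projective, which is the case that standard dynamic programming parsers handle exactly.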

Experimental Results

In the experiment section of the paper, the authors conducted three major experiments. First, they explored whether BP can beat Dynamic Programming (DP) in terms of efficiency. Second, they looked at the non-projective parsing problem, checking whether higher-order features were useful and how BP could make them tractable. Finally, they examined whether global constraints contribute to the accuracy of dependency parsing under the proposed BP framework. To present the original results precisely, the following subsections refer to the original figures and tables from the paper.

Efficiency Evaluation: Comparing to Dynamic Programming (DP)

Figures 2 and 3 show that BP is much faster than DP under various settings. Comparing the two figures shows that the gap between BP and DP widens when higher-order features are added (a more complex model). Figure 4 shows the speed vs. error trade-off: 5 iterations of BP reach the best speed with the lowest error rate. Note, however, that this comparison was done in a lower-order setting, where the DP approach was still relatively fast.

Accuracy Evaluation: Higher-Order Non-Projective Parsing

In this experiment, the authors examined whether adding higher-order features can improve parsing accuracy under the proposed BP framework. Table 2 shows that adding the "NoCross", "Grand", and "ChildSEQ" features makes the system significantly outperform the first-order baseline. Table 2 also shows that although a hill-climbing variant of DP improves over standard DP, running non-projective BP is much faster and has slightly higher accuracy.

Accuracy Evaluation: Influences of Global Hard Constraints

In this final experiment, the authors investigated the influence of global hard constraints. Table 3 shows that using the TREE constraint in training is critical in this work, and that global constraints generally improve the overall results.

Related Papers

This paper is related to many others along three dimensions. First, from a natural language parsing perspective, it presents a state-of-the-art inference algorithm for dependency parsing. Second, from a machine learning and structured prediction point of view, it is closely related to other approximate inference algorithms on probabilistic graphical models (e.g. HMMs, CRFs, MRFs, and Bayesian Networks). Finally, the proposed approach might also be applied to other sequence modeling tasks in natural language processing, for example Named Entity Tagging, Part-of-speech Tagging, and Constituent Parsing. Some of the papers related to this work are listed below.