Smith and Eisner 2008:Dependency parsing by belief propagation

Citation

Smith, David A. and Jason Eisner (2008). Dependency parsing by belief propagation. Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 145-156, Honolulu, October.

Online version

Smith and Eisner 2008

Summary

This is a crucial paper that presents a loopy Belief Propagation (BP) method for Dependency Parsing, which can also be easily applied to general problems in Named Entity Recognition, Word Alignment, Shallow Parsing, and Constituent Parsing. The paper formulates the dependency parsing problem as a learning and decoding problem on a graphical model with global constraints. The authors show that BP needs only $O(n^{3})$ time to perform approximate inference on a graphical model, with second-order features and latent variables incorporated.

Brief Description of the Method

This paper first introduces the method of formulating the dependency parsing problem as training and decoding on Markov random fields, then discusses the use of Belief Propagation to lower asymptotic runtime during training and decoding. In this section, we will first summarize the method they use to formulate the problem, then briefly describe the method of using BP for this task. Regarding the detailed BP method for general probabilistic graphical models, please refer to the method page:　Belief Propagation．

Graphical Models for Dependency Parsing

The input of the graph will be fully observed word sequence $\mathbf {W} ={W_{0},W_{1},W_{2},...,W_{n}}$ . The corresponding parts-of-speech tags can be written as $\mathbf {T} ={T_{0},T_{1},T_{2},...,T_{n}}$ . The dependency arcs between the words i and j can be denoted by $L_{ij}=true$ , where i represents the parent node, and j as the child. Then, a probability distribution of the assignment of all variables can be represented using the Markov random field (MRF).

p(\Lambda ){\overset {\underset {\mathrm {def} }{}}{=}}{\frac {1}{Z(\theta )}}\prod {m}F_{m}(\Lambda )

Here, we can say that there exists a function that consults a subset of $\Lambda$ , and the function has either unary, binary, ternary or global degrees. Given a learned parameter $\theta$ and a factor function $F_{m}(\Lambda )$ , we can decode the possible dependency parses.

Training and Decoding using BP

The basic idea for training is to train a weight vector $\theta$ that maximizes the log-likelihood of the training data. Note that the difficulty here is to calculate the gradient of the objective function

\nabla _{\theta }\log Z=\sum _{m}E_{p(\Lambda )}[\nabla _{\theta }F_{m}(\Lambda )]

Remember that in Belief Propagation, we use an estimate of the marginal distribution, which can be viewed as the belief of the factor $F_{m}$ , given W and \theta. The authors then use the stochastic gradient descent to estimate its gradient, rather than compute the objection function.

BP is used to compute the conditional probability of a link $L_{ij}$ is present. The basic idea here is to use BP and propagate the local factors in the graph. Note that the authors have also employed global factors in this work. So, the trick here is to run backward algorithms from marginal distributions (this is very similar to the forward-backward algorithm for extracting posteriors from hidden Markov models). The key component of decoding task here is to find a single assignment that satisfies the parse TREE constraint. So, a simple approach would be returning the 1-best tree, k-best trees, or take samples. The authors take the full marginal distributions at the links, and feed them to a maximum spanning tree algorithm. The entire process somehow resembles the minimum Bayes risk parsing with a dependency loss function.

Dataset

The author used three languages from the 2007 CoNLL Dependency Parsing Shared Task. The English data were converted from the Penn Treebank, with around 1% of links crossed other links. In terms of the Danish data, it contained slightly more crossing arcs (3% in total). When comparing to these two languages, Dutch was the most non-projective language (11%).

Experimental Results

In the experiment section of this paper, the authors conducted three major experiments. First, they explored whether BP can beat Dynamic Programming (DP), in terms of the efficiency. Secondly, they looked at the non-projective parsing problem, and checked whether high-order features were useful, and how BP could make it tractable. Last but not the least, they have also examined whether global constraints contribute to the accuracy of dependency parsing, under this proposed BP framework. To precisely present the original results in the following subsections, we use the original figures and tables taken from this paper.

Efficiency Evaluation: Comparing to Dynamic Programming (DP)

In Figure 2 and 3, it is clear that BP is much faster than DP under various settings. And when comparing Figure 2 and 3, it is shown that when adding a higher order (more complex model), the gap between BP and DP is widen. Figure 4 shows the speed vs. error trade-off. It is observed that 5 iterations of BP reaches the best speech with lowest error rate. However, note that this comparison was done in a lower-order setting, where the DP approach was still relatively fast.

Accuracy Evaluation: Higher-Order Non-Projective Parsing

In this experiment, the authors attempted to examine whether adding more higher-order features can improve parsing accuracy, under the proposed BP framework. Table 2 shows that by adding "NoCross", "Grand", and "ChildSEQ" features, the system performance significantly outperforms the first-order baseline. Table 2 also shows that even though a hill-climbing variant of DP can improve over the standard DP, but running non-projective BP is much faster and has slightly higher accuracy.

Accuracy Evaluation: Influences of Global Hard Constraints

In this final experiment, the authors investigate the influences of global hard constraints. The Table 3 shows that the idea of using TREE in training is really critical in this work, and global constraints generally improve the overall results.

Related Papers

This paper is related to many papers in three dimensions. First of all, from a natural language parsing perspective, this paper presents a state-of-the-art inference algorithm for dependency parsing. Secondly, from a machine learning and structured prediction point of view, this work is closely related to many other approximation inference algorithms on probabilistic graphical models (e.g. HMMs, CRFs, MRFs, and Bayesian Networks). Finally, the proposed approach might also be applied to other sequential modeling natural language processing tasks, for example, Named Entity Tagging, Parts-of-speech Tagging, and Constituent Parsing. Below shows some of the related papers to this work.