Headden et al. NAACL 09

Improving Unsupervised Dependency Parsing with Richer Contexts and Smoothing, by W. P. Headden III, M. Johnson, D. McClosky. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, 2009.

This paper is available online [1].

Summary

This paper improves unsupervised dependency parsing by introducing basic valence frames and lexical information. Smoothing is used to take advantage of this additional information. The resulting model improves on previous work in unsupervised (dependency) grammar induction by 10 percentage points.

Brief description of the method

The paper builds upon the Dependency Model with Valence by Klein and Manning (2004). The DMV is a generative model in which the head of a sentence is generated and then each head recursively generates its left and right dependents. The arguments of the head in a certain direction are generated repeatedly by deciding whether to generate a new argument or to stop.
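The generative story above can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions: the tag set, the probability tables (ROOT_PROBS, STOP_PROBS, ARG_PROBS), and the function names are hypothetical and are not taken from the paper. The stop decision is conditioned on the head, the direction, and whether the head has already generated a dependent in that direction; the argument choice is conditioned on the head and direction only.

```python
import random

# Hypothetical toy parameters over POS tags (illustrative values only,
# not estimates from the paper).
ROOT_PROBS = {"VERB": 1.0}

# P(stop | head, direction, adjacent). "adjacent" is True while the head
# has not yet generated any dependent in that direction.
STOP_PROBS = {
    (tag, d, adj): 1.0
    for tag in ("DET", "ADJ")
    for d in ("left", "right")
    for adj in (True, False)
}
STOP_PROBS.update({
    ("VERB", "left", True): 0.2, ("VERB", "left", False): 0.9,
    ("VERB", "right", True): 0.4, ("VERB", "right", False): 0.9,
    ("NOUN", "left", True): 0.3, ("NOUN", "left", False): 0.6,
    ("NOUN", "right", True): 1.0, ("NOUN", "right", False): 1.0,
})

# P(argument | head, direction): the same distribution for every argument
# in a given direction, regardless of order.
ARG_PROBS = {
    ("VERB", "left"): {"NOUN": 1.0},
    ("VERB", "right"): {"NOUN": 1.0},
    ("NOUN", "left"): {"DET": 0.5, "ADJ": 0.5},
}


def sample(dist, rng):
    """Draw one outcome from a {outcome: probability} dictionary."""
    items = list(dist.items())
    r = rng.random()
    total = 0.0
    for outcome, p in items:
        total += p
        if r < total:
            return outcome
    return items[-1][0]  # guard against floating-point rounding


def generate(head, rng, depth=0, max_depth=10):
    """Recursively generate (head, left_dependents, right_dependents)."""
    deps = {"left": [], "right": []}
    for direction in ("left", "right"):
        adjacent = True
        while True:
            p_stop = STOP_PROBS[(head, direction, adjacent)]
            # Decide whether to stop or to generate another argument.
            if depth >= max_depth or rng.random() < p_stop:
                break
            arg = sample(ARG_PROBS[(head, direction)], rng)
            deps[direction].append(generate(arg, rng, depth + 1, max_depth))
            adjacent = False
    return (head, deps["left"], deps["right"])


def yield_of(tree):
    """Read off the generated tag sequence in surface order."""
    head, left, right = tree
    tags = []
    # Left dependents are generated outward from the head, so the nearest
    # one comes first in the list; reverse it to get surface order.
    for child in reversed(left):
        tags.extend(yield_of(child))
    tags.append(head)
    for child in right:
        tags.extend(yield_of(child))
    return tags


rng = random.Random(0)
tree = generate(sample(ROOT_PROBS, rng), rng)
print(yield_of(tree))
```

Each call to generate corresponds to one head generating its dependents in both directions, so a whole sentence is produced by a single call on the sampled root.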

The dependency models used in the paper are framed as split-head bilexical CFGs (Eisner and Satta, 1999), which have a fast parsing algorithm for computing the expectations required by Variational Bayes.

Enriching contexts with argument order

DMV uses the same distribution over arguments regardless of the order in which they are generated. The model used in the paper, the Extended Valence Grammar (EVG), distinguishes the distribution over the argument nearest to the head from the distribution over subsequent arguments. For instance, in the phrase "the big dog", we would expect the distribution for the nearest argument "big" to differ from that of a further argument "the". In the figure below, we see that this is captured by using different nonterminals for the nearest and further arguments.
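Separately from the paper's figure, the difference in conditioning context can be sketched as follows. This is a minimal illustration, not the paper's grammar: the function names and the "nearest"/"further" labels are hypothetical, and the sketch only shows which information each model's argument distribution is allowed to see.

```python
def dmv_arg_context(head_tag, direction, num_args_so_far):
    """DMV: the argument distribution ignores how many arguments came before."""
    return (head_tag, direction)


def evg_arg_context(head_tag, direction, num_args_so_far):
    """EVG: the nearest argument gets its own distribution."""
    position = "nearest" if num_args_so_far == 0 else "further"
    return (head_tag, direction, position)


# "the big dog": the head "dog" (NOUN) takes two left arguments.
# Counting outward from the head, "big" (ADJ) is nearest, then "the" (DET).
left_args = ["ADJ", "DET"]

for i, arg in enumerate(left_args):
    print(arg, dmv_arg_context("NOUN", "left", i), evg_arg_context("NOUN", "left", i))
```

Under the DMV-style context both arguments are drawn from one distribution keyed by (NOUN, left), whereas the EVG-style context gives "big" and "the" separate distributions.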

Lexicalization

The lexicalized model, L-EVG, incorporates lexical information by extending the EVG CFG so that nonterminals can be annotated with both the word and the POS tag of the head.

Smoothing

EVG smooths its parameters by linear interpolation. The authors represent linear interpolation in their PCFG using tied rule probabilities. The smoothing weights are controlled by setting the Dirichlet hyperparameters of the tied PCFG. Setting a larger hyperparameter for the backoff distribution's "rule" biases the model toward backing off at first; only after seeing a sufficiently large number of examples does the model start to ignore the backoff distribution.

The authors' method of combining linear interpolation with a Bayesian prior results in an augmented PCFG that is essentially still a PCFG, making it amenable to standard estimation techniques.
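The effect of the prior on the smoothing weight can be illustrated with a small sketch. It makes several simplifying assumptions that are not from the paper: it uses the posterior mean of a two-outcome Dirichlet (Beta) prior instead of the Variational Bayes estimate, it credits every observation to the non-backoff "rule", and the hyperparameter value 20 and all function names are hypothetical.

```python
def interpolate(p_specific, p_backoff, lambda_backoff):
    """Linear interpolation: (1 - lambda) * specific + lambda * backoff."""
    return (1.0 - lambda_backoff) * p_specific + lambda_backoff * p_backoff


def backoff_weight(n_specific, beta_backoff, beta_specific=1.0, n_backoff=0.0):
    """Posterior-mean weight of the backoff 'rule' under a Beta prior.

    Assumption for illustration: counts are given directly, and by default
    all observations are credited to the non-backoff rule.
    """
    numerator = beta_backoff + n_backoff
    denominator = beta_backoff + beta_specific + n_backoff + n_specific
    return numerator / denominator


# With a large hyperparameter on the backoff rule, the weight starts close
# to 1 and shrinks as more examples of the specific context are seen.
for n in (0, 10, 100, 1000):
    lam = backoff_weight(n, beta_backoff=20.0)
    print(n, round(lam, 3), round(interpolate(0.8, 0.2, lam), 3))
```

The interpolated probability moves from the backoff estimate toward the context-specific estimate as the count grows, which is the behaviour described above for a large backoff hyperparameter.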

Experimental Results

The authors trained on the standard Penn Treebank WSJ corpus. Evaluating against the gold standard dependencies in section 23, they obtained the following results.

Using just smoothing, there was a large improvement over the baseline DMV model.

The model was also able to learn the most likely argument types for different valences and directions.