Dreyer and Eisner, EMNLP 2006
Better Informed Training of Latent Syntactic Features
This paper can be found at: [1]
Citation
Markus Dreyer and Jason Eisner. Better informed training of latent syntactic features. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 317–326, Sydney, Australia, 2006. DOI: 10.3115/1610075.1610120
Summary
The authors wanted to improve parsing accuracy by introducing hidden features. It would seem that linguistic features (such as number, gender, etc.) should help constrain the parse trees, but the treebank does not annotate them. So the authors used a modified EM algorithm with simulated annealing to find the feature values and then construct rules that took advantage of them. The main contribution is an attempt to improve upon the work of Matsuzaki et al. by reducing the number of degrees of freedom to be learned, so that the syntactic features can take on a greater range of values; the new method allows less freedom in learning the transfer probabilities. The end result was no improvement over the previous work.
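The summary does not spell out the training loop, but the general shape of EM combined with an annealing schedule looks roughly like the Python sketch below. This is a generic illustration only, assuming a temperature that flattens the E-step posteriors and is gradually lowered (a deterministic-annealing-style smoothing rather than classic stochastic simulated annealing); the names `annealed_em`, `e_step`, `m_step`, and the cooling schedule are made up and are not the paper's actual procedure.

```python
import numpy as np

def annealed_em(e_step, m_step, params, n_iters=50, t_start=2.0, t_end=1.0):
    """Run EM while lowering a temperature that flattens the E-step posteriors,
    so early iterations explore more latent-feature assignments before committing."""
    for i in range(n_iters):
        # Linear cooling schedule from t_start down to t_end (hypothetical choice).
        t = t_start + (t_end - t_start) * i / max(1, n_iters - 1)
        post = e_step(params)                          # posteriors over latent feature values
        post = post ** (1.0 / t)                       # flatten the distribution while t > 1
        post = post / post.sum(axis=-1, keepdims=True) # renormalize each decision
        params = m_step(post)                          # re-estimate probabilities from soft counts
    return params
```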
Previous work
As they say in the paper, "treebanks never contain enough information". A lot of parsing work had already been done on splitting nonterminals so that the important nonterminals could be trained better, but the splitting was mostly done in an ad-hoc fashion until Matsuzaki et al. (2005) introduced PCFG-LA (probabilistic context-free grammar with latent annotations). Dreyer and Eisner wanted to incorporate features that propagate only in "linguistically-motivated ways"; by restricting propagation this way, they avoided having to learn far too many weights. A PCFG-LA works the same way as a PCFG, but it has many more rules: a regular PCFG rule such as $X \rightarrow Y\, Z$ is represented in PCFG-LA as $X[a] \rightarrow Y[b]\, Z[c]$, where $a$, $b$, and $c$ are the feature values (integers).
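To make the blow-up concrete, here is a toy Python sketch (not from the paper): if each nonterminal carries one latent feature value in 0..L-1, a single binary rule expands into L^3 annotated rules. The helper name `annotate_rule` is made up for illustration.

```python
from itertools import product

def annotate_rule(lhs, rhs, num_values):
    """Enumerate every latent-annotated version of one binary PCFG rule."""
    annotated = []
    for a, b, c in product(range(num_values), repeat=3):
        annotated.append((f"{lhs}[{a}]", (f"{rhs[0]}[{b}]", f"{rhs[1]}[{c}]")))
    return annotated

rules = annotate_rule("S", ("NP", "VP"), num_values=2)
print(len(rules))   # 8 annotated copies of the single rule S -> NP VP when L = 2
print(rules[0])     # ('S[0]', ('NP[0]', 'VP[0]'))
```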
Improvements in Paper
The main idea in the paper was to reduce the number of weights that need to be learned. This is unequivocally a good thing, since it means either that the model is smaller or that you can train more weights for the same price. The authors decided that there were too many transfer probabilities. Instead of calculating a separate weight for every combination of a parent value $a$ with child values $b$ and $c$, the probabilities they calculated were of the form $P(b \mid Y)$: the probability of annotating the nonterminal $Y$ with feature value $b$, given that $Y$ did not inherit its feature value from its parent. This greatly reduces the number of probabilities to estimate, since many of the original weights are tied together.
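As a rough illustration of how this cuts down the parameter count, here is a hedged Python sketch. It assumes the annotated-rule probability factors into the original PCFG rule probability times, for each child, either inheriting the parent's feature value or drawing a fresh value from a per-nonterminal distribution; the names `p_inherit` and `p_ann` are hypothetical, and the paper's exact parameterization may differ.

```python
# Hedged sketch: assumes P(X[a] -> Y[b] Z[c]) factors as
#   P(X -> Y Z) * f(b | a, Y) * f(c | a, Z),
# where each child either inherits the parent's value a or draws a fresh value
# from p_ann(. | child). The names p_inherit and p_ann are hypothetical.

def child_feature_prob(b, a, child, p_inherit, p_ann):
    """P(child gets feature value b | parent has feature value a)."""
    inherit = p_inherit[child] if b == a else 0.0
    fresh = (1.0 - p_inherit[child]) * p_ann[child][b]
    return inherit + fresh

def annotated_rule_prob(a, b, c, base_prob, Y, Z, p_inherit, p_ann):
    """Probability of the annotated rule X[a] -> Y[b] Z[c]."""
    return (base_prob
            * child_feature_prob(b, a, Y, p_inherit, p_ann)
            * child_feature_prob(c, a, Z, p_inherit, p_ann))

# With L feature values, full PCFG-LA needs on the order of L**3 weights per
# binary rule; this factorization needs roughly L weights per nonterminal plus
# one inherit probability, so a larger L becomes affordable.
p_inherit = {"NP": 0.7, "VP": 0.7}
p_ann = {"NP": [0.5, 0.5], "VP": [0.4, 0.6]}
print(annotated_rule_prob(0, 0, 1, base_prob=0.3, Y="NP", Z="VP",
                          p_inherit=p_inherit, p_ann=p_ann))  # 0.3 * 0.85 * 0.18
```

Even in this toy setup the savings are visible: with L = 8, a single binary rule would need 512 full transfer weights, versus about 8 annotation weights per nonterminal plus one inherit probability under the factored scheme.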