# Koo et al, ACL 2008

Simple semi-supervised dependency parsing is a paper by Koo, Carreras, and Collins online at [www.cs.columbia.edu/~mcollins/papers/koo08acl.pdf].

## Citation

Terry Koo, Xavier Carreras, and Michael Collins. Simple semi-supervised dependency parsing. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 595-603, Columbus, Ohio, USA, June 2008.

## Summary

The authors's experiment is to train a dependency parser in English and Czech. Their improvement is to use word clusters as an additional feature. The word clusters are drawn from the BLLIP corpus, which uses the Brown clustering algorithm. They examine (1) the added performance by adding word clusters as features and (2) the amount of of data is needed to achieve the same accuracy without the word cluster features. It turns out that with the cluster features, the parser does statistically significantly better. They also observe that one only needs about half the data when using cluster features to achieve the same accuracy.

## Clustering

The use the Brown clustering algorithm to group words into groups. The result is a mapping from words to bit-strings where the closer the bit-strings, the closer the words are. With the bit-strings, we can decide how many clusters to allow by only considering the first k bits. If the tree were perfectly balanced, that would give ${\displaystyle 2^{k}}$ clusters. For the English parser, they use a word clusters from the BLLIP corpus; for the Czech parser they cluster they make their own clusters.

## Experiment

They trained their parsers using average perceptron. They devote a fair bit of work to choosing the appropriate features. It turns out that choosing the right length of the prefix was very important to getting good features. It turns out that using an average length prefix (somewhere in the range of 12 to 20 bits) did not produce good features. Instead, they settled on two different types of features that relied on the word clusters: one short bit prefix (4-6 bits; they said it overlapped with the part of speech features) and longest possible bit strings. One thing they found was that combining part of speech features with short bit prefix features was helpful as well. They hypothesized this could be because the clustering algorithm wasn't able to find syntax similarities in words.

The experiment was a success. Using word cluster features, the authors improved their baseline from 92.02% to 93.16% for English and from 86.13% to 87.13% in Czech. As noted in the summary, they found that the word cluster features significantly improved performance for all sizes of the training set. They also noted that for the same accuracy, one only had to train the word cluster parser on half the data as a non-word cluster parser.

Next, they investigated the part of speech tagger. They used MXPOST to tag the part of speech as a feature to the parser. When they began to reduce the training data, tried also reducing the training data that the part of speech tagger could run on. It turns out that reducing the amount of training data the part of speech tagger had had little effect on the improvement the word clustering features were able to give. It turns out that the parser with limited training data for the POS tagger and word clustering was about as good as the parser with full training data for the POS tagger and no word clustering features.

## Related Work

According to the authors, their paper was heavily influenced by Miller et al, HLT 2004, which used word clustering to improve a named entity recognizer. Wang et al, IWPT 2005 does similar work with word clustering, but they use a generative model instead of a discriminative one.