Haghighi and Klein, ACL 2006: Prototype-Driven Learning for Sequence Models
A. Haghighi and D. Klein. Prototype-Driven Learning for Sequence Models, Proceedings of the Human Language Technology Conference of the North American Chapter of the ACL, pp. 320-327, New York, June 2006.
This paper addresses the problem of POS tagging in both English and Chinese, and the problem of field segmentation in the domain of classified advertisements. The latter is also addressed in Grenager et al, ACL 2005.
Model: Markov Random Field
The modeling tool used in this paper is the Markov random field (MRF). This is the generative counterpart of the conditional random field (CRF): an MRF defines a joint distribution over the states and observations, whereas a CRF defines a conditional distribution over the states given the observations.
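In symbols, the contrast can be written as follows. The notation is assumed here for illustration and is not quoted from the paper: x is the observation sequence, y the state sequence, f a feature vector, and θ the weights. The two models share the same score and differ only in what the normalizer sums over.

```latex
\begin{align*}
\text{MRF:}\quad & p_\theta(x, y) = \frac{\exp\big(\theta^\top f(x, y)\big)}{Z(\theta)},
  & Z(\theta) &= \sum_{x', y'} \exp\big(\theta^\top f(x', y')\big) \\
\text{CRF:}\quad & p_\theta(y \mid x) = \frac{\exp\big(\theta^\top f(x, y)\big)}{Z(\theta, x)},
  & Z(\theta, x) &= \sum_{y'} \exp\big(\theta^\top f(x, y')\big)
\end{align*}
```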
For POS tagging, the states are chosen as pairs of POS tags. For field segmentation, the states are field labels, as in Grenager et al, ACL 2005.
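A difficulty that arises in training the MRF is that the sequence length is unconstrained: because the model is joint, its normalizer sums over observation sequences as well as state sequences, and these can be arbitrarily long. The authors set a maximum length and sum over all sequences within this length.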
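A minimal sketch of such a truncated sum, on a toy chain MRF with made-up potentials. The function name, the `trans` and `emit` tables, and the factorization into transition and emission potentials are assumptions for illustration; the paper's actual features and computation may differ.

```python
def truncated_partition(trans, emit, max_len):
    """Sum of unnormalized scores over all (state sequence, word sequence)
    pairs of length 1..max_len for a toy chain MRF.

    trans[a][b]: potential for state a followed by state b (hypothetical)
    emit[a][w]:  potential for state a emitting word w (hypothetical)
    """
    if max_len < 1:
        raise ValueError("max_len must be at least 1")
    states = list(emit)
    # Summing out the word at one position leaves a single mass per state.
    mass = {s: sum(emit[s].values()) for s in states}
    # alpha[s]: total score of all sequences of the current length ending in s.
    alpha = dict(mass)
    total = sum(alpha.values())
    for _ in range(max_len - 1):
        alpha = {
            b: mass[b] * sum(alpha[a] * trans[a][b] for a in states)
            for b in states
        }
        total += sum(alpha.values())
    return total


# Toy potentials: two states, two words.
EMIT = {"A": {"u": 1.0, "v": 2.0}, "B": {"u": 0.5, "v": 0.5}}
TRANS = {"A": {"A": 1.0, "B": 2.0}, "B": {"A": 0.5, "B": 1.0}}
print(truncated_partition(TRANS, EMIT, 2))  # 21.5
```

The cost grows linearly with the maximum length, since each extra position only needs one more pass over the state pairs.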
For decoding, the authors use maximum posterior decoding (at each position choosing the label which has the highest posterior probability, obtained from the forward-backward algorithm) instead of Viterbi decoding. The former is found to be "uniformly but slightly superior" to the latter.
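The difference between the two decoders can be seen on a toy chain model. The sketch below is illustrative only: the function names and the toy scores are assumptions, and the model is a plain first-order chain rather than the paper's MRF.

```python
def forward_backward(init, trans, emit_scores):
    """Per-position posterior marginals for a first-order chain model.

    init[s]: start score of state s; trans[a][b]: transition score;
    emit_scores[i][s]: score of state s at position i (all non-negative).
    """
    states = list(init)
    n = len(emit_scores)
    fwd = [{s: init[s] * emit_scores[0][s] for s in states}]
    for i in range(1, n):
        fwd.append({
            b: emit_scores[i][b] * sum(fwd[-1][a] * trans[a][b] for a in states)
            for b in states
        })
    bwd = [dict.fromkeys(states, 1.0) for _ in range(n)]
    for i in range(n - 2, -1, -1):
        bwd[i] = {
            a: sum(trans[a][b] * emit_scores[i + 1][b] * bwd[i + 1][b]
                   for b in states)
            for a in states
        }
    z = sum(fwd[-1].values())
    return [{s: fwd[i][s] * bwd[i][s] / z for s in states} for i in range(n)]


def posterior_decode(init, trans, emit_scores):
    """Pick the label with the highest posterior at each position."""
    marginals = forward_backward(init, trans, emit_scores)
    return [max(m, key=m.get) for m in marginals]


def viterbi_decode(init, trans, emit_scores):
    """Pick the single highest-scoring label sequence."""
    states = list(init)
    best = {s: init[s] * emit_scores[0][s] for s in states}
    back = []
    for scores in emit_scores[1:]:
        prev = {b: max(states, key=lambda a: best[a] * trans[a][b])
                for b in states}
        best = {b: best[prev[b]] * trans[prev[b]][b] * scores[b]
                for b in states}
        back.append(prev)
    path = [max(best, key=best.get)]
    for prev in reversed(back):
        path.append(prev[path[-1]])
    return path[::-1]


# Toy model: paths AA (0.4), BA (0.3), BB (0.3).
INIT = {"A": 0.4, "B": 0.6}
TRANS = {"A": {"A": 1.0, "B": 0.0}, "B": {"A": 0.5, "B": 0.5}}
EMIT_SCORES = [{"A": 1.0, "B": 1.0}, {"A": 1.0, "B": 1.0}]
print(posterior_decode(INIT, TRANS, EMIT_SCORES))  # ['B', 'A']
print(viterbi_decode(INIT, TRANS, EMIT_SCORES))    # ['A', 'A']
```

In this toy case the best single path is AA, yet B is the more probable label at the first position once all paths are taken into account. Posterior decoding optimizes expected per-token accuracy, which matches the evaluation criterion used in the paper.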
As pointed out in Grenager et al, ACL 2005, purely unconstrained unsupervised learning does not work well, because training documents contain multiple levels of structure. As a solution, this paper advocates prototype-driven learning. This can be considered a form of semi-supervised learning, but unlike conventional semi-supervised learning, where a portion of the training documents is fully labeled, prototype-driven learning is given only a list of "prototype words" for each label. This requires less human effort than conventional semi-supervised learning.
Two ways to use the prototype words are discussed. The first is relatively simple: assign to the prototype words their respective labels in the training data. While this increases the overall accuracy significantly, it improves the accuracy on non-prototype words only slightly. This indicates that "the prototype information is not spreading to non-prototype words."
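The first approach can be sketched as a constraint on which labels each training token may take. The function name and the toy prototype list below are hypothetical, and how the constraint is enforced inside training is left out.

```python
def allowed_labels(tokens, prototypes, all_labels):
    """For each token, list the labels the model may assign during training.

    prototypes maps each label to its list of prototype words (toy data here).
    A prototype word is pinned to its label; other words stay unconstrained.
    """
    word_to_label = {
        word: label for label, words in prototypes.items() for word in words
    }
    return [
        [word_to_label[tok]] if tok in word_to_label else list(all_labels)
        for tok in tokens
    ]


PROTOTYPES = {"DET": ["the"], "NN": ["president"]}
LABELS = ["DET", "JJ", "NN"]
print(allowed_labels(["the", "new", "president"], PROTOTYPES, LABELS))
# [['DET'], ['DET', 'JJ', 'NN'], ['NN']]
```

The word "new" receives no constraint at all, which is consistent with the observation that fixing labels alone does little for non-prototype words.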
To let non-prototype words benefit from similar prototype words, a "distributional similarity feature" is devised based on word context. Words whose context distribution is similar to that of a prototype word z activate a feature "PROTO = z", so they are "pushed toward" the label of the prototype word. This significantly boosts the accuracy both overall and for non-prototype words. The similarity feature is designed differently to capture the different levels of desired structure: low-level for POS tagging vs high-level for field segmentation.
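A minimal sketch of how such features could be computed, assuming raw counts of immediate left and right neighbors as the context representation, cosine similarity as the measure, and an arbitrary threshold. All of these choices, and every name in the code, are assumptions for illustration; the paper's actual context representation differs between tasks.

```python
import math
from collections import Counter, defaultdict


def context_vectors(sentences):
    """Count the immediate left and right neighbors of every word type."""
    vectors = defaultdict(Counter)
    for sent in sentences:
        padded = ["<s>"] + list(sent) + ["</s>"]
        for i in range(1, len(padded) - 1):
            vectors[padded[i]]["L=" + padded[i - 1]] += 1
            vectors[padded[i]]["R=" + padded[i + 1]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def proto_features(word, prototype_words, vectors, threshold=0.35):
    """Return the 'PROTO = z' features that `word` activates.

    The threshold value is an arbitrary choice for this sketch.
    """
    if word not in vectors:
        return []
    return sorted(
        "PROTO = " + z
        for z in prototype_words
        if z in vectors and cosine(vectors[word], vectors[z]) >= threshold
    )


CORPUS = [["the", "cat", "sleeps"], ["the", "dog", "sleeps"], ["a", "dog", "runs"]]
VECTORS = context_vectors(CORPUS)
print(proto_features("cat", ["dog", "sleeps"], VECTORS))  # ['PROTO = dog']
```

Here "cat" shares the contexts "the _" and "_ sleeps" with the prototype "dog", so it activates "PROTO = dog", while it shares no context with "sleeps".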
- For English POS tagging: Penn Treebank English WSJ (Test set contains 193K tokens, 8K sentences)
- For Chinese POS tagging: Penn Treebank Chinese (Test set contains 60K tokens)
- For field segmentation: Classified advertisements for apartment rental on Craigslist (See Grenager et al, ACL 2005)
The criterion used is per-token accuracy.
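Per-token accuracy is the fraction of tokens whose predicted label matches the gold label, pooled over all sequences. A small sketch, with a hypothetical function name and toy data:

```python
def per_token_accuracy(gold, predicted):
    """Fraction of tokens labeled correctly, pooled over all sequences."""
    correct = 0
    total = 0
    for gold_seq, pred_seq in zip(gold, predicted):
        if len(gold_seq) != len(pred_seq):
            raise ValueError("gold and predicted sequences differ in length")
        correct += sum(g == p for g, p in zip(gold_seq, pred_seq))
        total += len(gold_seq)
    return correct / total if total else 0.0


GOLD = [["DT", "NN"], ["VB", "DT", "NN"]]
PRED = [["DT", "NN"], ["NN", "DT", "JJ"]]
print(per_token_accuracy(GOLD, PRED))  # 0.6
```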
The following table reports the subset of the results most closely related to the techniques introduced above.
|English POS Tagging||Chinese POS Tagging|
|PROTO + SIM||80.5||67.8||57.4||71.5|
- BASELINE denotes an MRF with some token and label features but without prototypes.
- PROTO denotes only fixing the labels of prototype words.
- PROTO + SIM denotes additionally incorporating the distributional similarity features.