Berg-Kirkpatrick et al, ACL 2010: Painless Unsupervised Learning with Features

Citation

T. Berg-Kirkpatrick, A. Bouchard-Côté, J. DeNero, and D. Klein. Painless Unsupervised Learning with Features, Human Language Technologies 2010, pp. 582-590, Los Angeles, June 2010.

Online Version

PDF version

Summary

This paper generalizes conventional HMMs to featurized HMMs, by replacing the multinomial conditional probability distributions (CPDs) with miniature log-linear models. Two algorithms for unsupervised training of featurized HMMs are proposed.

Featurized HMMs are applied to four unsupervised learning tasks:

  • POS induction (unsupervised version of POS tagging);
  • Grammar induction;
  • Word alignment;
  • Word segmentation.

On all four tasks, featurized HMMs are shown to outperform their unfeaturized counterparts by a substantial margin.

Featurized HMMs

Definition

The definition of featurized HMMs is given in the context of POS tagging, but is applicable to the general case. In the following text, $\mathbf{X} = X_1 \cdots X_n$ stands for the random sequence of words, and $\mathbf{Y} = Y_1 \cdots Y_n$ stands for the random sequence of POS tags. Their lower-case counterparts $\mathbf{x}$ and $\mathbf{y}$ stand for particular values of the words and POS tags.

Assuming a special starting tag $y_0$, an HMM defines a joint probability over $\mathbf{X}$ and $\mathbf{Y}$:

  $P(\mathbf{X} = \mathbf{x}, \mathbf{Y} = \mathbf{y}) = \prod_{i=1}^{n} P(y_i \mid y_{i-1}) \, P(x_i \mid y_i)$

While a conventional HMM defines the transition probabilities $P(y_i \mid y_{i-1})$ and emission probabilities $P(x_i \mid y_i)$ as multinomial distributions, a featurized HMM defines these probabilities in terms of a collection of features $\mathbf{f}$ and their weights $\mathbf{w}$:

  $P_{\mathbf{w}}(y_i \mid y_{i-1}) = \dfrac{\exp \mathbf{w}^\top \mathbf{f}(y_i, y_{i-1}, \mathrm{TRANS})}{\sum_{y'} \exp \mathbf{w}^\top \mathbf{f}(y', y_{i-1}, \mathrm{TRANS})}$

  $P_{\mathbf{w}}(x_i \mid y_i) = \dfrac{\exp \mathbf{w}^\top \mathbf{f}(x_i, y_i, \mathrm{EMIT})}{\sum_{x'} \exp \mathbf{w}^\top \mathbf{f}(x', y_i, \mathrm{EMIT})}$

For conciseness, the tuples $(y_i, y_{i-1}, \mathrm{TRANS})$ and $(x_i, y_i, \mathrm{EMIT})$ are written as $(d, c, t)$ (standing for decision, context, and type respectively):

  $\theta_{d,c,t}(\mathbf{w}) = \dfrac{\exp \mathbf{w}^\top \mathbf{f}(d, c, t)}{\sum_{d'} \exp \mathbf{w}^\top \mathbf{f}(d', c, t)}$
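
The following minimal Python sketch shows how such a local log-linear CPD can be computed. The feature function, the toy tag inventory, and the names used here (feats, score, theta) are illustrative assumptions, not code or features from the paper.

  import math
  from collections import defaultdict

  def feats(d, c, t):
      """Feature vector f(d, c, t) as a sparse dict: a basic indicator
      feature plus one example coarse feature (both illustrative)."""
      f = {("indicator", d, c, t): 1.0}
      if t == "EMIT" and d[:1].isupper():
          f[("is-capitalized", c, t)] = 1.0
      return f

  def score(w, d, c, t):
      return sum(w.get(name, 0.0) * val for name, val in feats(d, c, t).items())

  def theta(w, d, c, t, decisions):
      """theta_{d,c,t}(w) = exp(w.f(d,c,t)) / sum_{d'} exp(w.f(d',c,t))."""
      z = sum(math.exp(score(w, dp, c, t)) for dp in decisions)
      return math.exp(score(w, d, c, t)) / z

  # Example: transition probability P_w(y_i = "NOUN" | y_{i-1} = "DET").
  tags = ["DET", "NOUN", "VERB"]
  w = defaultdict(float)                          # all-zero weights -> uniform CPDs
  print(theta(w, "NOUN", "DET", "TRANS", tags))   # 1/3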

A conventional HMM is a special case of a featurized HMM (as long as no zero transition or emission probability is involved). We get a conventional HMM if we:

  • Use the set of "basic" features, i.e. indicator functions of all tuples $(y_i, y_{i-1}, \mathrm{TRANS})$ and $(x_i, y_i, \mathrm{EMIT})$ (equivalently, of all $(d, c, t)$ tuples);
  • Set the weight of each such feature to the logarithm of the corresponding transition or emission probability.
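
As a quick numerical illustration of this special case, the sketch below (toy probabilities, not from the paper) sets the weight of each basic indicator feature to the log of a multinomial emission probability and checks that the log-linear CPD reproduces that multinomial exactly.

  import math

  multinomial = {"the": 0.6, "a": 0.3, "an": 0.1}        # toy P(word | tag = DET)
  w = {d: math.log(p) for d, p in multinomial.items()}   # basic-feature weights

  def theta(d, w, decisions):
      """Log-linear CPD restricted to basic indicator features."""
      z = sum(math.exp(w[dp]) for dp in decisions)
      return math.exp(w[d]) / z

  for d in multinomial:
      assert abs(theta(d, w, multinomial) - multinomial[d]) < 1e-12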

The Estimation Problem

The estimation problem involves finding the marginal probability $P_{\mathbf{w}}(\mathbf{X} = \mathbf{x}) = \sum_{\mathbf{y}} P_{\mathbf{w}}(\mathbf{X} = \mathbf{x}, \mathbf{Y} = \mathbf{y})$ for a given word sequence $\mathbf{x}$, with the model parameters $\mathbf{w}$ known.

The forward-backward algorithm for conventional HMMs is directly applicable to featurized HMMs.
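
Below is a minimal sketch of the forward pass for the estimation problem. It consumes only the conditional probabilities $P(y_i \mid y_{i-1})$ and $P(x_i \mid y_i)$, so it is indifferent to whether they come from multinomials or from the log-linear parameterization above; the probability tables, tag set, and example sentence are toy values, not the paper's.

  def forward_marginal(x, tags, start, trans, emit):
      """Return P(X = x) = sum over tag sequences of the joint probability."""
      # alpha[y] = P(x_1 .. x_i, y_i = y)
      alpha = {y: trans[start][y] * emit[y].get(x[0], 0.0) for y in tags}
      for xi in x[1:]:
          alpha = {y: emit[y].get(xi, 0.0) *
                      sum(alpha[yp] * trans[yp][y] for yp in tags)
                   for y in tags}
      return sum(alpha.values())

  tags = ["DET", "NOUN"]
  trans = {"<s>": {"DET": 0.7, "NOUN": 0.3},
           "DET": {"DET": 0.1, "NOUN": 0.9},
           "NOUN": {"DET": 0.4, "NOUN": 0.6}}
  emit = {"DET": {"the": 0.9, "dog": 0.1},
          "NOUN": {"the": 0.1, "dog": 0.9}}
  print(forward_marginal(["the", "dog"], tags, "<s>", trans, emit))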

The Decoding Problem

The decoding problem involves finding the most probable tag sequence $\mathbf{y}^*$ given the word sequence $\mathbf{x}$, with the model parameters $\mathbf{w}$ known:

  $\mathbf{y}^* = \arg\max_{\mathbf{y}} P_{\mathbf{w}}(\mathbf{Y} = \mathbf{y} \mid \mathbf{X} = \mathbf{x})$

The Viterbi algorithm for conventional HMMs is directly applicable to featurized HMMs.
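
A matching sketch of Viterbi decoding follows; like the forward pass, it operates only on the already-computed conditional probabilities, so nothing changes relative to a conventional HMM. The tables are the same toy values as in the estimation sketch.

  def viterbi(x, tags, start, trans, emit):
      """Return the most probable tag sequence for the word sequence x."""
      # delta[y] = probability of the best tag prefix ending in tag y
      delta = {y: trans[start][y] * emit[y].get(x[0], 0.0) for y in tags}
      backptr = []
      for xi in x[1:]:
          prev, delta, bp = delta, {}, {}
          for y in tags:
              best = max(tags, key=lambda yp: prev[yp] * trans[yp][y])
              bp[y] = best
              delta[y] = prev[best] * trans[best][y] * emit[y].get(xi, 0.0)
          backptr.append(bp)
      # Follow back-pointers from the best final tag.
      y = max(tags, key=lambda yf: delta[yf])
      path = [y]
      for bp in reversed(backptr):
          y = bp[y]
          path.append(y)
      return list(reversed(path))

  tags = ["DET", "NOUN"]
  trans = {"<s>": {"DET": 0.7, "NOUN": 0.3},
           "DET": {"DET": 0.1, "NOUN": 0.9},
           "NOUN": {"DET": 0.4, "NOUN": 0.6}}
  emit = {"DET": {"the": 0.9, "dog": 0.1},
          "NOUN": {"the": 0.1, "dog": 0.9}}
  print(viterbi(["the", "dog"], tags, "<s>", trans, emit))   # ['DET', 'NOUN']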

The Training Problem

The training problem involves finding the model parameters $\mathbf{w}$ that maximize the marginal likelihood of the (unlabeled) training data:

  $\mathbf{w}^* = \arg\max_{\mathbf{w}} \sum_{\mathbf{x} \in \mathcal{D}} \log P_{\mathbf{w}}(\mathbf{X} = \mathbf{x})$

The two training algorithms proposed in the paper are a feature-enhanced version of EM, in which the M-step optimizes the feature weights with a gradient-based method (L-BFGS), and direct gradient-based optimization of the log marginal likelihood itself.
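
The sketch below illustrates the gradient-based M-step of the feature-enhanced EM variant: given expected counts $e(d, c, t)$ from an E-step, it fits the weights by maximizing $\sum_{d,c,t} e(d,c,t) \log \theta_{d,c,t}(\mathbf{w})$ with L-BFGS. The feature map (basic indicators only), the toy expected counts, and the absence of any regularization term are simplifying assumptions for illustration.

  import numpy as np
  from scipy.optimize import minimize

  # Toy setup: emission decisions for two tag contexts.
  decisions = ["the", "a", "dog"]
  contexts = [("DET", "EMIT"), ("NOUN", "EMIT")]
  features = [(d, c, t) for (c, t) in contexts for d in decisions]   # basic features
  index = {f: i for i, f in enumerate(features)}

  def feat_vec(d, c, t):
      v = np.zeros(len(features))
      v[index[(d, c, t)]] = 1.0            # basic indicator feature only
      return v

  # Toy expected counts e(d, c, t), as if produced by forward-backward.
  e = {("the", "DET", "EMIT"): 5.0, ("a", "DET", "EMIT"): 2.0, ("dog", "DET", "EMIT"): 0.1,
       ("the", "NOUN", "EMIT"): 0.2, ("a", "NOUN", "EMIT"): 0.1, ("dog", "NOUN", "EMIT"): 4.0}

  def neg_objective_and_grad(w):
      """Negated expected complete log-likelihood sum_{d,c,t} e(d,c,t) log theta_{d,c,t}(w)."""
      obj, grad = 0.0, np.zeros_like(w)
      for (c, t) in contexts:
          fs = np.array([feat_vec(d, c, t) for d in decisions])   # |decisions| x |features|
          scores = fs @ w
          log_z = np.logaddexp.reduce(scores)
          theta = np.exp(scores - log_z)
          counts = np.array([e[(d, c, t)] for d in decisions])
          obj += counts @ (scores - log_z)
          # Gradient: sum_d e(d,c,t) * (f(d,c,t) - sum_{d'} theta_{d'} f(d',c,t))
          grad += counts @ fs - counts.sum() * (theta @ fs)
      return -obj, -grad

  w0 = np.zeros(len(features))
  result = minimize(neg_objective_and_grad, w0, jac=True, method="L-BFGS-B")
  print(np.round(result.x, 3))

With only the basic indicator features, the fitted $\theta$ values recover the normalized expected counts, which is exactly the standard multinomial M-step; richer features are what make the gradient-based M-step worthwhile.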

Experiments

POS Induction

Grammar Induction

Word Alignment

Word Segmentation