Featurized HMM


Featurized HMMs generalize conventional HMMs by replacing the multinomial conditional probability distributions (CPDs) with miniature log-linear models. They were proposed in Berg-Kirkpatrick et al., ACL 2010.

Definition

In the following text, $\mathbf{X} = X_1 \ldots X_n$ stands for a random observation sequence, and $\mathbf{Y} = Y_1 \ldots Y_n$ stands for a random state sequence. Their lower-case counterparts $\mathbf{x}$ and $\mathbf{y}$ stand for possible values of the observations and states.

Assuming a special starting state $Y_0 = y_0$, an HMM defines a joint probability over $\mathbf{X}$ and $\mathbf{Y}$:

$$P(\mathbf{X} = \mathbf{x}, \mathbf{Y} = \mathbf{y}) = \prod_{i=1}^{n} P(Y_i = y_i \mid Y_{i-1} = y_{i-1}) \cdot P(X_i = x_i \mid Y_i = y_i)$$
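As a minimal illustration of this factorization, here is a Python sketch using a hypothetical two-state model (the state names and probabilities below are made up; the "start" table plays the role of transitions out of the special starting state):

```python
# Joint probability of an (observation sequence, state sequence) pair under an HMM:
# the product of transition and emission probabilities along the sequences.
def joint_probability(xs, ys, start, trans, emit):
    prob = start[ys[0]] * emit[ys[0]][xs[0]]            # P(y_1 | y_0) * P(x_1 | y_1)
    for i in range(1, len(xs)):
        prob *= trans[ys[i - 1]][ys[i]] * emit[ys[i]][xs[i]]
    return prob

# Hypothetical two-state model.
start = {"A": 0.6, "B": 0.4}
trans = {"A": {"A": 0.7, "B": 0.3}, "B": {"A": 0.4, "B": 0.6}}
emit = {"A": {"x": 0.5, "y": 0.5}, "B": {"x": 0.1, "y": 0.9}}
print(joint_probability(["x", "y"], ["A", "B"], start, trans, emit))   # 0.6*0.5*0.3*0.9
```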
While a conventional HMM defines the transition probabilities $P(Y_i = y \mid Y_{i-1} = y')$ and emission probabilities $P(X_i = x \mid Y_i = y)$ as multinomial distributions, a featurized HMM defines these probabilities in terms of a collection of feature functions $\mathbf{f}$ and their weights $\mathbf{w}$:

$$P(Y_i = y \mid Y_{i-1} = y', \mathbf{w}) = \frac{\exp\left(\mathbf{w}^\top \mathbf{f}(y, y', \mathrm{TRANS})\right)}{\sum_{y''} \exp\left(\mathbf{w}^\top \mathbf{f}(y'', y', \mathrm{TRANS})\right)}$$

$$P(X_i = x \mid Y_i = y, \mathbf{w}) = \frac{\exp\left(\mathbf{w}^\top \mathbf{f}(x, y, \mathrm{EMIT})\right)}{\sum_{x'} \exp\left(\mathbf{w}^\top \mathbf{f}(x', y, \mathrm{EMIT})\right)}$$
For conciseness, the tuples $(y, y', \mathrm{TRANS})$ and $(x, y, \mathrm{EMIT})$ are both written as $(d, c, t)$ (standing for decision, context, and type respectively):

$$P(d \mid c, \mathbf{w}) = \frac{\exp\left(\mathbf{w}^\top \mathbf{f}(d, c, t)\right)}{\sum_{d'} \exp\left(\mathbf{w}^\top \mathbf{f}(d', c, t)\right)}$$

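The following Python sketch computes such a log-linear CPD as a softmax over feature scores. The feature function and weights are hypothetical toy examples, not features from the original paper:

```python
import math

def loglinear_cpd(decision, context, cpd_type, decisions, features, weights):
    """P(decision | context, w): softmax of w . f(d, c, t) over all candidate decisions."""
    def score(d):
        # Sum of weights for all features that fire on the tuple (d, context, cpd_type).
        return sum(weights.get(name, 0.0) * value
                   for name, value in features(d, context, cpd_type).items())
    scores = {d: score(d) for d in decisions}
    max_s = max(scores.values())                           # for numerical stability
    exp_scores = {d: math.exp(s - max_s) for d, s in scores.items()}
    z = sum(exp_scores.values())
    return exp_scores[decision] / z

# Hypothetical toy feature function for an emission CPD: an indicator of the exact
# (word, tag) pair plus a coarser feature that fires when the word is capitalized.
def toy_features(word, tag, cpd_type):
    feats = {f"{cpd_type}:{tag}->{word}": 1.0}
    if word[0].isupper():
        feats[f"{cpd_type}:{tag}->CAPITALIZED"] = 1.0
    return feats

weights = {"EMIT:NNP->Paris": 1.5, "EMIT:NNP->CAPITALIZED": 2.0, "EMIT:NN->dog": 1.0}
vocabulary = ["Paris", "dog", "runs"]
print(loglinear_cpd("Paris", "NNP", "EMIT", vocabulary, toy_features, weights))
```

Sharing a feature such as the capitalization indicator across many (word, tag) decisions is what distinguishes this parameterization from a plain multinomial.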
A conventional HMM is a special case of a featurized HMM (as long as no zero transition or emission probability is involved). We get a conventional HMM if we:

  • Use the set of "basic" features, i.e. indicator functions of all tuples $(y, y', \mathrm{TRANS})$ and $(x, y, \mathrm{EMIT})$;
  • Set the weight of each feature to the logarithm of the corresponding transition or emission probability (see the sketch below).
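A minimal sketch of this reduction, assuming a hypothetical toy emission CPD (all names and numbers are made up): with one indicator feature per tuple and log-probability weights, the softmax reproduces the original multinomial exactly.

```python
import math

# Hypothetical multinomial emission CPD of a conventional HMM: P(word | tag).
emit = {"NN": {"dog": 0.7, "cat": 0.3}, "VB": {"runs": 0.9, "dog": 0.1}}

# Basic features: one indicator per (decision, context, EMIT) tuple, with the weight
# of each indicator set to the log of the corresponding emission probability.
weights = {("EMIT", tag, word): math.log(p)
           for tag, dist in emit.items() for word, p in dist.items()}

def featurized_emit(word, tag):
    """Softmax over w . f(d, c, EMIT); each candidate fires exactly one indicator."""
    scores = {w: weights[("EMIT", tag, w)] for w in emit[tag]}
    z = sum(math.exp(s) for s in scores.values())
    return math.exp(scores[word]) / z

for tag, dist in emit.items():
    for word, p in dist.items():
        assert abs(featurized_emit(word, tag) - p) < 1e-12   # recovers the multinomial
print("basic features + log-probability weights reproduce the conventional HMM")
```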

The Estimation Problem

The estimation problem involves finding the marginal probability $P(\mathbf{X} = \mathbf{x} \mid \mathbf{w})$ for a given observation sequence $\mathbf{x}$, with the model parameters $\mathbf{w}$ known.

The forward-backward algorithm for conventional HMMs is directly applicable to featurized HMMs.
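For concreteness, here is a minimal forward-pass sketch that computes the marginal probability for the hypothetical two-state model used above; with a featurized HMM, the transition and emission tables would simply be filled in by the log-linear CPDs:

```python
# Forward algorithm: P(x | w) = sum over all state sequences of the joint probability.
# The trans and emit tables may come from multinomials or from log-linear CPDs.
def forward_marginal(xs, states, start, trans, emit):
    alpha = {s: start[s] * emit[s][xs[0]] for s in states}
    for x in xs[1:]:
        alpha = {s: emit[s][x] * sum(alpha[sp] * trans[sp][s] for sp in states)
                 for s in states}
    return sum(alpha.values())

# Hypothetical two-state model (same as in the earlier sketch).
states = ["A", "B"]
start = {"A": 0.6, "B": 0.4}
trans = {"A": {"A": 0.7, "B": 0.3}, "B": {"A": 0.4, "B": 0.6}}
emit = {"A": {"x": 0.5, "y": 0.5}, "B": {"x": 0.1, "y": 0.9}}
print(forward_marginal(["x", "y", "y"], states, start, trans, emit))
```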

The Decoding Problem

The decoding problem involves finding the most probable state sequence $\mathbf{y}^*$ given the observation sequence $\mathbf{x}$, with the model parameters $\mathbf{w}$ known:

$$\mathbf{y}^* = \arg\max_{\mathbf{y}} P(\mathbf{Y} = \mathbf{y} \mid \mathbf{X} = \mathbf{x}, \mathbf{w})$$
The Viterbi algorithm for conventional HMMs is directly applicable to featurized HMMs.
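A minimal Viterbi sketch for the same hypothetical model (again, the tables stand in for either multinomial or featurized CPDs):

```python
# Viterbi algorithm: the most probable state sequence for a given observation sequence.
def viterbi(xs, states, start, trans, emit):
    # delta[s] = (probability of the best path ending in state s, that path)
    delta = {s: (start[s] * emit[s][xs[0]], [s]) for s in states}
    for x in xs[1:]:
        new_delta = {}
        for s in states:
            p, path = max(((delta[sp][0] * trans[sp][s], delta[sp][1]) for sp in states),
                          key=lambda t: t[0])
            new_delta[s] = (p * emit[s][x], path + [s])
        delta = new_delta
    return max(delta.values(), key=lambda t: t[0])[1]

# Hypothetical two-state model (same as in the earlier sketches).
states = ["A", "B"]
start = {"A": 0.6, "B": 0.4}
trans = {"A": {"A": 0.7, "B": 0.3}, "B": {"A": 0.4, "B": 0.6}}
emit = {"A": {"x": 0.5, "y": 0.5}, "B": {"x": 0.1, "y": 0.9}}
print(viterbi(["x", "y", "y"], states, start, trans, emit))
```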

The Training Problem

The (unsupervised) training problem involves finding a set of model parameters $\mathbf{w}^*$ that maximizes the likelihood of the training observations $\mathbf{x}$:

$$\mathbf{w}^* = \arg\max_{\mathbf{w}} \log P(\mathbf{X} = \mathbf{x} \mid \mathbf{w})$$
Or, the objective function can be a regularized version of the log-likelihood:

$$\mathbf{w}^* = \arg\max_{\mathbf{w}} \left[ \log P(\mathbf{X} = \mathbf{x} \mid \mathbf{w}) - \kappa \|\mathbf{w}\|_2^2 \right]$$
The EM algorithm (also called the Baum-Welch algorithm) for unsupervised training of conventional HMMs can be adapted to featurized HMMs. The E-step calculates the expected count $e_{d,c,t}$ of each tuple $(d, c, t)$, which makes use of the forward-backward algorithm. The M-step tries to increase the following auxiliary function:

$$\ell(\mathbf{w}, \mathbf{e}) = \sum_{d,c,t} e_{d,c,t} \log P(d \mid c, \mathbf{w}) - \kappa \|\mathbf{w}\|_2^2$$

whose gradient w.r.t. $\mathbf{w}$ is:

$$\nabla_{\mathbf{w}} \ell(\mathbf{w}, \mathbf{e}) = \sum_{d,c,t} e_{d,c,t} \left[ \mathbf{f}(d, c, t) - \sum_{d'} P(d' \mid c, \mathbf{w}) \, \mathbf{f}(d', c, t) \right] - 2\kappa \mathbf{w}$$
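As a sketch of this gradient computation (the expected counts, candidate sets, and features below are hypothetical toy values, and the regularization term is omitted for brevity):

```python
import math
from collections import defaultdict

def auxiliary_gradient(expected_counts, candidates, features, weights):
    """Gradient of sum_{d,c,t} e_{d,c,t} * log P(d | c, w) for log-linear CPDs."""
    grad = defaultdict(float)
    for (d, c, t), e in expected_counts.items():
        # Softmax distribution over the candidate decisions for context (c, t).
        scores = {dp: sum(weights.get(k, 0.0) * v for k, v in features(dp, c, t).items())
                  for dp in candidates[(c, t)]}
        m = max(scores.values())
        exp_s = {dp: math.exp(s - m) for dp, s in scores.items()}
        z = sum(exp_s.values())
        # Observed-feature term minus expected-feature term, scaled by the expected count.
        for k, v in features(d, c, t).items():
            grad[k] += e * v
        for dp, es in exp_s.items():
            for k, v in features(dp, c, t).items():
                grad[k] -= e * (es / z) * v
    return dict(grad)

# Hypothetical toy inputs: pure indicator features, a single EMIT context.
features = lambda d, c, t: {f"{t}:{c}->{d}": 1.0}
candidates = {("NN", "EMIT"): ["dog", "cat"]}
expected_counts = {("dog", "NN", "EMIT"): 3.0, ("cat", "NN", "EMIT"): 1.0}
print(auxiliary_gradient(expected_counts, candidates, features, {}))
```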
Two versions of the EM algorithm are proposed in Berg-Kirkpatrick et al., ACL 2010:

(Figure: Featurized HMM EM.png, pseudocode of the two EM algorithms)

The subroutine "climb" represents one step of any generic optimization algorithm, such as gradient ascent or L-BFGS.

The differences between the two algorithms are:

  • In the M-step, Algorithm 1 iterates until the auxiliary function converges to a local maximum, while Algorithm 2 performs only one iteration.
  • The M-step of Algorithm 2 is formulated to increase the objective function directly, instead of increasing the auxiliary function. Nevertheless, the appendix gives a proof that the gradients of the objective and auxiliary functions are identical at the point where the expected counts were computed (the two training loops are contrasted in the sketch below).
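A schematic contrast of the two training loops. The helpers below are trivial stand-ins so that the control flow runs; in a real implementation estep_expected_counts would run forward-backward and auxiliary_gradient/climb would be as sketched above (all of these names are assumptions, not taken from the original paper):

```python
# Trivial stand-in helpers (hypothetical), just so the control flow below executes.
def estep_expected_counts(w, data):
    return {}                                    # real version: forward-backward counts

def auxiliary_gradient(e, w):
    return [0.0] * len(w)                        # real version: gradient sketched above

def climb(w, grad, lr=0.1):
    return [wi + lr * gi for wi, gi in zip(w, grad)]   # one generic gradient-ascent step

def em_algorithm_1(w, data, iterations=10, mstep_iterations=20):
    for _ in range(iterations):
        e = estep_expected_counts(w, data)       # E-step
        for _ in range(mstep_iterations):        # M-step: optimize the auxiliary function
            w = climb(w, auxiliary_gradient(e, w))   # until (approximately) converged
    return w

def em_algorithm_2(w, data, iterations=10):
    for _ in range(iterations):
        e = estep_expected_counts(w, data)       # E-step
        w = climb(w, auxiliary_gradient(e, w))   # M-step: a single climb step
    return w

print(em_algorithm_2([0.0, 0.0], data=[]))
```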

Algorithm 2 is shown to perform better [Berg-Kirkpatrick et al., ACL 2010]. It can also be expected to converge faster: since each E-step changes the auxiliary function by changing the expected counts, there is little point in fully maximizing the auxiliary function in every iteration (this argument is the opinion of Maigo). However, in cases where the E-step is significantly more time-consuming than the M-step, Algorithm 1 may be preferred.