Featurized HMM

From Cohen Courses

This is a method proposed in Berg-Kirkpatrick, ACL 2010.

Featurized HMMs are generative probabilistic models that generalize conventional HMMs by replacing the multinomial conditional probability distributions (CPDs) with miniature log-linear models.

Definition

In the following text, <math>\mathbf{X} = X_1 \ldots X_n</math> stands for a random observation sequence, and <math>\mathbf{Y} = Y_1 \ldots Y_n</math> stands for a random state sequence. Their lower-case counterparts <math>\mathbf{x}</math> and <math>\mathbf{y}</math> stand for possible values of the observations and states.

Assuming a special starting state <math>y_0</math>, an HMM defines a joint probability over <math>\mathbf{X}</math> and <math>\mathbf{Y}</math>:

<math>
P(\mathbf{X} = \mathbf{x}, \mathbf{Y} = \mathbf{y}) = \prod_{i=1}^{n} P(Y_i = y_i \mid Y_{i-1} = y_{i-1}) \cdot P(X_i = x_i \mid Y_i = y_i)
</math>

While a conventional HMM defines the transition probabilities <math>P(Y_i = y \mid Y_{i-1} = y')</math> and emission probabilities <math>P(X_i = x \mid Y_i = y)</math> as multinomial distributions, a featurized HMM defines these probabilities in terms of a collection of features <math>\mathbf{f}</math> and their weights <math>\mathbf{w}</math>:

<math>
P(Y_i = y \mid Y_{i-1} = y') = \frac {\exp \mathbf{w}^T \mathbf{f}(y, y', \mathrm{TRANS})} {\sum_{y''} \exp \mathbf{w}^T \mathbf{f}(y'', y', \mathrm{TRANS})}
</math>

<math>
P(X_i = x \mid Y_i = y) = \frac {\exp \mathbf{w}^T \mathbf{f}(x, y, \mathrm{EMIT})} {\sum_{x'} \exp \mathbf{w}^T \mathbf{f}(x', y, \mathrm{EMIT})}
</math>

For conciseness, the tuples <math>(y, y', \mathrm{TRANS})</math> and <math>(x, y, \mathrm{EMIT})</math> are written as <math>(d, c, t)</math> (standing for decision, context, and type respectively):

<math>
\theta_{d,c,t}(\mathbf{w}) = \frac {\exp \mathbf{w}^T \mathbf{f}(d,c,t)} {\sum_{d'} \exp \mathbf{w}^T \mathbf{f}(d',c,t)}
</math>
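To make the log-linear parameterization concrete, here is a minimal Python/NumPy sketch of computing <math>\theta_{d,c,t}(\mathbf{w})</math> for all decisions at once. The feature function and the set of candidate decisions are hypothetical placeholders, not code from the paper:

<pre>
import numpy as np

def theta(w, feature_fn, decisions, c, t):
    """Featurized CPD theta_{d,c,t}(w) for every decision d, given context c and type t.

    w          -- weight vector, shape (num_features,)
    feature_fn -- maps (d, c, t) to a feature vector of shape (num_features,)
    decisions  -- all candidate decisions d' (states for TRANS, words for EMIT)
    """
    scores = np.array([w @ feature_fn(d, c, t) for d in decisions])
    scores -= scores.max()          # subtract the max for numerical stability
    probs = np.exp(scores)
    return probs / probs.sum()      # normalize over the decisions d'
</pre>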

Relationship to Conventional HMMs

A conventional HMM is a special case of a featurized HMM (as long as no zero transition or emission probability is involved). We get a conventional HMM if we:

  • Use the set of "basic" features, i.e. indicator functions of all tuples <math>(y, y', \mathrm{TRANS})</math> and <math>(x, y, \mathrm{EMIT})</math>;
  • Set the weights of the features to the logarithm of the corresponding transition or emission probabilities (see the sketch after this list).
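To illustrate this special case, the following toy sketch (reusing the hypothetical <code>theta</code> helper above, with context and type fixed for simplicity) builds indicator features over a three-word vocabulary and sets the weights to log-probabilities; the softmax then reproduces the original multinomial emission distribution exactly:

<pre>
import numpy as np

# A toy multinomial emission distribution P(x | NOUN) over a three-word vocabulary.
vocab = ["the", "dog", "barks"]
emission = {"the": 0.5, "dog": 0.3, "barks": 0.2}

# "Basic" features: one indicator per decision (context and type are fixed in this toy example).
def basic_features(d, c, t):
    f = np.zeros(len(vocab))
    f[vocab.index(d)] = 1.0
    return f

# Weights = log of the corresponding emission probabilities.
w = np.log(np.array([emission[x] for x in vocab]))

print(theta(w, basic_features, vocab, c="NOUN", t="EMIT"))
# -> [0.5, 0.3, 0.2], i.e. the original multinomial is recovered
</pre>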

The Estimation Problem

The estimation problem involves finding the marginal probability <math>P(\mathbf{X} = \mathbf{x})</math> for a given observation sequence <math>\mathbf{x}</math>, with the model parameters <math>\mathbf{w}</math> known.

The forward-backward algorithm for conventional HMMs is directly applicable to featurized HMMs.
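For concreteness, here is a minimal sketch of the forward pass that computes the marginal probability <math>P(\mathbf{X} = \mathbf{x})</math>. It assumes the transition and emission probabilities have already been tabulated as dense arrays (for example by evaluating the featurized CPDs above); that representation is an assumption made here, not something specified on this page:

<pre>
import numpy as np

def forward_marginal(init, trans, emit, obs):
    """Marginal probability P(X = x) of an observation sequence under an HMM.

    init  -- init[y]          = P(Y_1 = y | y_0),              shape (S,)
    trans -- trans[y_prev, y] = P(Y_i = y | Y_{i-1} = y_prev), shape (S, S)
    emit  -- emit[y, x]       = P(X_i = x | Y_i = y),          shape (S, V)
    obs   -- observation sequence as a list of integer word ids
    """
    alpha = init * emit[:, obs[0]]              # alpha_1(y)
    for x in obs[1:]:
        # alpha_i(y) = sum_{y'} alpha_{i-1}(y') P(y | y') P(x | y)
        alpha = (alpha @ trans) * emit[:, x]
    return alpha.sum()                          # sum over the final state
</pre>

In practice <code>alpha</code> would be rescaled (or kept in log space) at each step to avoid numerical underflow.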

The Decoding Problem

The decoding problem involves finding the most probable state sequence <math>\mathbf{y}^*</math> given the observation sequence <math>\mathbf{x}</math>, with the model parameters <math>\mathbf{w}</math> known:

<math>
\mathbf{y}^* = \arg\max_{\mathbf{y}} P(\mathbf{Y} = \mathbf{y} \mid \mathbf{X} = \mathbf{x})
</math>

The Viterbi algorithm for conventional HMMs is directly applicable to featurized HMMs.
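Likewise, a minimal Viterbi sketch under the same (assumed) dense-array representation of the transition and emission probabilities:

<pre>
import numpy as np

def viterbi(init, trans, emit, obs):
    """Most probable state sequence for an observation sequence (same array conventions as above)."""
    log_trans = np.log(trans)
    delta = np.log(init) + np.log(emit[:, obs[0]])   # best log-score of any path ending in each state
    backptr = []
    for x in obs[1:]:
        scores = delta[:, None] + log_trans          # scores[y_prev, y]
        backptr.append(scores.argmax(axis=0))        # best predecessor of each state
        delta = scores.max(axis=0) + np.log(emit[:, x])
    # Follow the back-pointers from the best final state.
    path = [int(delta.argmax())]
    for bp in reversed(backptr):
        path.append(int(bp[path[-1]]))
    return list(reversed(path))
</pre>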

The Training Problem

The (unsupervised) training problem involves finding a set of model parameters <math>\mathbf{w}^*</math> to maximize the likelihood of the training observations <math>\mathbf{x}</math>:

<math>
\mathbf{w}^* = \arg\max_{\mathbf{w}} L(\mathbf{w}), \qquad L(\mathbf{w}) = \log P(\mathbf{X} = \mathbf{x}; \mathbf{w})
</math>

Or, the objective function can be a regularized version of the log-likelihood:

<math>
L(\mathbf{w}) = \log P(\mathbf{X} = \mathbf{x}; \mathbf{w}) - \kappa \|\mathbf{w}\|_2^2
</math>

The EM algorithm (also called the Baum-Welch algorithm) for unsupervised training of conventional HMMs can be adapted for featurized HMMs. The E-step calculates the expected count <math>e_{d,c,t}</math> for each tuple <math>(d,c,t)</math>, which makes use of the forward-backward algorithm. The M-step tries to increase the following auxiliary function:

<math>
\ell(\mathbf{w}, \mathbf{e}) = \sum_{d,c,t} e_{d,c,t} \log \theta_{d,c,t}(\mathbf{w}) - \kappa \|\mathbf{w}\|_2^2
</math>

whose gradient w.r.t. <math>\mathbf{w}</math> is:

<math>
\nabla_{\mathbf{w}} \, \ell(\mathbf{w}, \mathbf{e}) = \sum_{d,c,t} e_{d,c,t} \left( \mathbf{f}(d,c,t) - \sum_{d'} \theta_{d',c,t}(\mathbf{w}) \, \mathbf{f}(d',c,t) \right) - 2 \kappa \mathbf{w}
</math>
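A direct implementation of this gradient might look like the following sketch. The data structures (a dictionary of expected counts keyed by <math>(d,c,t)</math> tuples and a map from <math>(c,t)</math> to the candidate decisions) are assumptions made for illustration:

<pre>
import numpy as np

def auxiliary_gradient(w, e_counts, feature_fn, decisions, kappa=0.0):
    """Gradient of l(w, e) = sum_{d,c,t} e_{d,c,t} log theta_{d,c,t}(w) - kappa ||w||^2.

    e_counts   -- dict mapping (d, c, t) to the expected count e_{d,c,t} from the E-step
    feature_fn -- maps (d, c, t) to a feature vector, shape (num_features,)
    decisions  -- dict mapping (c, t) to the list of candidate decisions d'
    """
    grad = -2.0 * kappa * w
    for (c, t), cands in decisions.items():
        feats = np.array([feature_fn(d, c, t) for d in cands])   # shape (|D|, num_features)
        scores = feats @ w
        probs = np.exp(scores - scores.max())
        probs /= probs.sum()                     # theta_{d',c,t}(w) for all d'
        expected_feats = probs @ feats           # sum_{d'} theta_{d',c,t}(w) f(d',c,t)
        for i, d in enumerate(cands):
            e = e_counts.get((d, c, t), 0.0)
            grad += e * (feats[i] - expected_feats)
    return grad
</pre>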

Two versions of the EM algorithm are proposed in Berg-Kirkpatrick, ACL 2010:

[Image: Featurized HMM EM.png, pseudocode for the two EM algorithms (Algorithm 1 and Algorithm 2)]

The subroutine "climb" represents one step of any generic optimization algorithm, such as gradient ascent or L-BFGS.

The differences between the two algorithms are:

  • In the M-step, Algorithm 1 iterates until the auxiliary function converges to a local maximum, while Algorithm 2 performs only one iteration.
  • The M-step of Algorithm 2 is formulated to increase the objective function directly, instead of increasing the auxiliary function. Nevertheless, the appendix gives a proof that the gradients of the objective and auxiliary functions are identical.

Algorithm 2 is shown to perform better [Berg-Kirkpatrick, ACL 2010]. It can also be expected to converge faster: since each E-step changes the auxiliary function by changing the expected counts, there is little point in finding a local maximum of the auxiliary function in every iteration (this sentence is the opinion of Maigo). However, in cases where the E-step is significantly more time-consuming than the M-step, Algorithm 1 may be preferred.
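To summarize the difference in code, here is a schematic training loop covering both variants. The callables <code>e_step</code>, <code>grad</code> (e.g. the <code>auxiliary_gradient</code> sketch above) and <code>climb</code> are hypothetical and must be supplied by the caller:

<pre>
def em_train(w, e_step, grad, climb, num_iters, inner_iters=1):
    """Generic EM loop for a featurized HMM.

    inner_iters = 1   -> Algorithm 2 (a single "climb" step per M-step)
    inner_iters large -> approximates Algorithm 1 (optimize the auxiliary function in each M-step)

    e_step(w)   -- runs forward-backward and returns the expected counts e
    grad(w, e)  -- gradient of the auxiliary function at w
    climb(w, g) -- one step of a generic optimizer (gradient ascent, L-BFGS, ...)
    """
    for _ in range(num_iters):
        e = e_step(w)                        # E-step
        for _ in range(inner_iters):         # M-step
            w = climb(w, grad(w, e))
    return w
</pre>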


Comments

Something I realized about featurized HMMs versus CRFs: I think CRF decoding is faster. For a featurized HMM, you have to compute a local partition function for every word, and that partition function sums over the vocabulary. Not trivial. A CRF doesn't have this: Viterbi just does scoring, without any local normalization. So a CRF's global normalization makes learning trickier but decoding easier, while a featurized HMM's local normalization makes learning easier but decoding harder. Hmmm. --Brendan 18:50, 13 October 2011 (UTC)