Featurized HMM
This is a method proposed in Berg-Kirkpatrick, ACL 2010: Painless Unsupervised Learning with Features.
Featurized HMMs are generative probabilistic models that generalize conventional HMMs by replacing the multinomial conditional probability distributions (CPDs) with miniature log-linear models.
== Definition ==
In the following text, <math>\mathbf{X} = (X_1, \ldots, X_n)</math> stands for a random observation sequence, and <math>\mathbf{Y} = (Y_1, \ldots, Y_n)</math> stands for a random state sequence. Their lower-case counterparts <math>\mathbf{x}</math> and <math>\mathbf{y}</math> stand for possible values of the observations and states.
Assuming a special starting state <math>y_0</math>, an HMM defines a joint probability over <math>\mathbf{X}</math> and <math>\mathbf{Y}</math>:

<math>
P(\mathbf{X} = \mathbf{x}, \mathbf{Y} = \mathbf{y}) = \prod_{i=1}^{n} P(Y_i = y_i | Y_{i-1} = y_{i-1}) \cdot P(X_i = x_i | Y_i = y_i)
</math>
While a conventional HMM defines the transition probabilities <math>P(Y_i = y_i | Y_{i-1} = y_{i-1})</math> and emission probabilities <math>P(X_i = x_i | Y_i = y_i)</math> as multinomial distributions, a featurized HMM defines these probabilities in terms of a collection of features <math>\mathbf{f}</math> and their weights <math>\mathbf{w}</math>:

<math>
P(Y_i = y_i | Y_{i-1} = y_{i-1}) = \frac {\exp \mathbf{w}^T \mathbf{f}(y_i, y_{i-1}, \mathrm{TRANS})} {\sum_{y'} \exp \mathbf{w}^T \mathbf{f}(y', y_{i-1}, \mathrm{TRANS})}
</math>

<math>
P(X_i = x_i | Y_i = y_i) = \frac {\exp \mathbf{w}^T \mathbf{f}(x_i, y_i, \mathrm{EMIT})} {\sum_{x'} \exp \mathbf{w}^T \mathbf{f}(x', y_i, \mathrm{EMIT})}
</math>
For conciseness, the tuples <math>(y_i, y_{i-1}, \mathrm{TRANS})</math> and <math>(x_i, y_i, \mathrm{EMIT})</math> are written as <math>(d, c, t)</math> (standing for decision, context, and type respectively):

<math>
\theta_{d,c,t}(\mathbf{w}) = \frac {\exp \mathbf{w}^T \mathbf{f}(d,c,t)} {\sum_{d'} \exp \mathbf{w}^T \mathbf{f}(d',c,t)}
</math>
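Concretely, each CPD is a softmax over the candidate decisions for a given context and type. The following is a minimal sketch (not from the paper) of how <math>\theta_{d,c,t}(\mathbf{w})</math> might be computed; the feature function <code>features(d, c, t)</code>, which returns a sparse dictionary of feature values, is a hypothetical stand-in:

<pre>
import math

def theta(w, d, c, t, decisions, features):
    """Log-linear CPD: probability of decision d given context c and type t.

    w         : dict mapping feature names to weights
    decisions : all candidate decisions d' for this (c, t)
    features  : hypothetical function (d, c, t) -> sparse dict of feature values
    """
    def score(dec):
        return sum(w.get(name, 0.0) * value
                   for name, value in features(dec, c, t).items())

    scores = {dec: score(dec) for dec in decisions}
    m = max(scores.values())                      # subtract max for numerical stability
    exps = {dec: math.exp(s - m) for dec, s in scores.items()}
    z = sum(exps.values())                        # local partition function over decisions
    return exps[d] / z
</pre>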
== Relationship to Conventional HMMs ==
A conventional HMM is a special case of a featurized HMM (as long as no zero transition or emission probability is involved). We get a conventional HMM if we:
- Use the set of "basic" features, i.e. indicator functions of all tuples <math>(y_i, y_{i-1}, \mathrm{TRANS})</math> and <math>(x_i, y_i, \mathrm{EMIT})</math>;
- Set the weights of the features to the logarithms of the corresponding transition or emission probabilities (see the sketch below).
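As an illustration (a minimal sketch under the assumptions above, not code from the paper), the following builds weights for hypothetical basic indicator features from given multinomial transition and emission tables. Because each multinomial row already sums to one, the local normalizer of the softmax equals one and the log-linear CPDs reproduce the original probabilities exactly:

<pre>
import math

def basic_feature_weights(trans, emit):
    """Weights for the "basic" indicator features that reproduce a given
    conventional HMM (all probabilities assumed strictly positive).

    trans : dict mapping (y, y_prev) -> P(Y_i = y | Y_{i-1} = y_prev)
    emit  : dict mapping (x, y)      -> P(X_i = x | Y_i = y)
    """
    w = {}
    for (y, y_prev), p in trans.items():
        w[("TRANS", y, y_prev)] = math.log(p)   # weight = log transition probability
    for (x, y), p in emit.items():
        w[("EMIT", x, y)] = math.log(p)         # weight = log emission probability
    return w
</pre>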
== The Estimation Problem ==
The estimation problem involves finding the marginal probability <math>P(\mathbf{X} = \mathbf{x})</math> for a given observation sequence <math>\mathbf{x}</math>, with the model parameters <math>\mathbf{w}</math> known.
The forward-backward algorithm for conventional HMMs is directly applicable to featurized HMMs.
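For example, the marginal probability can be computed with the standard forward recursion; the only featurized-HMM-specific part is that the transition and emission probabilities are backed by the log-linear CPDs <math>\theta</math>. The sketch below assumes the probabilities are supplied as callables (e.g. wrappers around the <code>theta</code> sketch above):

<pre>
def forward_marginal(obs, states, start, trans_prob, emit_prob):
    """Forward algorithm: returns P(X = obs) under the (featurized) HMM.

    trans_prob(y, y_prev) and emit_prob(x, y) may be wrappers around the
    log-linear CPDs theta; the recursion itself is unchanged.  A real
    implementation would work in log space to avoid underflow.
    """
    # alpha[y] = P(observations so far, current state = y)
    alpha = {y: trans_prob(y, start) * emit_prob(obs[0], y) for y in states}
    for x in obs[1:]:
        alpha = {y: sum(alpha[y_prev] * trans_prob(y, y_prev) for y_prev in states)
                    * emit_prob(x, y)
                 for y in states}
    return sum(alpha.values())
</pre>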
== The Decoding Problem ==
The decoding problem involves finding the most probable state sequence <math>\mathbf{y}^*</math> given the observation sequence <math>\mathbf{x}</math>, with the model parameters <math>\mathbf{w}</math> known:

<math>
\mathbf{y}^* = \arg\max_{\mathbf{y}} P(\mathbf{Y} = \mathbf{y} | \mathbf{X} = \mathbf{x})
</math>
The Viterbi algorithm for conventional HMMs is directly applicable to featurized HMMs.
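A sketch of standard Viterbi decoding under the same assumptions as above (transition and emission probabilities supplied as callables); nothing here is specific to the featurized parameterization:

<pre>
import math

def viterbi_decode(obs, states, start, trans_prob, emit_prob):
    """Viterbi algorithm: most probable state sequence for the observations."""
    # delta[y] = best log-probability of any state sequence ending in state y
    delta = {y: math.log(trans_prob(y, start)) + math.log(emit_prob(obs[0], y))
             for y in states}
    backpointers = []                             # one dict per position 1..n-1
    for x in obs[1:]:
        new_delta, ptr = {}, {}
        for y in states:
            best_prev = max(states, key=lambda y_prev: delta[y_prev]
                            + math.log(trans_prob(y, y_prev)))
            new_delta[y] = (delta[best_prev] + math.log(trans_prob(y, best_prev))
                            + math.log(emit_prob(x, y)))
            ptr[y] = best_prev
        delta = new_delta
        backpointers.append(ptr)
    # trace the best path backwards from the best final state
    y = max(states, key=lambda s: delta[s])
    path = [y]
    for ptr in reversed(backpointers):
        y = ptr[y]
        path.append(y)
    return list(reversed(path))
</pre>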
== The Training Problem ==
The (unsupervised) training problem involves finding a set of model parameters <math>\mathbf{w}^*</math> to maximize the likelihood of the training observations <math>\mathbf{x}</math>:

<math>
\mathbf{w}^* = \arg\max_{\mathbf{w}} L(\mathbf{w}) = \arg\max_{\mathbf{w}} \log P(\mathbf{X} = \mathbf{x}; \mathbf{w})
</math>
Or, the objective function <math>L(\mathbf{w})</math> can be a regularized version of the log-likelihood:

<math>
L(\mathbf{w}) = \log P(\mathbf{X} = \mathbf{x}; \mathbf{w}) - \kappa \|\mathbf{w}\|_2^2
</math>
The EM algorithm (also called the Baum-Welch algorithm) for unsupervised training of conventional HMMs can be adapted for featurized HMMs. The E-step calculates the expected count <math>e_{d,c,t}</math> of each tuple <math>(d,c,t)</math>, which makes use of the forward-backward algorithm. The M-step tries to increase the following auxiliary function:

<math>
\ell(\mathbf{w}, \mathbf{e}) = \sum_{d,c,t} e_{d,c,t} \log \theta_{d,c,t}(\mathbf{w})
</math>
whose gradient w.r.t. <math>\mathbf{w}</math> is:

<math>
\nabla_{\mathbf{w}} \ell(\mathbf{w}, \mathbf{e}) = \sum_{d,c,t} e_{d,c,t} \left( \mathbf{f}(d,c,t) - \sum_{d'} \theta_{d',c,t}(\mathbf{w}) \mathbf{f}(d',c,t) \right)
</math>
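A sketch of this gradient computation, reusing the hypothetical <code>theta</code> and <code>features</code> helpers from the Definition section and assuming the expected counts come from the forward-backward E-step:

<pre>
from collections import defaultdict

def auxiliary_gradient(w, expected_counts, contexts, features):
    """Gradient of l(w, e) = sum_{d,c,t} e_{d,c,t} * log theta_{d,c,t}(w).

    expected_counts : dict mapping (d, c, t) -> e_{d,c,t} from the E-step
    contexts        : dict mapping (c, t) -> list of candidate decisions d'
    features        : hypothetical function (d, c, t) -> sparse dict of values
    """
    grad = defaultdict(float)
    for (d, c, t), e in expected_counts.items():
        # observed feature vector, weighted by the expected count
        for name, value in features(d, c, t).items():
            grad[name] += e * value
        # minus the expected feature vector under theta_{., c, t}(w)
        for d_prime in contexts[(c, t)]:
            p = theta(w, d_prime, c, t, contexts[(c, t)], features)
            for name, value in features(d_prime, c, t).items():
                grad[name] -= e * p * value
    return dict(grad)
</pre>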
Two versions of the EM algorithm are proposed in Berg-Kirkpatrick, ACL 2010 (their Algorithms 1 and 2). In their pseudocode, the subroutine "climb" represents one step of any generic optimization algorithm, such as gradient ascent or L-BFGS.

The differences between the two algorithms are:
- In the M-step, Algorithm 1 iterates until the auxiliary function converges to a local maximum, while Algorithm 2 performs only one iteration.
- The M-step of Algorithm 2 is formulated to increase the objective function directly, instead of increasing the auxiliary function. Nevertheless, the paper's appendix proves that the gradient of the objective function equals the gradient of the auxiliary function when the expected counts are computed at the current parameters.
Algorithm 2 is shown to perform better [Berg-Kirkpatrick, ACL 2010]. It can also be expected to converge faster: since the E-step changes the auxiliary function by updating the expected counts, there is little point in driving the auxiliary function all the way to a local maximum in each iteration (this sentence is the opinion of Maigo). However, in cases where the E-step is significantly more time-consuming than the M-step, Algorithm 1 may be preferred. A schematic comparison of the two training loops is sketched below.
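The following is a schematic restatement (my own sketch, not the paper's pseudocode) contrasting the two training loops; <code>climb</code> stands for one step of any gradient-based optimizer, as described above, and the E-step, gradient, and convergence details are supplied by the caller:

<pre>
def train_algorithm_1(w, num_iters, inner_iters, e_step, climb, aux_grad):
    """EM: each M-step climbs the auxiliary function until (near) convergence.

    A fixed inner iteration cap stands in for a real convergence test.
    """
    for _ in range(num_iters):
        e = e_step(w)                      # E-step: expected counts (forward-backward)
        for _ in range(inner_iters):       # M-step: climb l(., e) to a local maximum
            w = climb(w, aux_grad(w, e))
    return w

def train_algorithm_2(w, num_iters, e_step, climb, aux_grad):
    """Direct gradient: a single climb step per iteration.

    The gradient of the (regularized) log-likelihood at the current w equals
    the gradient of the auxiliary function with e computed at that w, so the
    same gradient routine can be reused.
    """
    for _ in range(num_iters):
        e = e_step(w)                      # E-step: expected counts (forward-backward)
        w = climb(w, aux_grad(w, e))       # M-step: a single optimizer step
    return w
</pre>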
== Comments ==
Something I realized about featurized HMMs versus CRFs: I think CRF decoding is faster. For a featurized HMM, you have to compute a local partition function for every emission, which sums over the whole vocabulary -- not trivial. A CRF doesn't have this: Viterbi just scores without any local normalization. So a CRF's global normalization makes learning trickier but decoding easier, while a featurized HMM's local normalization makes learning easier but decoding harder. --Brendan 18:50, 13 October 2011 (UTC)