Featurized HMM


This is a method proposed in Berg-Kirkpatrick, ACL 2010: Painless Unsupervised Learning with Features.

Featurized HMMs are generative probabilistic models that generalize conventional HMMs by replacing the multinomial conditional probability distributions (CPDs) with miniature log-linear models.

Definition

In the following text, <math>\mathbf{Y}</math> stands for a random observation sequence and <math>\mathbf{Z}</math> for a random state sequence. Their lower-case counterparts <math>\mathbf{y}</math> and <math>\mathbf{z}</math> stand for particular values of the observations and states.

Assuming a special starting state <math>z_0</math>, an HMM defines a joint probability over <math>\mathbf{Y}</math> and <math>\mathbf{Z}</math>:

<math>P(\mathbf{Y} = \mathbf{y}, \mathbf{Z} = \mathbf{z}) = \prod_{i=1}^{n} P(z_i \mid z_{i-1}) \cdot P(y_i \mid z_i)</math>

While a conventional HMM defines the transition probabilities <math>P(z_i \mid z_{i-1})</math> and emission probabilities <math>P(y_i \mid z_i)</math> as multinomial distributions, a featurized HMM defines these probabilities in terms of a collection of feature functions <math>\mathbf{f}</math> and their weights <math>\mathbf{w}</math>:

<math>P(z_i \mid z_{i-1}) = \frac {\exp \mathbf{w}^T \mathbf{f}(z_i, z_{i-1}, \mathrm{trans})} {\sum_{z'} \exp \mathbf{w}^T \mathbf{f}(z', z_{i-1}, \mathrm{trans})}</math>

<math>P(y_i \mid z_i) = \frac {\exp \mathbf{w}^T \mathbf{f}(y_i, z_i, \mathrm{emit})} {\sum_{y'} \exp \mathbf{w}^T \mathbf{f}(y', z_i, \mathrm{emit})}</math>

For conciseness, the tuples <math>(z_i, z_{i-1}, \mathrm{trans})</math> and <math>(y_i, z_i, \mathrm{emit})</math> are written as <math>(d, c, t)</math> (standing for decision, context, and type respectively):

<math>\theta_{d,c,t}(\mathbf{w}) = \frac {\exp \mathbf{w}^T \mathbf{f}(d,c,t)} {\sum_{d'} \exp \mathbf{w}^T \mathbf{f}(d',c,t)}</math>
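
To make the parameterization concrete, here is a minimal NumPy sketch (not code from the paper) of how one (context, type) block of the CPD <math>\theta_{d,c,t}(\mathbf{w})</math> can be computed as a softmax over decisions; the feature vectors and weights are hypothetical dense arrays.

<pre>
import numpy as np

def cpd_block(w, feats):
    """theta_{., c, t}(w) for one fixed (context, type):
    feats[d] is the feature vector f(d, c, t); returns a distribution over d."""
    scores = feats @ w                  # w^T f(d, c, t) for every decision d
    scores -= scores.max()              # subtract the max for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

# Toy example: 3 possible decisions, 4 features (hypothetical values).
rng = np.random.default_rng(0)
feats = rng.normal(size=(3, 4))
w = rng.normal(size=4)
theta = cpd_block(w, feats)
print(theta, theta.sum())               # a proper probability distribution over d
</pre>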

Relationship to Conventional HMMs

A conventional HMM is a special case of a featurized HMM (as long as no zero transition or emission probability is involved). We get a conventional HMM if we:

  • Use the set of "basic" features, i.e. indicator functions of all tuples <math>(z_i, z_{i-1}, \mathrm{trans})</math> and <math>(y_i, z_i, \mathrm{emit})</math>;
  • Set the weights of the features to the logarithms of the corresponding transition or emission probabilities (see the sketch below).
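
A minimal sketch of this special case (my own illustration, not from the paper): with one indicator feature per tuple and weights set to log-probabilities, the log-linear CPD reproduces the original multinomial exactly.

<pre>
import numpy as np

probs = np.array([0.2, 0.5, 0.3])   # a multinomial CPD over decisions (all nonzero)
feats = np.eye(len(probs))          # "basic" indicator features, one per decision
w = np.log(probs)                   # weights = log of the corresponding probabilities

scores = feats @ w                  # here scores[d] = log probs[d]
theta = np.exp(scores - scores.max())
theta /= theta.sum()
print(np.allclose(theta, probs))    # True: the conventional HMM CPD is recovered
</pre>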

The Estimation Problem

The estimation problem involves finding the marginal probability <math>P(\mathbf{Y} = \mathbf{y})</math> of a given observation sequence <math>\mathbf{y}</math>, with the model parameters <math>\mathbf{w}</math> known.

The forward-backward algorithm for conventional HMMs is directly applicable to featurized HMMs.
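
As a concrete illustration (a sketch under assumed dense matrices, not the paper's implementation): once the featurized CPDs have been evaluated into ordinary transition and emission matrices, the standard forward recursion computes the marginal probability.

<pre>
import numpy as np

def forward_marginal(start, trans, emit, obs):
    """start[k]   : P(z_1 = k | start state)
       trans[j,k] : P(z_i = k | z_{i-1} = j)
       emit[k,o]  : P(y_i = o | z_i = k)
       obs        : observation indices y_1 .. y_n
       Returns the marginal probability P(Y = y)."""
    alpha = start * emit[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ trans) * emit[:, o]   # forward recursion
    return alpha.sum()

# Toy matrices (hypothetical); in a featurized HMM they come from the softmax CPDs.
start = np.array([0.6, 0.4])
trans = np.array([[0.7, 0.3], [0.2, 0.8]])
emit  = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
print(forward_marginal(start, trans, emit, [0, 2, 1]))
</pre>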

The Decoding Problem

The decoding problem involves finding the most probable state sequence <math>\mathbf{z}^*</math> given the observation sequence <math>\mathbf{y}</math>, with the model parameters <math>\mathbf{w}</math> known:

<math>\mathbf{z}^* = \arg\max_{\mathbf{z}} P(\mathbf{Z} = \mathbf{z} \mid \mathbf{Y} = \mathbf{y})</math>

The Viterbi algorithm for conventional HMMs is directly applicable to featurized HMMs.
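
Likewise, a minimal log-space Viterbi sketch, using the same assumed dense matrices as above:

<pre>
import numpy as np

def viterbi(start, trans, emit, obs):
    """Most probable state sequence for observation indices obs (log-space Viterbi)."""
    n, K = len(obs), len(start)
    delta = np.log(start) + np.log(emit[:, obs[0]])
    back = np.zeros((n, K), dtype=int)
    for i in range(1, n):
        scores = delta[:, None] + np.log(trans)           # scores[j, k]: come from j, go to k
        back[i] = scores.argmax(axis=0)                   # best predecessor for each state k
        delta = scores.max(axis=0) + np.log(emit[:, obs[i]])
    path = [int(delta.argmax())]
    for i in range(n - 1, 0, -1):                         # follow the back-pointers
        path.append(int(back[i][path[-1]]))
    return path[::-1]

start = np.array([0.6, 0.4])
trans = np.array([[0.7, 0.3], [0.2, 0.8]])
emit  = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
print(viterbi(start, trans, emit, [0, 2, 1]))
</pre>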

The Training Problem

The (unsupervised) training problem involves finding a set of model parameters <math>\mathbf{w}^*</math> to maximize the likelihood of the training observations <math>\mathbf{y}</math>:

<math>\mathbf{w}^* = \arg\max_{\mathbf{w}} \log P(\mathbf{Y} = \mathbf{y}; \mathbf{w})</math>

Or, the objective function can be a regularized version of the log-likelihood:

<math>\mathbf{w}^* = \arg\max_{\mathbf{w}} \left[ \log P(\mathbf{Y} = \mathbf{y}; \mathbf{w}) - \kappa \|\mathbf{w}\|_2^2 \right]</math>

The EM algorithm (also called the Baum-Welch algorithm) for unsupervised training of conventional HMMs can be adapted for featurized HMMs. The E-step calculates the expected count <math>e_{d,c,t}</math> for each tuple <math>(d,c,t)</math>, which makes use of the forward-backward algorithm. The M-step tries to increase the following auxiliary function:

<math>\ell(\mathbf{w}, \mathbf{e}) = \sum_{d,c,t} e_{d,c,t} \log \theta_{d,c,t}(\mathbf{w})</math>

whose gradient w.r.t. <math>\mathbf{w}</math> is:

<math>\nabla_{\mathbf{w}} \ell(\mathbf{w}, \mathbf{e}) = \sum_{d,c,t} e_{d,c,t} \left( \mathbf{f}(d,c,t) - \sum_{d'} \theta_{d',c,t}(\mathbf{w}) \, \mathbf{f}(d',c,t) \right)</math>
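
The following sketch (my reconstruction, not the authors' code) evaluates the auxiliary function and this gradient for a single (context, type) block, with hypothetical dense features and expected counts:

<pre>
import numpy as np

def aux_and_grad(w, feats, e):
    """feats[d] = f(d, c, t); e[d] = expected count of (d, c, t) from the E-step.
       Returns the auxiliary function value and its gradient w.r.t. w."""
    scores = feats @ w
    scores -= scores.max()
    theta = np.exp(scores)
    theta /= theta.sum()                       # theta_{d,c,t}(w)
    aux = float(e @ np.log(theta))             # sum_d e_d log theta_d
    # gradient: sum_d e_d ( f(d,c,t) - sum_{d'} theta_{d'} f(d',c,t) )
    grad = e @ feats - e.sum() * (theta @ feats)
    return aux, grad

rng = np.random.default_rng(0)
feats = rng.normal(size=(3, 4))                # hypothetical feature vectors
e = np.array([2.0, 0.5, 1.5])                  # hypothetical expected counts
aux, grad = aux_and_grad(rng.normal(size=4), feats, e)
print(aux, grad)
</pre>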

Two versions of the EM algorithm are proposed in Berg-Kirkpatrick, ACL 2010:

[Figure: Featurized HMM EM.png — pseudocode of the two EM variants (Algorithm 1 and Algorithm 2)]

The subroutine "climb" represents one step of any generic optimization algorithm, such as gradient ascent or L-BFGS.

The differences between the two algorithms are:

  • In the M-step, Algorithm 1 iterates until the auxiliary function converges to a local maximum, while Algorithm 2 performs only one iteration.
  • The M-step of Algorithm 2 is formulated to increase the objective function directly, instead of increasing the auxiliary function. Nevertheless, the appendix of the paper gives a proof that the gradients of the objective and auxiliary functions are identical when the expected counts are computed at the current parameters.

Algorithm 2 is shown to perform better [Berg-Kirkpatrick, ACL 2010]. It can also be expected to converge faster: since each E-step changes the auxiliary function by updating the expected counts, there is little point in fully maximizing the auxiliary function within each iteration (this sentence is the opinion of Maigo). However, in cases where the E-step is significantly more time-consuming than the M-step, Algorithm 1 may be preferred.
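
To make the contrast concrete, here is a rough sketch of the two loop structures (a paraphrase of the idea, not the paper's pseudocode); e_step and climb are placeholder callables.

<pre>
def em_algorithm_1(w, e_step, climb, num_iters, m_step_iters):
    """Algorithm 1: optimize the auxiliary function (almost) to convergence in each M-step."""
    for _ in range(num_iters):
        e = e_step(w)                  # expected counts via forward-backward
        for _ in range(m_step_iters):  # many climb steps on the auxiliary function
            w = climb(w, e)
    return w

def em_algorithm_2(w, e_step, climb, num_iters):
    """Algorithm 2: a single climb step per EM iteration."""
    for _ in range(num_iters):
        e = e_step(w)
        w = climb(w, e)                # one gradient-ascent / L-BFGS step
    return w

# Dummy stand-ins, just to show the calling convention:
print(em_algorithm_2(0.0, lambda w: None, lambda w, e: w + 0.1, num_iters=5))
</pre>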


Comments

Something I realized about featurized HMMs versus CRFs: I think CRF decoding is faster. For a featurized HMM, you have to compute a per-word partition function for every word, where the partition function sums over the vocabulary. Not trivial. But a CRF doesn't have this; Viterbi just does scoring without having to do local normalization. So a CRF's global normalization makes learning trickier but decoding easier, while a featurized HMM's local normalization makes learning easier but decoding harder. Hmmm. --Brendan 18:50, 13 October 2011 (UTC)