Berg-Kirkpatrick et al, ACL 2010: Painless Unsupervised Learning with Features

Citation

T. Berg-Kirkpatrick, A. Bouchard-Côté, J. DeNero, and D. Klein. Painless Unsupervised Learning with Features. Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 582–590, Los Angeles, June 2010.

Online Version

PDF version

Summary

This paper generalizes conventional HMMs to featurized HMMs by replacing the multinomial conditional probability distributions (CPDs) with miniature log-linear models. Two algorithms for unsupervised training of featurized HMMs are proposed.

Featurized HMMs are applied to four unsupervised learning tasks:

  • POS induction (the unsupervised version of POS tagging);
  • Grammar induction;
  • Word alignment;
  • Word segmentation.

On all four tasks, featurized HMMs are shown to outperform their unfeaturized counterparts by a substantial margin.

Featurized HMMs

Definition

The definition of featurized HMMs is given in the context of POS tagging, but is applicable to the general case. In the following text, <math>\mathbf{X} = X_1 \ldots X_n</math> stands for the random sequence of words, and <math>\mathbf{Y} = Y_1 \ldots Y_n</math> stands for the random sequence of POS tags. Their lower-case counterparts <math>\mathbf{x}</math> and <math>\mathbf{y}</math> stand for possible values of the words and POS tags.

Assuming a special starting tag <math>y_0</math>, an HMM defines a joint probability over <math>\mathbf{X}</math> and <math>\mathbf{Y}</math>:

<math>P(\mathbf{X} = \mathbf{x}, \mathbf{Y} = \mathbf{y}) = \prod_{i=1}^{n} P(Y_i = y_i \mid Y_{i-1} = y_{i-1}) \cdot P(X_i = x_i \mid Y_i = y_i)</math>

While a conventional HMM defines the transition probabilities <math>P(Y_i = y_i \mid Y_{i-1} = y_{i-1})</math> and emission probabilities <math>P(X_i = x_i \mid Y_i = y_i)</math> as multinomial distributions, a featurized HMM defines these probabilities in terms of a collection of features <math>\mathbf{f}</math> and their weights <math>\mathbf{w}</math>:

<math>P(Y_i = y \mid Y_{i-1} = y') = \frac{\exp \langle \mathbf{w}, \mathbf{f}(y, y', \text{TRANS}) \rangle}{\sum_{y''} \exp \langle \mathbf{w}, \mathbf{f}(y'', y', \text{TRANS}) \rangle}</math>

<math>P(X_i = x \mid Y_i = y) = \frac{\exp \langle \mathbf{w}, \mathbf{f}(x, y, \text{EMIT}) \rangle}{\sum_{x'} \exp \langle \mathbf{w}, \mathbf{f}(x', y, \text{EMIT}) \rangle}</math>

For conciseness, the tuples <math>(y, y', \text{TRANS})</math> and <math>(x, y, \text{EMIT})</math> are written as <math>(d, c, t)</math> (standing for decision, context, and type respectively):

<math>\theta_{d,c,t}(\mathbf{w}) = \frac{\exp \langle \mathbf{w}, \mathbf{f}(d, c, t) \rangle}{\sum_{d'} \exp \langle \mathbf{w}, \mathbf{f}(d', c, t) \rangle}</math>

A conventional HMM is a special case of a featurized HMM (as long as no zero transition or emission probability is involved). We get a conventional HMM if we:

  • Use the set of "basic" features, i.e. indicator functions <math>\mathbf{1}[d \wedge c \wedge t]</math> of all tuples <math>(y, y', \text{TRANS})</math> and <math>(x, y, \text{EMIT})</math>;
  • Set the weights of the features to the logarithms of the corresponding transition or emission probabilities.
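
As a concrete illustration of these two points, here is a minimal sketch in Python (not from the paper; the <code>theta</code>, <code>basic_feats</code>, and weight-dictionary interfaces are assumptions made for this example). It computes the log-linear CPD <math>\theta_{d,c,t}(\mathbf{w})</math> as a softmax over feature scores and shows that basic indicator features with log-probability weights reproduce a multinomial transition distribution.

<pre>
import math

def theta(w, feats, decisions, c, t):
    """Log-linear CPD theta_{d,c,t}(w): a softmax over all decisions d'
    competing in context c of type t (TRANS or EMIT).
    `feats(d, c, t)` returns a dict {feature_name: value}; `w` maps
    feature names to weights.  Both interfaces are hypothetical."""
    scores = {d: sum(w.get(name, 0.0) * v for name, v in feats(d, c, t).items())
              for d in decisions}
    m = max(scores.values())                      # stabilize the softmax
    exps = {d: math.exp(s - m) for d, s in scores.items()}
    z = sum(exps.values())
    return {d: e / z for d, e in exps.items()}

# "Basic" features: one indicator per (d, c, t) tuple.
def basic_feats(d, c, t):
    return {("BASIC", d, c, t): 1.0}

# With basic features and weights set to log-probabilities, the featurized
# HMM reproduces a conventional multinomial HMM (here, one transition row).
trans_probs = {("DT", "<s>"): 0.7, ("NN", "<s>"): 0.3}
w = {("BASIC", d, c, "TRANS"): math.log(p) for (d, c), p in trans_probs.items()}

print(theta(w, basic_feats, ["DT", "NN"], "<s>", "TRANS"))
# -> DT ≈ 0.7, NN ≈ 0.3: the multinomial transition row is recovered
</pre>

Richer features (suffixes, capitalization, digit patterns, etc., as used in the paper) simply add more entries to the returned feature dictionaries, tying parameters across related tuples.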

The Estimation Problem

The estimation problem involves finding the marginal probability <math>P(\mathbf{X} = \mathbf{x})</math> of a given word sequence <math>\mathbf{x}</math>, with the model parameters <math>\mathbf{w}</math> known:

<math>P(\mathbf{X} = \mathbf{x}) = \sum_{\mathbf{y}} P(\mathbf{X} = \mathbf{x}, \mathbf{Y} = \mathbf{y})</math>

The forward-backward algorithm for conventional HMMs is directly applicable to featurized HMMs.
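
As an illustration only (not the paper's code), a minimal forward pass in Python/NumPy; the dense <code>trans</code>, <code>emit</code>, and <code>start</code> arrays are assumed to hold the probabilities produced by the CPDs above, whether multinomial or featurized.

<pre>
import numpy as np

def forward_marginal(trans, emit, obs, start):
    """Forward algorithm: P(x) = sum over y of P(x, y) for an HMM.
    trans[i, j] = P(Y_t = j | Y_{t-1} = i)   (K x K)
    emit[j, o]  = P(X_t = o | Y_t = j)       (K x V)
    start[j]    = P(Y_1 = j | Y_0 = start tag)
    obs         = list of word indices."""
    alpha = start * emit[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ trans) * emit[:, o]
    return alpha.sum()
</pre>

In practice the recursion is carried out in log space, or with per-position renormalization, to avoid numerical underflow on long sentences.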

The Decoding Problem

The decoding problem involves finding the most probable tag sequence <math>\hat{\mathbf{y}}</math> given the word sequence <math>\mathbf{x}</math>, with the model parameters <math>\mathbf{w}</math> known:

<math>\hat{\mathbf{y}} = \arg\max_{\mathbf{y}} P(\mathbf{Y} = \mathbf{y} \mid \mathbf{X} = \mathbf{x})</math>

The Viterbi algorithm for conventional HMMs is directly applicable to featurized HMMs.
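
A matching Viterbi sketch under the same assumed array representation (again illustrative, and assuming no zero probabilities so the logarithms stay finite):

<pre>
import numpy as np

def viterbi(trans, emit, obs, start):
    """Most probable tag sequence argmax_y P(y | x), using the same dense
    trans/emit/start arrays as in the forward sketch above."""
    delta = np.log(start) + np.log(emit[:, obs[0]])
    backptr = []
    for o in obs[1:]:
        scores = delta[:, None] + np.log(trans)   # scores[i, j]: from tag i to tag j
        backptr.append(scores.argmax(axis=0))
        delta = scores.max(axis=0) + np.log(emit[:, o])
    tags = [int(delta.argmax())]
    for bp in reversed(backptr):
        tags.append(int(bp[tags[-1]]))
    return tags[::-1]
</pre>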

The Training Problem

The training problem involves finding a set of model parameters <math>\hat{\mathbf{w}}</math> to maximize the likelihood of the training data <math>\mathbf{x}</math>:

<math>\hat{\mathbf{w}} = \arg\max_{\mathbf{w}} \log P(\mathbf{X} = \mathbf{x} \mid \mathbf{w})</math>

Or, the objective function can be a regularized version of the log-likelihood:

<math>\hat{\mathbf{w}} = \arg\max_{\mathbf{w}} \left[ \log P(\mathbf{X} = \mathbf{x} \mid \mathbf{w}) - \kappa \|\mathbf{w}\|_2^2 \right]</math>

The EM algorithm (also called the Baum-Welch algorithm in the HMM setting) for unsupervised training of conventional HMMs can be adapted for featurized HMMs. The E-step calculates the expected count <math>e_{d,c,t}</math> for each tuple <math>(d, c, t)</math>, which makes use of the forward-backward algorithm. The M-step tries to increase the following auxiliary function:

<math>\ell(\mathbf{w}, \mathbf{e}) = \sum_{d,c,t} e_{d,c,t} \log \theta_{d,c,t}(\mathbf{w}) - \kappa \|\mathbf{w}\|_2^2</math>

whose gradient w.r.t. <math>\mathbf{w}</math> is:

<math>\nabla_{\mathbf{w}} \ell(\mathbf{w}, \mathbf{e}) = \sum_{d,c,t} e_{d,c,t} \left( \mathbf{f}(d, c, t) - \sum_{d'} \theta_{d',c,t}(\mathbf{w}) \, \mathbf{f}(d', c, t) \right) - 2 \kappa \mathbf{w}</math>
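
The sketch below (illustrative, not the paper's implementation) evaluates this gradient given the expected counts from the E-step; it reuses the hypothetical <code>theta</code> and feature-function interfaces from the earlier sketch.

<pre>
from collections import defaultdict

def auxiliary_gradient(w, feats, contexts, expected_counts, kappa=0.0):
    """Gradient of l(w, e) = sum_{d,c,t} e_{d,c,t} log theta_{d,c,t}(w) - kappa ||w||^2.
    `contexts` maps (c, t) to the list of competing decisions d;
    `expected_counts` maps (d, c, t) to e_{d,c,t} from the E-step."""
    grad = defaultdict(float)
    for (c, t), decisions in contexts.items():
        probs = theta(w, feats, decisions, c, t)          # softmax from the earlier sketch
        # expected feature vector under the current model for this context
        model_expect = defaultdict(float)
        for d in decisions:
            for name, v in feats(d, c, t).items():
                model_expect[name] += probs[d] * v
        total = 0.0
        for d in decisions:
            e = expected_counts.get((d, c, t), 0.0)
            total += e
            for name, v in feats(d, c, t).items():
                grad[name] += e * v                       # expected-count-weighted features
        for name, v in model_expect.items():
            grad[name] -= total * v                       # minus model expectation
    for name, weight in w.items():
        grad[name] -= 2.0 * kappa * weight                # L2 regularization term
    return dict(grad)
</pre>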

Two versions of the EM algorithm are proposed in the paper:

[Figure: pseudocode of Algorithm 1 and Algorithm 2 (original image: Featurized HMM EM.png)]

The subroutine "climb" represents one step in any generic optimization algorithm, such as gradient ascent or LBFGS.

The differences between the two algorithms are:

  • In the M-step, Algorithm 1 iterates until the auxiliary function converges to a local maximum, while Algorithm 2 performs only one iteration.
  • The M-step of Algorithm 2 is formulated to increase the objective function directly, instead of increasing the auxiliary function. Nevertheless, the appendix gives a proof that the gradients of the objective and auxiliary functions are identical.

Algorithm 2 is shown to perform better. It can also be expected to converge faster: since each E-step changes the auxiliary function by updating the expected counts, there is little point in driving the auxiliary function all the way to a local maximum within each iteration (this observation is the opinion of Maigo). However, in cases where the E-step is significantly more time-consuming than the M-step, Algorithm 1 may be preferred.
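
To make the contrast concrete, here is a schematic EM loop (illustrative only; <code>e_step</code> is an assumed callback that runs forward-backward with the current weights and returns the expected counts <math>e_{d,c,t}</math>, and <code>auxiliary_gradient</code> is the sketch above):

<pre>
def em_train(w, feats, contexts, e_step, kappa, iterations=20,
             algorithm=2, inner_steps=50, lr=0.1):
    """Schematic EM for a featurized HMM.  Algorithm 1 takes many "climb"
    steps per M-step (a fixed cap stands in here for a convergence test);
    Algorithm 2 takes exactly one."""
    for _ in range(iterations):
        expected_counts = e_step(w)                          # E-step (forward-backward)
        m_steps = inner_steps if algorithm == 1 else 1
        for _ in range(m_steps):                             # M-step
            grad = auxiliary_gradient(w, feats, contexts, expected_counts, kappa)
            # "climb": one step of any generic ascent method (plain gradient
            # ascent here; the paper uses L-BFGS)
            for name, g in grad.items():
                w[name] = w.get(name, 0.0) + lr * g
    return w
</pre>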

Experiments

POS Induction

POS induction is the unsupervised version of POS tagging: the system outputs clusters of words that it believes to belong to the same part of speech. In order to evaluate a POS inductor, the induced clusters have to be mapped onto the actual POS tags; the highest accuracy achievable over all such mappings, obtained by mapping each cluster to the gold tag it most frequently coincides with, is called the "many-1 accuracy".
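
For concreteness, a small sketch (not from the paper) of the greedy cluster-to-tag mapping behind this score:

<pre>
from collections import Counter

def many_to_one_accuracy(clusters, gold_tags):
    """Map every induced cluster to the gold tag it co-occurs with most
    often, then score the resulting tagging against the gold tags."""
    best = {}
    for c in set(clusters):
        tag_counts = Counter(g for cl, g in zip(clusters, gold_tags) if cl == c)
        best[c] = tag_counts.most_common(1)[0][0]
    correct = sum(best[c] == g for c, g in zip(clusters, gold_tags))
    return correct / len(gold_tags)

# e.g. many_to_one_accuracy([0, 0, 1, 1, 2], ["DT", "DT", "NN", "VB", "NN"]) -> 0.8
</pre>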

Dataset: Penn Treebank WSJ
Criterion: Many-1 accuracy
Baseline: 63.1 ± 1.3 (HMM) (10 runs, mean ± standard deviation)
Performance of proposed systems: 68.1 ± 1.7 (Algorithm 1); 75.5 ± 1.1 (Algorithm 2)
Performance of contrastive systems: 59.6 ± 6.9 (featurized MRF) [Haghighi and Klein, ACL 2006]

Grammar Induction

Dataset: Penn Treebank WSJ (English), Penn Chinese Treebank (Chinese)
Criterion: Accuracy (not clear how it is defined)
Baseline: English 47.8, Chinese 42.5
Performance of proposed systems: English 48.3, Chinese 49.9 (Algorithm 1); English 63.0, Chinese 53.6 (Algorithm 2)
Performance of contrastive systems: English 61.3, Chinese 51.9 [Cohen and Smith, ACL 2009]

Word Alignment

Word Segmentation