Berg-Kirkpatrick et al, ACL 2010: Painless Unsupervised Learning with Features

 
== Citation ==
 
T. Berg-Kirkpatrick, A. Bouchard-Côté, J. DeNero and D. Klein. '''Painless Unsupervised Learning with Features''', ''Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the ACL'', pp. 582-590, Los Angeles, June 2010.
  
 
== Online Version ==
 
PDF version
 
== Summary ==
 
This [[Category::paper]] generalizes conventional [[UsesMethod::HMM|HMMs]] to [[UsesMethod::Featurized_HMM|featurized HMMs]], by replacing the multinomial conditional probability distributions (CPDs) with miniature log-linear models. Two algorithms for unsupervised training of featurized HMMs are proposed.
  
 
Featurized HMMs are applied to four unsupervised learning tasks:
 
* [[AddressesProblem::Part_of_Speech_Induction|POS induction]] (unsupervised version of [[Part_of_Speech_Tagging|POS tagging]]);
* [[AddressesProblem::Grammar_Induction|Grammar induction]];
* [[AddressesProblem::Word_Alignment|Word alignment]];
* [[AddressesProblem::Word_Segmentation|Word segmentation]].
 
For all four tasks, featurized HMMs are shown to outperform their unfeaturized counterparts by a substantial margin.
  
== Method ==

This paper proposes the concept of [[UsesMethod::Featurized_HMM|featurized HMMs]] and two algorithms for their unsupervised training; both are summarized below, and a more detailed elaboration can be found on the page [[UsesMethod::Featurized_HMM|Featurized HMMs]].
  
The paper also comes up with featurized versions of other HMM-like models, e.g. the [[UsesMethod::Dependency Model with Valence]] for [[AddressesProblem::Grammar_Induction|grammar induction]].

=== Definition ===

The definition of a featurized HMM is given in the context of POS tagging, but it applies to the general case as well. In the following, <math>\mathbf{Y} = \{Y_1, \ldots, Y_n\}</math> stands for the random sequence of words, and <math>\mathbf{Z} = \{Z_1, \ldots, Z_n\}</math> stands for the random sequence of POS tags. Their lower-case counterparts stand for possible values of the words and POS tags.

Assuming a special starting tag <math>z_0</math>, an HMM defines a joint probability over <math>\mathbf{Z}</math> and <math>\mathbf{Y}</math>:

<math>
P(\mathbf{Z} = \mathbf{z}, \mathbf{Y} = \mathbf{y}) = \prod_{i=1}^n \left[ P(Z_i = z_i | Z_{i-1} = z_{i-1}) \cdot P(Y_i = y_i | Z_i = z_i) \right]
</math>

While a conventional HMM defines the transition probabilities <math>P(Z_i | Z_{i-1})</math> and emission probabilities <math>P(Y_i | Z_i)</math> as multinomial distributions, a featurized HMM defines these probabilities in terms of a collection of features <math>\mathbf{f}(d, c, t)</math> and their weights <math>\mathbf{w}</math>:

<math>
P_\mathbf{w}(Z_i = z_2 | Z_{i-1} = z_1) = \theta_{z_2,z_1,\text{TRANS}}(\mathbf{w}) = \frac {\exp \mathbf{w}^T \mathbf{f}(z_2,z_1,\text{TRANS})} {\sum_{z_2'} \exp \mathbf{w}^T \mathbf{f}(z_2',z_1,\text{TRANS})}
</math>

<math>
P_\mathbf{w}(Y_i = y | Z_i = z) = \theta_{y,z,\text{EMIT}}(\mathbf{w}) = \frac {\exp \mathbf{w}^T \mathbf{f}(y,z,\text{EMIT})} {\sum_{y'} \exp \mathbf{w}^T \mathbf{f}(y',z,\text{EMIT})}
</math>

For conciseness, the tuples <math>z_2,z_1,\text{TRANS}</math> and <math>y,z,\text{EMIT}</math> are written as <math>d,c,t</math> (standing for ''decision'', ''context'', and ''type'' respectively):

<math>
\theta_{d,c,t}(\mathbf{w}) = \frac {\exp \mathbf{w}^T \mathbf{f}(d,c,t)} {\sum_{d'} \exp \mathbf{w}^T \mathbf{f}(d',c,t)}
</math>

A conventional HMM is a special case of a featurized HMM (as long as no transition or emission probability is zero). We recover a conventional HMM if we:
* use the set of "basic" features, i.e. indicator functions of all tuples <math>z_2,z_1,\text{TRANS}</math> and <math>y,z,\text{EMIT}</math>;
* set the weights of the features to the logarithms of the corresponding transition and emission probabilities.
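
To make the role of the features concrete, the following is a minimal sketch (not code from the paper) of how the local multinomials <math>\theta_{d,c,t}(\mathbf{w})</math> can be computed as softmaxes over feature scores. The toy tag/word inventories and the extra suffix feature are illustrative assumptions only.

<pre>
import numpy as np
from collections import defaultdict

# Toy tag and word inventories (illustrative, not the paper's data).
TAGS = ["DT", "NN", "VB"]
WORDS = ["the", "dog", "runs"]

def features(d, c, t):
    """Feature names for the tuple (decision d, context c, type t).
    The first is the "basic" indicator feature of the whole tuple; the second is
    a toy orthographic feature for emissions, showing how richer features enter."""
    feats = [("BASIC", d, c, t)]
    if t == "EMIT" and d.endswith("s"):
        feats.append(("SUFFIX-s", c, t))
    return feats

def theta(w, c, t, decisions):
    """theta_{d,c,t}(w): softmax of the scores w . f(d,c,t) over competing decisions d."""
    scores = np.array([sum(w[f] for f in features(d, c, t)) for d in decisions])
    scores -= scores.max()                       # numerical stability
    expscores = np.exp(scores)
    return dict(zip(decisions, expscores / expscores.sum()))

w = defaultdict(float)                           # all weights start at zero
print(theta(w, "DT", "TRANS", TAGS))             # uniform transition CPD P(. | DT)

# Recover a conventional HMM: use only the BASIC features and set each weight to
# the log of the corresponding multinomial probability.
for word, p in {"the": 0.8, "dog": 0.1, "runs": 0.1}.items():
    w[("BASIC", word, "DT", "EMIT")] = np.log(p)
print(theta(w, "DT", "EMIT", WORDS))             # gives back 0.8 / 0.1 / 0.1
</pre>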
  
=== The Estimation Problem ===

The estimation problem involves finding the marginal probability <math>P_\mathbf{w}(\mathbf{Y} = \mathbf{y})</math> for a given word sequence <math>\mathbf{y}</math>, with the model parameters <math>\mathbf{w}</math> known.

The forward-backward algorithm for conventional HMMs is directly applicable to featurized HMMs.
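
As a concrete illustration, here is a minimal forward-pass sketch (an assumption-laden example, not the authors' code) that computes <math>P_\mathbf{w}(\mathbf{Y} = \mathbf{y})</math> once the transition and emission CPDs have been tabulated, e.g. with a <code>theta</code> function like the one sketched above. The toy matrices are made up for the example.

<pre>
import numpy as np

def forward_marginal(trans, emit, obs, start_tag=0):
    """Forward algorithm: returns P(Y = obs), where
    trans[z1, z2] = P(Z_i = z2 | Z_{i-1} = z1) and emit[z, y] = P(Y_i = y | Z_i = z).
    It does not matter here whether these local probabilities come from
    multinomials or from the featurized (log-linear) parameterization.
    (Real implementations rescale alpha or work in log space to avoid underflow.)"""
    alpha = trans[start_tag, :] * emit[:, obs[0]]     # alpha_1(z)
    for y in obs[1:]:
        alpha = (alpha @ trans) * emit[:, y]          # alpha_i(z)
    return alpha.sum()

# Toy example: tag 0 is the special start tag z_0; two word types.
trans = np.array([[0.0, 0.6, 0.4],
                  [0.0, 0.7, 0.3],
                  [0.0, 0.4, 0.6]])
emit = np.array([[0.5, 0.5],      # the start tag's emissions never matter
                 [0.9, 0.1],      # (nothing transitions into it)
                 [0.2, 0.8]])
print(forward_marginal(trans, emit, obs=[0, 1, 1]))
</pre>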
 
 
 
=== The Decoding Problem ===

The decoding problem involves finding the most probable tag sequence <math>\mathbf{\hat{z}}</math> given the word sequence <math>\mathbf{y}</math>, with the model parameters <math>\mathbf{w}</math> known:

<math>
\mathbf{\hat{z}} = \underset{\mathbf{z}}{\operatorname{arg\,max}} \, P_\mathbf{w}(\mathbf{Z} = \mathbf{z} | \mathbf{Y} = \mathbf{y})
</math>

The Viterbi algorithm for conventional HMMs is directly applicable to featurized HMMs.
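
A corresponding Viterbi sketch, again only an illustration under the same assumed <code>trans</code>/<code>emit</code> conventions as the forward sketch above:

<pre>
import numpy as np

def viterbi(trans, emit, obs, start_tag=0):
    """Most probable tag sequence for the observed words, computed in log space.
    Whether the local CPDs are multinomial or featurized is again irrelevant."""
    n, K = len(obs), trans.shape[0]
    logt = np.log(trans + 1e-300)                 # avoid log(0)
    loge = np.log(emit + 1e-300)
    delta = logt[start_tag, :] + loge[:, obs[0]]
    back = np.zeros((n, K), dtype=int)
    for i in range(1, n):
        scores = delta[:, None] + logt            # scores[z_prev, z_cur]
        back[i] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + loge[:, obs[i]]
    best = [int(delta.argmax())]
    for i in range(n - 1, 0, -1):
        best.append(int(back[i, best[-1]]))
    return best[::-1]

# Reusing the toy matrices from the forward sketch:
trans = np.array([[0.0, 0.6, 0.4], [0.0, 0.7, 0.3], [0.0, 0.4, 0.6]])
emit = np.array([[0.5, 0.5], [0.9, 0.1], [0.2, 0.8]])
print(viterbi(trans, emit, obs=[0, 1, 1]))        # -> [1, 2, 2]
</pre>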
 
 
 
=== The Training Problem ===

The training problem involves finding a set of model parameters <math>\mathbf{\hat{w}}</math> that maximizes the likelihood of the training data <math>\mathbf{y}</math>:

<math>
\mathbf{\hat{w}} = \underset{\mathbf{w}}{\operatorname{arg\,max}} \, P_\mathbf{w}(\mathbf{Y} = \mathbf{y})
</math>

Alternatively, the objective function can be a regularized version of the log-likelihood:
 
 
 
<math>
L(\mathbf{w}) = \log P_\mathbf{w}(\mathbf{Y} = \mathbf{y}) - \kappa ||\mathbf{w}||_2^2
</math>

The EM algorithm for unsupervised training of conventional HMMs can be adapted for featurized HMMs. The E-step calculates the expected count <math>e_{d,c,t}</math> of each tuple <math>d,c,t</math> using the forward-backward algorithm. The M-step tries to increase the following auxiliary function:

<math>
\ell(\mathbf{w}, \mathbf{e}) = \sum_{d,c,t} e_{d,c,t} \log \theta_{d,c,t}(\mathbf{w}) - \kappa ||\mathbf{w}||_2^2
</math>

whose gradient w.r.t. <math>\mathbf{w}</math> is:

<math>
\nabla_\mathbf{w} \ell(\mathbf{w}, \mathbf{e}) = \sum_{d,c,t} e_{d,c,t} \left[ \mathbf{f}(d,c,t) - \sum_{d'} \theta_{d',c,t}(\mathbf{w}) \mathbf{f}(d',c,t) \right] - 2 \kappa \cdot \mathbf{w}
</math>
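
The following sketch (an illustration under the conventions of the earlier <code>features</code> sketch, not the authors' code) computes the regularized auxiliary function and its gradient; this is the quantity a gradient-based optimizer such as L-BFGS would be given in the M-step. The dictionaries for weights, expected counts, and contexts are assumed representations.

<pre>
import numpy as np

def ell_and_gradient(w, ecounts, contexts, feat_fn, kappa=1.0):
    """Auxiliary function ell(w, e) and its gradient.

    w        : dict feature name -> weight
    ecounts  : dict (d, c, t) -> expected count e_{d,c,t} from the E-step
    contexts : dict (c, t) -> list of competing decisions d'
    feat_fn  : feature function, feat_fn(d, c, t) -> list of feature names
    """
    ell = -kappa * sum(v * v for v in w.values())
    grad = {name: -2.0 * kappa * v for name, v in w.items()}

    for (c, t), decisions in contexts.items():
        # theta_{.,c,t}(w): softmax over the scores w . f(d, c, t)
        scores = np.array([sum(w.get(f, 0.0) for f in feat_fn(d, c, t))
                           for d in decisions])
        scores -= scores.max()
        probs = np.exp(scores)
        probs /= probs.sum()

        total = sum(ecounts.get((d, c, t), 0.0) for d in decisions)
        for d, p in zip(decisions, probs):
            e = ecounts.get((d, c, t), 0.0)
            ell += e * np.log(p + 1e-300)
            # e_{d,c,t} * f(d,c,t)  minus  (sum_d e_{d,c,t}) * theta_{d,c,t}(w) * f(d,c,t)
            for f in feat_fn(d, c, t):
                grad[f] = grad.get(f, 0.0) + e - total * p
    return ell, grad
</pre>

With only the basic indicator features and no regularization, setting this gradient to zero recovers the usual count-and-normalize M-step of EM; with richer features there is no closed-form M-step, which is why a gradient-based optimizer is used instead.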
 
 
 
Two versions of the EM algorithm are proposed in the paper:
* '''Algorithm 1''' (EM with a gradient-based M-step): each E-step computes the expected counts with forward-backward, and each M-step optimizes the auxiliary function <math>\ell(\mathbf{w}, \mathbf{e})</math> with a gradient-based optimizer (L-BFGS in the paper) before the counts are recomputed;
* '''Algorithm 2''' (direct gradient): the expected counts are recomputed after every single gradient step, which amounts to climbing the gradient of the regularized log marginal likelihood <math>L(\mathbf{w})</math> directly.
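
Schematically, the two procedures differ only in how often the expected counts are refreshed. The sketch below is illustrative: the <code>e_step</code> and <code>ell_grad</code> arguments are assumed helpers (e.g. forward-backward expected counts and the <code>ell_and_gradient</code> function above), and plain gradient ascent stands in for the L-BFGS updates used in the paper.

<pre>
def train_em(w, e_step, ell_grad, num_iters=20, m_step_iters=10, lr=0.1):
    """Algorithm 1 (EM with a gradient-based M-step): the expected counts are
    held fixed while the M-step improves the auxiliary function ell(w, e)."""
    for _ in range(num_iters):
        e = e_step(w)                          # E-step: expected counts
        for _ in range(m_step_iters):          # M-step: optimize ell(., e)
            _, grad = ell_grad(w, e)
            for name, g in grad.items():
                w[name] = w.get(name, 0.0) + lr * g
    return w

def train_direct(w, e_step, ell_grad, num_iters=200, lr=0.1):
    """Algorithm 2 (direct gradient): the counts are recomputed before every
    gradient step, so each step follows the gradient of the regularized
    log marginal likelihood L(w) itself."""
    for _ in range(num_iters):
        e = e_step(w)
        _, grad = ell_grad(w, e)
        for name, g in grad.items():
            w[name] = w.get(name, 0.0) + lr * g
    return w
</pre>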
 
 
 
 
 
 
 
== Experiments ==

=== POS Induction ===

POS induction is the unsupervised version of [[Part_of_Speech_Tagging|POS tagging]]. The output consists of clusters of words that the system believes to belong to the same part of speech. In order to evaluate the performance of a POS inducer, it is necessary to map the clusters onto the actual POS tags; the best accuracy achieved over all such many-to-one mappings is called the "many-1 accuracy".
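
Many-1 accuracy can be computed with a few lines of code; the sketch below (an illustration, not the paper's evaluation script) maps each induced cluster to the gold tag it co-occurs with most often, which realizes the best many-to-one mapping.

<pre>
from collections import Counter

def many_to_one_accuracy(pred_clusters, gold_tags):
    """Best accuracy over all many-to-one mappings from clusters to gold tags:
    each cluster is relabeled with its most frequent gold tag (several clusters
    may share a tag), and the relabeling is scored against the gold sequence."""
    cooc = {}
    for c, g in zip(pred_clusters, gold_tags):
        cooc.setdefault(c, Counter())[g] += 1
    mapping = {c: counts.most_common(1)[0][0] for c, counts in cooc.items()}
    correct = sum(mapping[c] == g for c, g in zip(pred_clusters, gold_tags))
    return correct / len(gold_tags)

# Toy example: two induced clusters over six tokens -> 4/6 correct.
print(many_to_one_accuracy([0, 0, 1, 1, 0, 1],
                           ["DT", "NN", "VB", "VB", "DT", "NN"]))   # 0.666...
</pre>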
 
 
 
{| border=1 cellpadding=3
! align=right | Dataset
| [[UsesDataset::Penn Treebank English WSJ]]
|-
! align=right | Criterion
| Many-1 accuracy (the larger, the better)
|-
! align=right | Baseline
| 63.1 ± 1.3 (HMM) (10 runs, mean ± standard deviation)
|-
! align=right | Performance of proposed systems
| 68.1 ± 1.7 (Featurized HMM, Algorithm 1) <br> 75.5 ± 1.1 (Featurized HMM, Algorithm 2)
|-
! align=right | Performance of contrastive systems
| 59.6 ± 6.9 (Featurized MRF) <nowiki>[</nowiki>[[RelatedPaper::Haghighi and Klein, ACL 2006: Prototype-Driven Learning for Sequence Models|Haghighi and Klein, ACL 2006]]<nowiki>]</nowiki>
|}

=== Grammar Induction ===

{| border=1 cellpadding=3
! align=right | Dataset
| English: [[UsesDataset::Penn Treebank English WSJ]] <br> Chinese: [[UsesDataset::Penn Treebank Chinese]]
|-
! align=right | Criterion
| Accuracy (the larger, the better; the exact definition is not made explicit)
|-
! align=right | Baseline
| English 47.8, Chinese 42.5 (DMV)
|-
! align=right | Performance of proposed systems
| English 48.3, Chinese 49.9 (Featurized DMV, Algorithm 1) <br> English 63.0, Chinese 53.6 (Featurized DMV, Algorithm 2)
|-
! align=right | Performance of contrastive systems
| English 61.3, Chinese 51.9 <nowiki>[</nowiki>[[RelatedPaper::Cohen and Smith, ACL 2009: Shared Logistic Normal Distributions for Soft Parameter Tying in Unsupervised Grammer Induction|Cohen and Smith, ACL 2009]]<nowiki>]</nowiki>
|}

=== Word Alignment ===

{| border=1 cellpadding=3
! align=right | Dataset
| [[UsesDataset::NIST 2002 Chinese-English Development Set]]
|-
! align=right | Criterion
| Alignment Error Rate (the smaller, the better)
|-
! align=right | Baseline
| 38.0 (Model 1) <nowiki>[</nowiki>[[RelatedPaper::Brown et al, CL 1994: The Mathematics of Statistical Machine Translation: Parameter Estimation|Brown et al, CL 1994]]<nowiki>]</nowiki> <br> 33.8 (HMM) <nowiki>[</nowiki>[[RelatedPaper::Ney and Vogel, CL 1996: HMM-Based Word Alignment in Statistical Translation|Ney and Vogel, CL 1996]]<nowiki>]</nowiki>
|-
! align=right | Performance of proposed systems
| 35.6 (Featurized Model 1, Algorithm 1) <br> 30.0 (Featurized HMM, Algorithm 1)
|}

=== Word Segmentation ===

{| border=1 cellpadding=3
! align=right | Dataset
| [[UsesDataset::Bernstein-Ratner Corpus]]
|-
! align=right | Criterion
| Segment F1 score (the larger, the better)
|-
! align=right | Baseline
| 76.9 ± 0.1 (Unigram) (10 runs, mean ± standard deviation)
|-
! align=right | Performance of proposed systems
| 84.5 ± 0.5 (Featurized Unigram, Algorithm 1) <br> 88.0 ± 0.1 (Featurized Unigram, Algorithm 2)
|-
! align=right | Performance of contrastive systems
| 87 <nowiki>[</nowiki>[[RelatedPaper::Johnson and Goldwater, ACL 2009: Improving Non-Parametric Bayesian Inerence: Experiments on Unsupervised Word Segmentation with Adaptor Grammars|Johnson and Goldwater, ACL 2009]]<nowiki>]</nowiki>
|}
