Daume ICML 2009

== Citation ==

"Unsupervised Search-based Structured Prediction", Hal Daume III, ICML 2009

== Online Version ==

An online version of the [[Category::Paper|paper]] can be found here [http://www.umiacs.umd.edu/~hal/docs/daume09unsearn.pdf]

== Summary ==

This paper details a methodology for [[AddressesProblem::Unsupervised Learning|unsupervised]] [[UsesMethod::SEARN|SEARN]]. It compares the approach to other methods, first on synthetic data and then on unsupervised [[AddressesProblem::Dependency Parsing|dependency parsing]], and briefly touches on semi-supervised learning with [[UsesMethod::SEARN|SEARN]].

== Algorithm ==

The paper first reviews the basic [[UsesMethod::SEARN|SEARN]] algorithm (see the SEARN wiki article for more background).

In the supervised form, the algorithm learns from a sample space of <math>(x,y)</math> pairs, where <math>x</math> is the input and <math>y</math> is the true output. In the unsupervised case, the algorithm is given only <math>x</math>, and the classifier must still produce <math>y</math>.

The proposed solution is essentially to predict <math>y</math> first and then perform the normal prediction. The loss function depends only on <math>x</math>: although we predict <math>y</math>, we never observe any "true" outputs to score it against.

Given an input <math>x</math>, a sequence of length <math>T</math>, we treat the "true" output as a sequence of length <math>2T</math>: the first <math>T</math> components are drawn from the possible vocabulary for <math>y</math> (that is, the possible outputs) and represent the latent structure (which we will refer to as <math>y</math>), and the last <math>T</math> components are drawn from the vocabulary for the input, <math>x</math>. We can then use [[UsesMethod::SEARN|SEARN]] on this input. It is important to note that the features for these two parts are constructed differently (a small code sketch of the construction follows the list):

* For <math>y</math>, the <math>T</math> latent components, features can be based on <math>x</math> and the partial latent structure (i.e., at <math>y_t</math> we can look at <math>y_1,\ldots,y_{t-1}</math>).
* For <math>x</math>, the <math>T</math> input components, features can be based on <math>x</math> and <math>y</math>. However, the paper notes that the ideal feature set here depends on correctly predicting the latent structure. This makes intuitive sense - if we correctly predict <math>y</math>, then we have an "optimal" label set.
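
Below is a minimal Python sketch of this construction. The names (latent_features, reconstruction_features, make_search_sequence) and the feature templates are illustrative assumptions rather than the paper's implementation; the point is only the shape of the <math>2T</math>-step problem and which information each half may look at.

<pre>
# Sketch of the 2T-step construction (illustrative names, not the paper's code).

def latent_features(x, y_partial, t):
    """Features for the latent symbol y_t: may use all of x and the latent
    symbols predicted so far (y_1, ..., y_{t-1})."""
    prev = y_partial[-1] if y_partial else "BOS"
    return {f"x_t={x[t]}": 1.0, f"prev_y={prev}": 1.0}

def reconstruction_features(x, y_full, t):
    """Features for re-predicting the input symbol x_t: may use x and the full
    predicted latent structure y (good latent predictions make this easy)."""
    prev_x = x[t - 1] if t > 0 else "BOS"
    return {f"y_t={y_full[t]}": 1.0, f"prev_x={prev_x}": 1.0}

def make_search_sequence(x):
    """The 'true' output has length 2T: T latent positions with no observed
    truth, followed by the T input symbols themselves."""
    T = len(x)
    latent_truth = [None] * T        # unobserved; the loss never touches these
    reconstruction_truth = list(x)   # observed; the loss is reconstruction error
    return latent_truth + reconstruction_truth
</pre>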

The optimal (and thus initial) policy <math>\pi^*</math> already predicts the true input components exactly, and it can predict the latent components arbitrarily or randomly. This is enough to run the first iteration of [[UsesMethod::SEARN|SEARN]].

The paper describes the unsupervised algorithm as follows (a schematic sketch follows the list):

* First iteration: Use <math>\pi = \pi^*</math> to classify the sequence of latent and input components. Note that <math>\pi^*</math> produces the latent classifications randomly or arbitrarily, and all of their costs are 0, so they do not induce any update to the classifier. The important piece is that <math>\pi^*</math> ''does'' predict classifications for the input components, and these predictions reproduce the input symbols, taken from the true input.
* Second iteration: Since <math>\pi \neq \pi^*</math> after the update, we are no longer guaranteed zero-cost classifications for the latent components. Because the features for the latent structure are based partially on <math>x</math>, these classifications receive high costs if the classifier performs poorly and low costs if it performs well. As a result, the classifier begins to predict meaningful classifications for the latent components.
* Further iterations move the output closer to the learned classifier and further from the "optimal" policy (which produced the latent classifications arbitrarily).
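
The loop below is a toy schematic of this process, not the paper's algorithm: the cost-sensitive classifier is replaced by a simple per-context lookup table, and the roll-out-based costing of alternative actions is omitted. It is meant only to show the initial policy, the growing interpolation toward the learned policy, and the fact that the loss touches only the reconstruction half.

<pre>
import random
from collections import defaultdict, Counter

def unsupervised_searn(examples, latent_vocab, n_iters=3, beta=0.5, seed=0):
    """Toy schematic of the unsupervised SEARN loop (illustrative assumptions
    throughout; the real algorithm trains a cost-sensitive classifier).

    examples: list of unlabeled input sequences x."""
    rng = random.Random(seed)
    input_vocab = sorted({symbol for x in examples for symbol in x})
    learned = {}  # context -> action, a stand-in for the learned classifier

    def context_of(x, prediction, t, T):
        if t < T:  # latent step: previous latent symbol + current input symbol
            return ("latent", prediction[t - 1] if t > 0 else "BOS", x[t])
        return ("recon", prediction[t - T])  # reconstruction step: latent symbol here

    def pi_star(x, t, T):
        # Initial/"optimal" policy: arbitrary on the T latent steps,
        # exact reconstruction of x on the last T steps.
        return rng.choice(latent_vocab) if t < T else x[t - T]

    def learned_policy(ctx, t, T):
        fallback = latent_vocab if t < T else input_vocab
        return learned.get(ctx, rng.choice(fallback))

    for it in range(n_iters):
        p_learned = 1.0 - (1.0 - beta) ** it  # iteration 0 follows pi* exclusively
        stats = defaultdict(Counter)
        for x in examples:
            T = len(x)
            prediction = []
            for t in range(2 * T):
                ctx = context_of(x, prediction, t, T)
                if rng.random() < p_learned:
                    action = learned_policy(ctx, t, T)
                else:
                    action = pi_star(x, t, T)
                # The loss depends only on x: latent steps cost nothing directly,
                # reconstruction steps cost 1 if they fail to reproduce x.
                # (A full implementation would cost each alternative action by
                # rolling out the remaining sequence; this toy skips that.)
                cost = 0 if t < T else int(action != x[t - T])
                stats[ctx][action] += 1 - cost
                prediction.append(action)
        learned = {ctx: counts.most_common(1)[0][0] for ctx, counts in stats.items()}
    return learned
</pre>

The key behaviours it mirrors are the three bullets above: the first iteration is pure <math>\pi^*</math>, later iterations mix in the learned predictor, and only the reconstruction half of each sequence ever incurs a cost.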

The paper also shows that, under a certain set of definitions, this approach is essentially equivalent to the [[UsesMethod::Expectation-maximization algorithm|EM algorithm]]. It is also worth noting that the unsupervised algorithm no longer carries all of the theoretical guarantees of the supervised algorithm.

== Experiments ==

The paper presents two experiments. The first is a synthetic experiment on data sampled from HMMs, comparing the approach to other methods; the second applies the method to [[AddressesProblem::Dependency Parsing|dependency parsing]] and compares it to other, similar methods.

=== Synthetic Experiments ===

The first experiments generate synthetic data from a first-order HMM and a second-order HMM. The "truth" is produced by using the generating HMM to predict the sequence. The methods compared are: [[UsesMethod::Expectation-maximization algorithm|EM]], [[UsesMethod::SEARN|SEARN]] with [[UsesMethod::HMM|HMM]] features and a [[UsesMethod::Naive Bayes|naive Bayes]] classifier, and [[UsesMethod::SEARN|SEARN]] with a [[UsesMethod::Logistic Regression|logistic regression]] classifier. A sketch of this kind of data generator appears below.
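
The paper does not spell out its generator, so the following is only an assumed minimal first-order HMM sampler of the kind these experiments rely on; the function names, random parameter choices, and sizes are all illustrative. Note that the paper's "Truth" row comes from decoding with the generating HMM itself, which is not implemented here.

<pre>
import random

def make_random_hmm(n_states, n_symbols, rng):
    """Random first-order HMM parameters (illustrative; not the paper's setup)."""
    def dist(n):
        w = [rng.random() for _ in range(n)]
        s = sum(w)
        return [v / s for v in w]
    return {"start": dist(n_states),
            "trans": [dist(n_states) for _ in range(n_states)],
            "emit": [dist(n_symbols) for _ in range(n_states)]}

def sample_sequence(hmm, length, rng):
    """Draw one (hidden states, observations) pair: the observations form the
    unlabeled input, and the hidden states are what the learners try to recover."""
    def draw(d):
        return rng.choices(range(len(d)), weights=d, k=1)[0]
    state = draw(hmm["start"])
    states, obs = [], []
    for _ in range(length):
        states.append(state)
        obs.append(draw(hmm["emit"][state]))
        state = draw(hmm["trans"][state])
    return states, obs

# Example: a K = 5 state HMM over 10 observation symbols, 100 sequences of length 20.
rng = random.Random(0)
hmm = make_random_hmm(n_states=5, n_symbols=10, rng=rng)
data = [sample_sequence(hmm, length=20, rng=rng) for _ in range(100)]
</pre>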

{| class="wikitable" border="1"
|-
! Model
! States
! Truth (Error Rate)
! EM (Error Rate)
! SEARN-NB (Error Rate)
! SEARN-LR (Error Rate)
|-
| 1st Order HMM
| K = 2
| <math>0.227 \pm 0.107</math>
| <math>0.275 \pm 0.128</math>
| <math>0.287 \pm 0.138</math>
| <math>0.276 \pm 0.095</math>
|-
| 1st Order HMM
| K = 5
| <math>0.687 \pm 0.043</math>
| <math>0.678 \pm 0.026</math>
| <math>0.688 \pm 0.025</math>
| <math>0.672 \pm 0.022</math>
|-
| 1st Order HMM
| K = 10
| <math>0.806 \pm 0.035</math>
| <math>0.762 \pm 0.021</math>
| <math>0.771 \pm 0.019</math>
| <math>0.755 \pm 0.019</math>
|-
| 2nd Order HMM
| K = 2
| <math>0.294 \pm 0.072</math>
| <math>0.396 \pm 0.057</math>
| <math>0.408 \pm 0.056</math>
| <math>0.271 \pm 0.057</math>
|-
| 2nd Order HMM
| K = 5
| <math>0.651 \pm 0.068</math>
| <math>0.695 \pm 0.027</math>
| <math>0.710 \pm 0.016</math>
| <math>0.633 \pm 0.018</math>
|-
| 2nd Order HMM
| K = 10
| <math>0.815 \pm 0.032</math>
| <math>0.764 \pm 0.021</math>
| <math>0.771 \pm 0.015</math>
| <math>0.705 \pm 0.019</math>
|}

The first-order HMM results are uninteresting, as all methods are comparable (i.e., within a standard deviation of the optimal algorithm).

On the second-order HMM, the best performer is always [[UsesMethod::SEARN]] with [[UsesMethod::Logistic Regression]], thanks to the richer feature set it can exploit.

=== Dependency Parsing Experiments ===

The more interesting experiment concerns [[AddressesProblem::Dependency Parsing|dependency parsing]]. It uses the [[UsesDataset::WSJ10 corpus|WSJ10 corpus]], with:
* 5301 sentences of training data
* 531 sentences of development data
* 530 sentences of blind test data

The systems compared are as follows (full paper citations are at the end):
* Rand-Gen: a random generative baseline
* Rand-SEARN: the baseline given by the SEARN initial policy
* Two versions of the EM-based model of Klein and Manning:
** K+M:Rand-Init, initialized randomly
** K+M:Smart-Init, initialized intelligently
* Three variants of Smith and Eisner's model (S+E:Length, S+E:DelOrTrans1, S+E:Trans1)
* SEARN: Unsup, unsupervised [[UsesMethod::SEARN]]
* S+E: Sup, Smith and Eisner's supervised model (as an upper bound)
* SEARN: Sup, supervised [[UsesMethod::SEARN]], also as an upper bound

The following table reports the accuracy of each method (a sketch of the accuracy computation appears after the table).

{| class="wikitable" border="1"
|-
! Algorithm
! Training Accuracy
! Test Accuracy
! # of iterations
|-
| Rand-Gen
| <math>23.5 \pm 0.9</math>
| <math>23.5 \pm 1.3</math>
|
|-
| Rand-SEARN
| <math>21.3 \pm 0.2</math>
| <math>21.0 \pm 0.6</math>
|
|-
| K+M:Rand-Init
| <math>23.6 \pm 3.8</math>
| <math>23.6 \pm 4.3</math>
| 63.3
|-
| K+M:Smart-Init
| <math>35.2 \pm 6.6</math>
| <math>35.2 \pm 6.0</math>
| 64.1
|-
| S+E:Length
| <math>33.8 \pm 3.6</math>
| <math>33.7 \pm 5.9</math>
| 173.1
|-
| S+E:DelOrTrans1
| <math>47.3 \pm 6.0</math>
| <math>47.1 \pm 5.9</math>
| 132.2
|-
| S+E:Trans1
| <math>48.8 \pm 0.9</math>
| <math>49.0 \pm 1.5</math>
| 173.4
|-
| SEARN: Unsup
| <math>45.8 \pm 1.6</math>
| <math>45.4 \pm 2.2</math>
| 27.6
|-
| S+E: Sup
| <math>79.9 \pm 0.2</math>
| <math>78.6 \pm 0.8</math>
| 350.5
|-
| SEARN: Sup
| <math>81.0 \pm 0.3</math>
| <math>81.6 \pm 0.4</math>
| 24.4
|}
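
The accuracy reported is presumably directed (unlabeled) dependency accuracy: the fraction of words whose predicted head matches the gold-standard head. The paper is the authority on the exact metric; the following is only a minimal sketch of that computation, with hypothetical names.

<pre>
def attachment_accuracy(predicted_heads, gold_heads):
    """Directed dependency accuracy over a corpus (hedged sketch; the exact
    metric is defined in the paper). Each element is a list giving, for every
    word in a sentence, the index of its head (0 for the root)."""
    correct = total = 0
    for pred, gold in zip(predicted_heads, gold_heads):
        correct += sum(int(p == g) for p, g in zip(pred, gold))
        total += len(gold)
    return 100.0 * correct / total
</pre>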

Though unsupervised [[UsesMethod::SEARN]] does not perform best (it falls short of the best Smith and Eisner models), it converges significantly more quickly than any of the other algorithms. The supervised results cannot be compared fairly, since [[UsesMethod::SEARN]] uses additional features that the Smith and Eisner model does not, but they still give a reasonable upper bound (and reinforce the observation that [[UsesMethod::SEARN]] converges more quickly).

== Semi-supervised ==

Unsupervised SEARN can be extended to semi-supervised learning simply by using the latent component within the loss function (i.e., when a label <math>y</math> is available, the latent half of the prediction is scored against it; a sketch follows below). The results show that the semi-supervised technique starts with the highest accuracy (about 70%) and grows slowly to the maximum (about 80%); the supervised technique starts very low (~40%) and reaches about 80% once 5000+ labeled examples have been provided; the unsupervised method climbs from about 25% to about 40%.
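
A hedged sketch of that loss, under the assumption that labeled sequences supply a gold <math>y</math> for the latent half while unlabeled ones contribute only reconstruction error; the names and the simple Hamming costs are illustrative, not the paper's definitions.

<pre>
def sequence_loss(x, prediction, y_true=None):
    """Loss for one 2T-step prediction: the first T entries are the latent
    structure, the last T entries reconstruct x. Illustrative sketch only.

    y_true is None for unlabeled sequences (the purely unsupervised case)."""
    T = len(x)
    latent_pred, recon_pred = prediction[:T], prediction[T:]
    # Reconstruction loss: always available, depends only on x.
    loss = sum(int(a != b) for a, b in zip(recon_pred, x))
    # Latent loss: only when a label is observed (semi-/fully supervised case).
    if y_true is not None:
        loss += sum(int(a != b) for a, b in zip(latent_pred, y_true))
    return loss
</pre>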

== Related Papers ==

The papers on the methods compared in the experiments are:
* [[RelatedPaper::Smith and Eisner guiding unsupervised grammar induction using contrastive estimation|Smith, N. A., & Eisner, J. (2005). Guiding unsupervised grammar induction using contrastive estimation. IJCAI Wk. on Grammatical Inference Apps (pp. 73-82).]]
* [[RelatedPaper::Klein and Manning corpus-based induction of syntactic structure|Klein, D., & Manning, C. (2004). Corpus-based induction of syntactic structure: Models of dependency and constituency. Proc. Conf. of the Assoc. for Computational Linguistics (pp. 478-485).]]

Additional papers on [[UsesMethod::SEARN]]:
* [[RelatedPaper::Daume et al, ML 2009|Daume, H., Langford, J., & Marcu, D. (2009). Search-based structured prediction. Machine Learning, 75(3), 297-325.]]
* [[RelatedPaper::Daume et al, 2006|Daume, H., Langford, J., & Marcu, D. (2006). SEARN in Practice. Unpublished manuscript.]]