Daume ICML 2009

In progress by Francis Keith

Citation

"Unsupervised Search-based Structured Prediction", Hal Daume III, ICML 2009

Online Version

An online version of the paper can be found here [1]

Summary

This paper details a methodology for unsupervised SEARN. It compares the results against other methods, first on synthetic data and then against other approaches to unsupervised dependency parsing.

Algorithm

The basic SEARN algorithm is described (see the SEARN wiki article for more background).

In the supervised form, the algorithm uses a sample space of <math>(x,y)</math> pairs, where <math>x</math> is the input and <math>y</math> is the true output. In the unsupervised case, we must account for the fact that the algorithm must be run on an input of simply <math>x</math>, with the classifier still producing <math>y</math>.

The proposed solution is essentially to predict <math>y</math> first, and then perform the normal prediction. The loss function depends only on <math>x</math>: although we predict <math>y</math>, we never observe any "true" outputs.

Given an input <math>x</math>, which is a sequence of length <math>T</math>, we consider the true output to be of length <math>2T</math>: the first <math>T</math> components are drawn from the possible vocabulary for <math>y</math> (that is, the possible outputs) and represent the latent structure, while the last <math>T</math> components are drawn from the vocabulary for the input and are simply <math>x</math> itself. We can then use SEARN on this input. It is important to note that constructing features for these two parts is different (a sketch follows the list below):

  • For <math>y</math>, the <math>T</math> latent components, features can be based on <math>x</math> and the partial latent structure (i.e., at <math>y_t</math>, we can look at <math>y_1,\ldots,y_{t-1}</math>).
  • For <math>x</math>, the <math>T</math> input components, features can be based on <math>x</math> and <math>y</math>. However, the paper notes that the ideal feature set here depends on correctly predicting the latent structure. This makes intuitive sense - if we correctly predict <math>y</math>, then we have an "optimal" label set.
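
To make this construction concrete, the following is a minimal Python sketch of the length-<math>2T</math> target and of feature functions for the two halves. The function names, the latent_vocab argument, and the particular features are hypothetical choices for illustration; they are not taken from the paper.

<pre>
# A minimal sketch (hypothetical names, not code from the paper) of the
# length-2T target and of feature maps for its two halves.

def augmented_target(x, latent_vocab):
    """Describe the 2T-long prediction problem for an input x of length T.

    Positions 0..T-1 range over latent_vocab (the latent structure y);
    positions T..2T-1 have the true input symbols as their "true output",
    which is why the loss can be computed from x alone.
    """
    T = len(x)
    latent_positions = [list(latent_vocab) for _ in range(T)]  # candidate labels for each y_t
    input_positions = list(x)                                  # the true output for x_t is x_t itself
    return latent_positions, input_positions


def latent_features(x, partial_y, t):
    # Features for predicting y_t: may use all of x and y_1..y_{t-1}.
    prev = partial_y[-1] if partial_y else "<s>"
    return {("prev_y", prev): 1.0, ("cur_x", x[t]): 1.0}


def input_features(x, y, t):
    # Features for re-predicting x_t: may use all of x and the full predicted
    # latent structure y; they are only as useful as the predicted y is accurate.
    prev = x[t - 1] if t > 0 else "<s>"
    return {("y_t", y[t]): 1.0, ("prev_x", prev): 1.0}
</pre>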

In dealing with <math>\pi^*</math>, the optimal (and thus initial) policy: it already predicts the input components exactly (they are just the true input <math>x</math>), and it can predict the latent components arbitrarily or randomly. This gives us the ability to run the first iteration of SEARN.
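
As an illustration of this behaviour, a sketch of such an initial policy (the name and signature are assumptions, not from the paper) simply copies the true input on the input components and guesses arbitrarily on the latent components:

<pre>
import random

# A sketch (not the paper's code) of the initial policy pi*: it reproduces
# the true input exactly on the last T positions, where zero cost is
# achievable, and guesses arbitrarily on the first T latent positions,
# where no supervision exists.

def optimal_policy(x, latent_vocab, t):
    """Return pi*'s action at position t of the length-2T sequence."""
    T = len(x)
    if t < T:
        return random.choice(latent_vocab)   # latent component: arbitrary guess
    return x[t - T]                          # input component: copy the true input
</pre>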

The paper describes the unsupervised algorithm as follows:

  • First iteration: Use <math>\pi = \pi^*</math> to classify the sequence of latent and input components. Note that <math>\pi^*</math> randomly or arbitrarily produces classifications for the latent components, and all of their costs are 0, so this won't even induce any update to the classifier for these components. The important piece is that <math>\pi^*</math> does predict classifications for the input components, and recall that these predictions produce input symbols taken from the true input.
  • Second iteration: Given that <math>\pi \neq \pi^*</math> due to the update, we are no longer guaranteed to produce zero-cost classifications for the latent components. Given that features for the latent structure are based partially on <math>x</math>, they get either high costs if the classifier performs poorly or low costs if it performs well. As a result, the classifier will begin to predict classifications for the latent components.
  • More iterations move the output closer to the learned classifier and further from the "optimal" policy (which arbitrarily produced the latent classifications); a sketch of this interpolation follows the list.
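
The drift away from the initial policy can be pictured with SEARN's usual stochastic interpolation between the previous policy and the newly learned classifier <math>h</math>, controlled by an interpolation constant <math>\beta</math>. The sketch below is an assumption about how that interpolation could be written, not the paper's implementation:

<pre>
import random

# Sketch of a SEARN-style stochastic policy interpolation (stated as an
# assumption, not quoted from the paper): with probability beta follow the
# newly learned classifier h, otherwise fall back to the previous policy.
# Repeating this each iteration shifts predictions toward the learned
# classifier and away from the arbitrary initial latent guesses.

def interpolate(h, old_policy, beta):
    def policy(x, t):
        if random.random() < beta:
            return h(x, t)           # learned classifier's prediction
        return old_policy(x, t)      # previous policy (initially pi*)
    return policy
</pre>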