Difference between revisions of "Smith and Eisner 2005:Contrastive Estimation: Training Log-Linear Models on Unlabeled Data"

Latest revision as of 17:16, 13 October 2011

Citation

Smith, Noah A. and Jason Eisner (2005). Contrastive estimation: Training log-linear models on unlabeled data. Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL), pages 354-362, Ann Arbor, Michigan, June.

Online version

Smith and Eisner 2005

Summary

This is an interesting paper that presents an unsupervised Contrastive Estimation method for Conditional Random Fields and other Log-Linear Models, which can be easily applied to estimation problems in Part of Speech Tagging, Named Entity Recognition and Semantic Role Labeling. When applying this technique to POS tagging, the observed results outperforms Expectation Maximization, and is robust when the dictionary quality is poor.

Brief Descriptions of the Methods

The authors discuss three major techniques in this paper: Contrastive Estimation, Numerical Optimization, and three Neighborhood Functions. Since we have already introduced the Contrastive Estimation method on a separate method page, we briefly introduce the numerical optimization method used by this paper, and focus on three Neighborhood Functions that are implemented as Contrastive Estimation plug-ins.

Contrastive Estimation

Please refer to this method page: Contrastive Estimation.

Numerical Optimization

The numerical optimization in this contrastive estimation denominator is not the same as the ones used supervised CRFs.

\sum _{i}\mathbf {E} _{\theta }[f_{j}|x_{i}]-\mathbf {E} _{\theta }[f_{j}|Neighbor(x_{i})]

Here, the authors use the L-BFGS approach, and do not use the observed feature value $f_{j}(x_{i},y_{i}^{*})$ . Note that this function is not concave, which means that it might fall in to a local optimum, rather than the global optimum. This suggests that the initialization of parameter $\theta$ is crucial in this context.

Lattice Neighborhoods

Since the proposed technique optimizes the denominator of

\log \prod _{i}{\frac {\sum _{(y\in Y)}u(x,y|\theta )}{\sum _{(x,y)\in Neighbor(x_{i})\times Y}u(x,y|\theta )}}

it is clear that the neighborhood function is the key to performance of the POS tagging task. Given a input string $x_{1}^{m}$ of $x={x_{1},x_{2},...,x_{m}}$ and a substring $x_{i}^{j}$ of $x={x_{i},x_{i+1},...,X_{j}}$ , the direct representation of the neighbors that delete a single symbol is:

DEL1WORD(x_{1}^{m})=\{x_{1}^{l-1}x_{l+1}^{m}|1\leq l\leq m\}\cup \{x_{1}^{m}\}

Another neighbor function is to randomly transpose any pair of adjacent words:

TRANS1(x_{1}^{m})=\{x_{1}^{l-1}x_{l+1}x_{l}x_{l+2}^{m}|1\leq l<m\}\cup \{x_{1}^{m}\}

If we combine the $DEL1WORD(x_{1}^{m})$ and the $TRANS1(x_{1}^{m})$ functions by taking the $union$ , we can generate a large set of neighbors, $DELORTRANS1(x_{1}^{m})$ . Besides the three functions above, the authors also propose the $DEL1SUBSEQ$ function, which allows the deletion of any contiguous subsequence of words that is smaller than $x_{1}^{m}$ . Finally, a $LENGTH$ function that consists of $\Sigma ^{m}$ is also proposed.

The Datasets

The dataset used in this paper is the Penn Treebank English WSJ corpus, which allows the authors to vary the amount from 12K to 96K. The authors have mentioned that around 2.3 tags (per token) have to be considered by the tagger.

Experimental Results

The authors conduct two major experiments: first, they compare the contrastive estimation approach with existing EM approach on POS tagging with unlabeled data; secondly, the author has also experimented with the robustness of this approach by removing knowledge from the lexicon, and adding features. To analyze the experiment results precisely, we take the original figures and tables in the experiment section from the Smith and Eisner 2005 paper.

Comparison with EM

In the above figure, it is clear that fully supervised methods are still the state-of-the-art methods for POS tagging when labeled data are present. However, when comparing to EM in an unsupervised setting, the LENGTH model obtains an Viterbi accuracy of 78.9%, which is significantly higher than the EM baseline (60.9%) on the 96K-word dataset. DEL1WORD and DEL1SUBSEQ are even worse than the EM on larger datasets. The DELORTR1 and TRAN1 models also achieve good results.

Evaluation on Robustness in the Unsupervised Settings

In Table 3, it shows the experiment results when tagging dictionary is tweaked. It is observed that when removing knowledge from the lexicon, all models are worse than before. Among them, the LENGTH model suffers the most. DELORTRAN1 and TRANS1 do not manipulate emission weights for OOV, so it does not suffer so badly. The most interesting finding of this experiment is that when adding the spelling features, the DELORTRANS1 and TRANS1 model almost completely recover the original accuracy, though LENGTH does not fully recovered.

Related papers

From a structured prediction perspective, this paper presents an interesting contrastive estimation approach that can be compared with many existing estimation techniques, for example, joint likelihood maximization in HMMs, conditional likelihood estimation and sum of conditional likelihoods. Secondly, this paper is also in line with some other numerical optimization approach that optimizes a convex objective function. Moreover, from the empirical evaluation standpoint, the proposed unsupervised approach might not be able to outperform the standard supervised POS tagging, but it can be applied to some sequential modeling tasks where labeled data are not abundantly available, for example, Named Entity Tagging, Parts-of-speech Tagging, and Constituent Parsing for resource-poor languages. Below shows some of the related papers to this work.

Globerson et al. ICML 2007. Exponentiated Gradient Algorithms for Log Linear Structured Prediction

Berg-Kirkpatrick et al, ACL 2010: Painless Unsupervised Learning with Features

Klein 2002 conditional structure versus conditional estimation in nlp models

@@ Line 11: / Line 11: @@
 This is an interesting [[Category::paper]] that presents an unsupervised [[UsesMethod::Contrastive Estimation]] method for [[UsesMethod::Conditional Random Fields]] and other [[AddressesProblem::Log-Linear Models]], which can be easily applied to estimation problems in [[AddressesProblem::Part of Speech Tagging]], [[AddressesProblem::Named Entity Recognition]] and [[AddressesProblem::Semantic Role Labeling]]. When applying this technique to POS tagging, the observed results outperforms [[UsesMethod::Expectation Maximization]], and is robust when the dictionary quality is poor.
-== Brief description of the method ==
+== Brief Descriptions of the Methods ==
-== The Datasets ==
+The authors discuss three major techniques in this paper: [[UsesMethod::Contrastive Estimation]], [[UsesMethod::Numerical Optimization]], and three [[UsesMethod::Neighborhood Functions]]. Since we have already introduced the [[UsesMethod::Contrastive Estimation]] method on a separate method page, we briefly introduce the numerical optimization method used by this paper, and focus on three [[UsesMethod::Neighborhood Functions]] that are implemented as [[UsesMethod::Contrastive Estimation]] plug-ins.
+=== Contrastive Estimation ===
+Please refer to this method page: [[UsesMethod::Contrastive Estimation]].
+=== Numerical Optimization ===
+The numerical optimization in this contrastive estimation denominator is not the same as the ones used supervised CRFs.
+:<math> \sum_i \mathbf{E}_{\theta} [f_j|x_i] - \mathbf{E}_{\theta} [f_j|Neighbor(x_i)] </math>
+Here, the authors use the L-BFGS approach, and do not use the observed feature value <math>f_j(x_i,y_i^*)</math>.
+Note that this function is not concave, which means that it might fall in to a local optimum, rather than the global optimum.
+This suggests that the initialization of parameter <math>\theta</math> is crucial in this context.
+=== Lattice Neighborhoods ===
+Since the proposed technique optimizes the denominator of
+:<math> \log \prod_{i} \frac{\sum_{(y \in Y)} u (x,y|\theta)}{\sum_{(x,y)\in Neighbor (x_i) \times Y} u (x,y|\theta)}</math>
+it is clear that the neighborhood function is the key to performance of the POS tagging task.
+Given a input string <math>x_{1}^{m}</math> of <math>x = {x_1, x_2, ..., x_m}</math> and a substring <math>x_{i}^{j}</math> of <math>x = {x_i, x_{i+1}, ..., X_j}</math>, the direct representation of the neighbors that delete a single symbol is:
+:<math>DEL1WORD(x_{1}^{m}) = \{x_1^{l-1}x_{l+1}^m | 1\le l \le m\} \cup \{x_1^{m}\}  </math>
+Another neighbor function is to randomly transpose any pair of adjacent words:
+:<math>TRANS1(x_{1}^{m}) = \{x_1^{l-1}x_{l+1}x_{l}x_{l+2}^m | 1\le l < m\} \cup \{x_1^{m}\}  </math>
+If we combine the <math>DEL1WORD(x_{1}^{m}) </math> and the <math>TRANS1(x_{1}^{m})</math> functions by taking the <math>union</math>,
+we can generate a large set of neighbors, <math>DELORTRANS1(x_{1}^{m}) </math>.
+Besides the three functions above, the authors also propose the <math>DEL1SUBSEQ </math> function, which allows the deletion of any contiguous subsequence of words that is smaller than <math>x_{1}^{m}</math>. Finally, a <math>LENGTH </math> function that consists of <math>\Sigma^m</math>
+is also proposed.
+== The Datasets ==
+The dataset used in this paper is the [[UsesDataset::Penn Treebank English WSJ]] corpus, which allows the authors to vary the amount from 12K to 96K. The authors have mentioned that around 2.3 tags (per token) have to be considered by the tagger.
 == Experimental Results ==

Difference between revisions of "Smith and Eisner 2005:Contrastive Estimation: Training Log-Linear Models on Unlabeled Data"

Latest revision as of 17:16, 13 October 2011

Contents

Citation

Online version

Summary

Brief Descriptions of the Methods

Contrastive Estimation

Numerical Optimization

Lattice Neighborhoods

The Datasets

Experimental Results

Comparison with EM

Evaluation on Robustness in the Unsupervised Settings

Related papers

Navigation menu

Page actions

Page actions

Personal tools

Navigation

Search

Tools