Structured Output Learning with Indirect Supervision

== Citation ==

Structured Output Learning with Indirect Supervision, by M.-W. Chang, V. Srikumar, D. Goldwasser, and D. Roth. In Proceedings of the 27th International Conference on Machine Learning, 2010.

The paper is available online [1].

== Summary ==

A problem with many structured output tasks is that it is often time-consuming and difficult to obtain labelings and annotated structures for training. For example, one may need to obtain a parse tree, a POS tag sequence, or a translation for every sentence in your training data, and naturally this becomes more onerous the larger your training set. The key idea of this paper is that structured output problems often have a companion learning problem: determining whether a given structure is legitimate or not. For POS tagging, for example, the binary companion problem is to determine whether a candidate POS tag sequence for a given input sentence is valid or not. The basic assumption (which seems quite realistic to me) is that while "direct supervision", which in this case would mean the ground-truth tag sequence, is difficult and cumbersome to obtain, binary annotations (good/bad, legitimate/illegitimate) are far easier to collect.

This paper presents a large-margin framework that learns from both fully supervised data (i.e., traditional supervised training) and binary "indirectly supervised" training data. Experiments show that the binary data improves performance on three different NLP structure learning problems.

== Main Approach ==

The authors present their approach, which they call "Joint Learning with Indirect Supervision" (J-LIS), as a generalization of structured output SVMs. The basic idea is to learn from a small number of fully (directly) supervised training examples and to augment training with indirectly supervised (binary-labeled) training data.

The goal of learning in standard structured output prediction is to find a weight vector <math>\mathbf{w}</math> such that the prediction <math>\mathbf{h}^* = \arg\max_{\mathbf{h} \in \mathcal{H}(\mathbf{x})} \mathbf{w}^T \Phi(\mathbf{x}, \mathbf{h})</math> recovers the correct structure. In this case, <math>\mathbf{x}</math> is the input, <math>\mathcal{H}(\mathbf{x})</math> is the set of all feasible structures with <math>\mathbf{x}</math> as input, and <math>\Phi</math> is a feature generation function. The key assumption used to incorporate indirect supervision is that an input <math>\mathbf{x}</math> generates a valid output (<math>y = 1</math>) if and only if its best structure is well-formed, and conversely an input <math>\mathbf{x}</math> generates an invalid output (<math>y = -1</math>) if every structure for that input is bad. In mathematical terms:

<math>\forall i \in B^+, \; \exists \mathbf{h} \in \mathcal{H}(\mathbf{x}_i) \text{ such that } \mathbf{h} \text{ is a well-formed structure for } \mathbf{x}_i</math>

<math>\forall i \in B^-, \; \forall \mathbf{h} \in \mathcal{H}(\mathbf{x}_i), \; \mathbf{h} \text{ is a bad structure for } \mathbf{x}_i</math>

where <math>B^+</math> and <math>B^-</math> refer to the positive and negative partitions of the indirect supervision training set.
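To make the coupled prediction problems concrete, the following minimal Python sketch (mine, not the authors') spells out the structured prediction rule and the binary companion decision it induces. Here <code>candidate_structures</code> and the feature-map arguments are hypothetical stand-ins for <math>\mathcal{H}(\mathbf{x})</math> and <math>\Phi</math>/<math>\Phi_B</math>, and treating "the best structure scores above zero" as the validity test is an assumption about how the decision would be operationalized.

<pre>
import numpy as np

def predict_structure(w, x, candidate_structures, feature_map):
    """Structured prediction: h* = argmax_{h in H(x)} w^T Phi(x, h)."""
    return max(candidate_structures(x),
               key=lambda h: float(np.dot(w, feature_map(x, h))))

def companion_binary_decision(w, x, candidate_structures, feature_map_B):
    """Companion binary problem: call x valid (+1) iff its best structure under
    Phi_B scores above zero, i.e. max_h w^T Phi_B(x, h) > 0; otherwise -1."""
    best_score = max(float(np.dot(w, feature_map_B(x, h)))
                     for h in candidate_structures(x))
    return 1 if best_score > 0 else -1
</pre>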

In their learning algorithm, the authors build off of the standard structural SVM [[RelatedPaper::Tsochantaridis et al 2005: Large Margin Methods for Structured and Interdependent Output Variables]], which formulates learning as the minimization problem

<math> \min_{\mathbf{w}} \frac{||\mathbf{w}||^2}{2} + C_1 \sum_{i \in S} L_S (\mathbf{x}_i, \mathbf{h}_i, \mathbf{w}) </math>,

where <math>S</math> is the set of fully supervised training examples and <math>L_S</math> is the loss function. The loss function is the same as in the original structural SVM paper (required reading for the 09/29/2011 class).
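The wiki does not write <math>L_S</math> out; as a reminder of what "the same as in the original structural SVM paper" refers to, here is a sketch of the margin-rescaled structured hinge loss, where <code>delta</code> is a user-supplied error function (e.g., Hamming distance between tag sequences) and the interface is the same hypothetical one as above.

<pre>
import numpy as np

def structured_hinge_loss(w, x, h_gold, candidate_structures, feature_map, delta):
    """Margin-rescaled structured hinge loss (Tsochantaridis et al., 2005):
    L_S(x, h_gold, w) = max_h [ delta(h_gold, h) + w^T Phi(x, h) ] - w^T Phi(x, h_gold),
    where delta(h_gold, h) measures how wrong the candidate structure h is."""
    loss_augmented = max(delta(h_gold, h) + float(np.dot(w, feature_map(x, h)))
                         for h in candidate_structures(x))
    return max(0.0, loss_augmented - float(np.dot(w, feature_map(x, h_gold))))
</pre>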

They expand upon this by adding an additional term to take into account the learning from the binary/indirect supervision training data:

<math> \min_{\mathbf{w}} \frac{||\mathbf{w}||^2}{2} + C_1 \sum_{i \in S} L_S (\mathbf{x}_i, \mathbf{h}_i, \mathbf{w}) + C_2 \sum_{i \in B} L_B(\mathbf{x}_i, y_i, \mathbf{w}) </math>,

where <math>L_B</math> is the loss for the binary prediction. <math>L_B</math> is defined as

<math>L_B (\mathbf{x}_i, y_i, \mathbf{w}) = \ell \left(1-y_i \max_{\mathbf{h} \in \mathcal{H}(\mathbf{x}_i)} \mathbf{w}^T \Phi_B(\mathbf{x}_i, \mathbf{h})\right)</math>.

<math>\Phi_B</math> is simply the feature vector normalized according to the input.
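Putting the two terms together, the following sketch (same hypothetical interface as above; the squared hinge used for <math>\ell</math> is an assumption, chosen only as a common convex surrogate) computes <math>L_B</math> and the full three-term objective.

<pre>
import numpy as np

def binary_loss(w, x, y, candidate_structures, feature_map_B,
                surrogate=lambda z: max(0.0, z) ** 2):
    """L_B(x, y, w) = l(1 - y * max_h w^T Phi_B(x, h)).
    `surrogate` plays the role of l; squared hinge is assumed here."""
    best_score = max(float(np.dot(w, feature_map_B(x, h)))
                     for h in candidate_structures(x))
    return surrogate(1.0 - y * best_score)

def jlis_objective(w, S, B, C1, C2, structured_loss, binary_loss_fn):
    """||w||^2/2 + C1 * sum_{i in S} L_S(x_i, h_i, w) + C2 * sum_{i in B} L_B(x_i, y_i, w).
    S is a list of (x, h_gold) pairs; B is a list of (x, y) pairs with y in {+1, -1}."""
    regularizer = 0.5 * float(np.dot(w, w))
    supervised = C1 * sum(structured_loss(w, x, h) for x, h in S)
    indirect = C2 * sum(binary_loss_fn(w, x, y) for x, y in B)
    return regularizer + supervised + indirect
</pre>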

The authors then present the optimization algorithm they use to solve the learning problem formulated above. They first discuss the convexity properties of the objective function, and note that the summation in the third term, the one associated with the indirect supervision, can be broken into two summations, <math>G_1 = C_2 \sum_{i \in B^-} L_B(\mathbf{x}_i, y_i, \mathbf{w})</math> and <math>G_2 = C_2 \sum_{i \in B^+} L_B(\mathbf{x}_i, y_i, \mathbf{w})</math>. The second term in this breakdown, where we sum over the positive binary labels, is problematic because of the definition of a valid output: we need to find the best structure and check whether it is well-formed, and this maximization (finding the best structure) makes the term non-convex. To get around this problem, the authors modify the Concave-Convex Procedure (CCCP) presented by Yuille & Rangarajan, Neural Computation 2003. The basic idea is that, for a fixed weight vector, one can define a convex approximation to <math>G_2</math>. This approximation has to be recomputed every time the weight vector is updated, i.e., at every iteration. At each iteration, the resulting convex problem is optimized in the dual using coordinate descent.
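As a rough illustration of this alternating scheme (not the authors' implementation), the outer loop can be sketched as below; <code>solve_convex_subproblem</code> is a hypothetical placeholder for the dual coordinate-descent solver applied to the convexified objective.

<pre>
import numpy as np

def jlis_cccp_outer_loop(w_init, B_pos, candidate_structures, feature_map_B,
                         solve_convex_subproblem, num_iterations=10):
    """CCCP-style optimization sketch for the non-convex positive-example term G_2.

    At each outer iteration the current weights are frozen, the best structure
    h_i* = argmax_h w^T Phi_B(x_i, h) is committed for every positive binary
    example (this yields a convex approximation of G_2 as a function of w),
    and the resulting convex problem is re-solved before repeating.
    """
    w = np.asarray(w_init, dtype=float)
    for _ in range(num_iterations):
        # Convexification step: fix the maximizing structure per positive example.
        frozen_structures = [
            max(candidate_structures(x),
                key=lambda h: float(np.dot(w, feature_map_B(x, h))))
            for x, _y in B_pos
        ]
        # Inner step: solve the convex approximation (dual coordinate descent in the paper).
        w = solve_convex_subproblem(w, frozen_structures)
    return w
</pre>

Freezing the maximizing structure for each positive example gives a convex bound on <math>G_2</math> that matches it at the current weights, which is exactly why the approximation must be rebuilt after every weight update.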

== Baseline & Results ==

The approach was applied to three problems:

* Named Entity Transliteration
* POS Tagging
* Information Extraction

The baseline was the structured SVM.

=== Named Entity Transliteration ===

This task is also known as phonetic transliteration alignment, and its main goal is to find the best phonetic alignment between the character sequences of two named entities. For example, the named entity 'Italy' in English would be 'Italia' in Italian, and in order to recognize that two named entities in two different languages correspond to the same concept, one would need to train a cross-lingual phonetic alignment model. The dataset they used is available online [http://flake.cs.uiuc.edu/~mchang21/data.html], and consists of 300 Hebrew-English named entity pairs, with manual annotations of the alignments between the NE segments. The training/test split was 100/200 pairs. In addition, they augmented this data with 1,000 positive and 10,000 negative pairs obtained from web crawls. They varied the number of fully annotated, positive, and negative supervised points and report F1 scores:

[[File:f1net.png]]

=== POS Tagging ===

The POS data is from the WSJ corpus. 25,600 tokens (1,000 sentences) were used for training and testing. From a separate set of 25,600 tokens, 2,000 indirect supervision examples (1,000 positive, 1,000 negative) were created. No dictionary was used. Results are reported in terms of POS accuracy; indirect supervision helps most when the fully annotated corpus is small.

[[File:posresults.png]]

=== Information Extraction ===

Two datasets were used here:

* the citations dataset from [[RelatedPaper::Frietag_2000_Maximum_Entropy_Markov_Models_for_Information_Extraction_and_Segmentation| McCallum et al, ICML 2000]]: 300 fully annotated examples for training, 100 for development, and 100 for test.
* Craigslist advertisements data for housing (e.g., size, rent, neighborhood), also available online [http://nlp.stanford.edu/~grenager/]: 100 fully annotated examples for training, and the same number each for development and test.

Positive and negative indirect supervision data consisted of 1,000 examples each (50k tokens for citations, 200k for advertisements). Token-level accuracy is measured as the number of fully annotated examples is varied. Some decent results here too: J-LIS with 500 fully annotated examples outperforms the structured SVM with 1,000.

[[File:ieresults.png]]

== Related Work ==

* [[Contrastive_Estimation | Contrastive Estimation]] [[RelatedPaper:: Smith_and_Eisner_2005:Contrastive_Estimation:_Training_Log-Linear_Models_on_Unlabeled_Data| paper link]] is fairly similar to this work, and the authors dedicate a lot of discussion to comparisons. CE is essentially J-LIS without any fully annotated examples. CE cannot be applied when the relationship between good and bad examples is not known (unlike J-LIS), and it must marginalize over all possible hidden structures (J-LIS simply searches for the best structure). In terms of performance, the authors compare CE, J-LIS, and EM; J-LIS falls somewhere between EM and CE, primarily because J-LIS only finds the single best structure, which, without good initialization, may cause it to commit to a structure too early. This comparison used J-LIS without any fully labeled examples, hence the lack of initialization. With just a couple of labeled examples, J-LIS significantly outperforms CE.
* The baseline that they built on top of was the structured SVM [[RelatedPaper::Tsochantaridis et al 2005: Large Margin Methods for Structured and Interdependent Output Variables]].