Structured Output Learning with Indirect Supervision
Revision as of 19:26, 29 September 2011

Citation

Structured Output Learning with Indirect Supervision, by M.-W. Chang, V. Srikumar, D. Goldwasser, D. Roth. In Proceedings of the 27th International Conference on Machine Learning, 2010.

This paper is available online [1].

Summary

A problem with many structured output tasks is that it is often time-consuming and difficult to obtain labelings and annotated structures for training. For example, one may need to obtain a parse tree, a POS tag sequence, or a translation for every sentence in the training data. Naturally, this becomes more onerous the larger the training set. The key idea of this paper is that structured output problems often have a companion learning problem: determining whether a given structure is legitimate or not. For POS tagging, for example, the binary companion problem is to decide whether a candidate POS tag sequence for a given input sentence is valid. The basic assumption (which seems quite realistic to me) is that while "direct supervision", which in this case would mean the ground-truth tag sequence, is difficult and cumbersome to obtain, binary annotations (good/bad, legitimate/illegitimate) are far easier to collect.

This paper presents a large margin-based framework that learns on both fully supervised (i.e., traditional supervised training) data as well as binary "indirectly supervised" training data. Experiments show that binary data helps in improving performance on three different NLP structure learning problems.

Main Approach

The authors present their approach, which they call "Joint Learning with Indirect Supervision", or J-LIS, as a generalization of structured output SVMs. The basic idea is to learn from a small number of fully or directly supervised training examples, and augment training with indirectly supervised (binary-labeled) training data.

The goal of learning in standard structured output prediction is to find a weight vector <math>\mathbf{w}</math> such that <math>\mathbf{h}^* = \arg\max_{\mathbf{h} \in \mathcal{H}(\mathbf{x})} \mathbf{w}^T \Phi(\mathbf{x}, \mathbf{h})</math>. In this case, <math>\mathbf{x}</math> is the input, <math>\mathcal{H}(\mathbf{x})</math> is the set of all feasible structures with <math>\mathbf{x}</math> as input, and <math>\Phi</math> is a feature generation function. The key assumption that is used so that we can incorporate indirect supervision is that an input <math>\mathbf{x}</math> generates a valid output (<math>y_i = +1</math>) if and only if its best structure is well-formed, and conversely an input <math>\mathbf{x}</math> generates an invalid output (<math>y_i = -1</math>) if every structure for that input is bad. In mathematical terms:

<math>\forall i \in B^+,\ \max_{\mathbf{h} \in \mathcal{H}(\mathbf{x}_i)} \mathbf{w}^T \Phi_B(\mathbf{x}_i, \mathbf{h}) > 0, \qquad \forall i \in B^-,\ \max_{\mathbf{h} \in \mathcal{H}(\mathbf{x}_i)} \mathbf{w}^T \Phi_B(\mathbf{x}_i, \mathbf{h}) < 0,</math>

where <math>B^+</math> and <math>B^-</math> refer to the positive and negative partitions of the indirect supervision training set.
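The inference step above, finding the highest-scoring structure in <math>\mathcal{H}(\mathbf{x})</math>, can be sketched with a toy tagger. The tag set, vocabulary, and feature layout below are illustrative choices for the sketch, not details from the paper; real systems would use Viterbi or ILP inference rather than exhaustive enumeration:

```python
from itertools import product

import numpy as np

TAGS = ["N", "V"]        # tiny tag set so exhaustive search stays feasible
VOCAB = ["dogs", "bark"]

def phi(x, h):
    """Feature vector: emission counts (word, tag) plus transition counts (tag, tag')."""
    f = np.zeros(len(VOCAB) * len(TAGS) + len(TAGS) * len(TAGS))
    for word, tag in zip(x, h):
        f[VOCAB.index(word) * len(TAGS) + TAGS.index(tag)] += 1
    for t1, t2 in zip(h, h[1:]):
        f[len(VOCAB) * len(TAGS) + TAGS.index(t1) * len(TAGS) + TAGS.index(t2)] += 1
    return f

def best_structure(x, w):
    """h* = argmax over H(x) of w . Phi(x, h), by brute-force enumeration."""
    return max(product(TAGS, repeat=len(x)),
               key=lambda h: w @ phi(x, list(h)))

# Hand-set weights: "dogs" prefers N, "bark" prefers V.
w = np.zeros(8)
w[VOCAB.index("dogs") * len(TAGS) + TAGS.index("N")] = 1.0
w[VOCAB.index("bark") * len(TAGS) + TAGS.index("V")] = 1.0
print(best_structure(["dogs", "bark"], w))  # ('N', 'V')
```
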

In their learning algorithm, the authors build off of the standard structural SVM, which formulates learning as the minimization problem <math>\min_{\mathbf{w}} \frac{||\mathbf{w}||^2}{2} + C_1 \sum_{i \in S} L_S (\mathbf{x}_i, \mathbf{h}_i, \mathbf{w})</math>, where <math>S</math> is the set of fully supervised training examples and <math>L_S</math> is the loss function. The loss function is the same as in the original structural SVM paper (Required Reading for 09/29/2011 class).

They expand upon this by adding an additional term to take into account the learning from the binary/indirect supervision training data:

<math>\min_{\mathbf{w}} \frac{||\mathbf{w}||^2}{2} + C_1 \sum_{i \in S} L_S (\mathbf{x}_i, \mathbf{h}_i, \mathbf{w}) + C_2 \sum_{i \in B} L_B(\mathbf{x}_i, y_i, \mathbf{w}),</math>

where <math>L_B</math> is the loss for the binary prediction. <math>L_B</math> is defined as

<math>L_B (\mathbf{x}_i, y_i, \mathbf{w}) = \ell \left(1-y_i \max_{\mathbf{h} \in \mathcal{H}(\mathbf{x})} (\mathbf{w}^T \Phi_B(\mathbf{x}_i, \mathbf{h}))\right).</math>

<math>\Phi_B</math> is simply the feature vector normalized according to the input.
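The combined objective can be made concrete with a toy calculation. In this sketch a plain hinge stands in for the convex loss <math>\ell</math>, <math>L_S</math> is written as a margin-rescaled hinge, and all function and argument names are illustrative, not the paper's:

```python
import numpy as np

def hinge(z):
    # Assumed stand-in for the convex loss l; the paper leaves the choice open.
    return max(0.0, z)

def objective(w, supervised, binary, phi, phi_b, structures, C1=1.0, C2=1.0):
    """J-LIS objective sketch: regularizer + structured loss on S + binary loss on B.

    supervised: list of (x, h_true); binary: list of (x, y) with y in {+1, -1};
    structures(x) enumerates the feasible set H(x).
    """
    obj = 0.5 * (w @ w)
    for x, h_true in supervised:
        # L_S as a margin-rescaled hinge: best competitor (plus margin) vs. truth.
        margin = max(w @ phi(x, h) + (0.0 if h == h_true else 1.0)
                     for h in structures(x)) - w @ phi(x, h_true)
        obj += C1 * hinge(margin)
    for x, y in binary:
        # L_B(x, y, w) = l(1 - y * max_h w . Phi_B(x, h))
        best = max(w @ phi_b(x, h) for h in structures(x))
        obj += C2 * hinge(1.0 - y * best)
    return obj

# Tiny usage: two possible structures, one-hot (already normalized) features.
phi = lambda x, h: np.eye(2)[h]
structures = lambda x: [0, 1]
w = np.array([1.0, 0.0])
print(objective(w, [(None, 0)], [(None, +1)], phi, phi, structures))  # 0.5
```

In the usage example both losses are zero (the true structure wins by the required margin, and the positive binary example already scores above 1), so only the regularizer contributes.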

The authors then present the optimization algorithm they use to solve the learning problem formulated above. The authors first discuss the convexity properties of the objective function, and note that the summation in the third term, which is associated with the indirect supervision, can be broken down into two summations, <math>G_1 = C_2 \sum_{i \in B^-} L_B(\mathbf{x}_i, y_i, \mathbf{w})</math> and <math>G_2 = C_2 \sum_{i \in B^+} L_B(\mathbf{x}_i, y_i, \mathbf{w})</math>. The second term in this breakdown, where we sum over the positive binary labels, is problematic because of the definition of a valid output: we need to find the best structure and check whether it is well-formed, and this maximization (finding the best structure) results in non-convex properties. To get around this problem, the authors modify the Concave-Convex procedure (CCCP) presented by Yuille & Rangarajan, Neural Computation 2003. The basic idea is that, for a fixed weight vector, one can define a convex approximation to <math>G_2</math>. This approximation must be recomputed every time the weight vector is updated, i.e. at every iteration. At each iteration, the new convex problem is optimized in the dual using coordinate descent.
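The alternating structure of this procedure can be sketched as follows. Only the indirect-supervision terms (<math>G_1</math> and <math>G_2</math>) are shown, a plain hinge stands in for <math>\ell</math>, the inner solver is simple subgradient descent rather than the paper's dual coordinate descent, and all names are illustrative:

```python
import numpy as np

def cccp_indirect(w, positives, negatives, phi_b, structures,
                  C2=1.0, outer_iters=5, inner_iters=50, lr=0.1):
    """CCCP-style loop sketch for the indirect-supervision terms.

    At each outer iteration the non-convex positive term G2 is linearized by
    freezing h* = argmax_h w . Phi_B(x, h) at the current w, which yields a
    convex subproblem; that subproblem is then optimized (plain subgradient
    descent here, instead of the paper's dual coordinate descent).
    """
    for _ in range(outer_iters):
        # Convex approximation: freeze the best structure for each positive example.
        fixed = [max(structures(x), key=lambda h: w @ phi_b(x, h))
                 for x in positives]
        for _ in range(inner_iters):
            g = w.copy()  # gradient of the ||w||^2 / 2 regularizer
            for x, h_star in zip(positives, fixed):
                if 1.0 - w @ phi_b(x, h_star) > 0:      # hinge active, y = +1
                    g -= C2 * phi_b(x, h_star)
            for x in negatives:
                # The negative term G1 is already convex: max over h of an
                # affine function, composed with a nondecreasing hinge.
                h_best = max(structures(x), key=lambda h: w @ phi_b(x, h))
                if 1.0 + w @ phi_b(x, h_best) > 0:      # hinge active, y = -1
                    g += C2 * phi_b(x, h_best)
            w = w - lr * g
    return w
```

With one positive example and one-hot features, the weight on the frozen structure's feature is pushed up until its score clears the margin, illustrating how the positive binary label pulls some structure toward well-formedness.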

Baseline & Results

The approach was applied to three problems:

  • Named Entity Transliteration
  • POS Tagging
  • Information Extraction

Related Work