Structured Output Learning with Indirect Supervision

Citation

Structured Output Learning with Indirect Supervision, by M.W. Chang, V.Srikumar, D.Goldwasser, D.Roth. In Proceedings of the 27th International Conference on Machine Learning, 2010.

This Paper is available online [1].

Summary

The problem with a lot of structured output problems is that is often time-consuming and difficult to obtain labelings and annotated structures for training. For example, one may need to obtain a parse tree, a POS tag sequence, or a translation for every given sentence in your training data. Naturally, this becomes more onerous the larger your training set. The key idea of this paper is that often, structured output problems have a companion learning problem which is to determine whether a given structure is legitimate or not. For example, for POS tagging, the binary companion problem is to determine whether the POS tag sequence for a given input sentence is valid or not. The basic assumption (which seems quite realistic to me) is that "direct supervision", which in this case would mean the ground truth tag sequence, is difficult and cumbersome to obtain, obtaining binary annotations (good/bad, legitimate/illegitimate) is far easier.

This paper presents a large margin-based framework that learns on both fully supervised (i.e., traditional supervised training) data as well as binary "indirectly supervised" training data. Experiments show that binary data helps in improving performance on three different NLP structure learning problems.

Main Approach

The authors present their approach, which they call "Joint Learning with Indirect Supervision", or J-LIS, as a generalization of the structured output | SVMs. The basic idea is to learn from a small number of fully or directly supervised training examples, and augment training with indirectly supervised (binary labels) training data.

The goal of learning in standard structured output prediction is to find a weight vector w such that $\mathbf {h} _{i}=\arg \max _{\mathbf {h} \in {\mathcal {H}}(\mathbf {x} )}\mathbf {w} ^{T}\Phi (\mathbf {x} _{i},\mathbf {h} )$ . In this case, x is the input, ${\mathcal {H}}$ (x) is the set of all feasible structures with x as input, and $\Phi$ is a feature generation function. The key assumption that is used so that we can incorporate indirect supervision is that an input x generates a valid output ( $y=1$ ) if and only if its best structure is well-formed, and conversely an input x generates an invalid output ( $y=-1$ ) if every structure for that input is bad. In mathematical terms:

$\forall \quad (\mathbf {x} ,-1)\in B^{-},\quad \forall \mathbf {h} \in {\mathcal {H}}(\mathbf {x} ),\mathbf {w} ^{T}\Phi (\mathbf {x} ,\mathbf {h} )\leq 0$
$\forall \quad (\mathbf {x} ,+1)\in B^{+},\quad \exists \mathbf {h} \in {\mathcal {H}}(\mathbf {x} ),\mathbf {w} ^{T}\Phi (\mathbf {x} ,\mathbf {h} )\geq 0$

where $B^{+},B^{-}$ refer to the positive and negative partitions of the indirect supervision training set.

In their learning algorithm, the authors build off of the standard structural SVM (link this page), which formulates learning as a minimization problem: $\min _{\mathbf {w} }{\frac {||\mathbf {w} ||^{2}}{2}}+C_{1}\sum _{i\in S}L_{S}(\mathbf {x} _{i},\mathbf {h} _{i},\mathbf {w} )$ , where S is the set of fully supervised training examples, and $L_{S}$ is the loss function. The loss function is the same as in the original structural SVM paper ( Required Reading for 09/29/2011 class)

They expand upon this by adding an additional term to take into account the learning from the binary/indirect supervision training data: $\min _{\mathbf {w} }{\frac {||\mathbf {w} ||^{2}}{2}}+C_{1}\sum _{i\in S}L_{S}(\mathbf {x} _{i},\mathbf {h} _{i},\mathbf {w} )+C_{2}\sum _{i\in B}L_{B}(\mathbf {x} _{i},y_{i},\mathbf {w} )$ , where $L_{B}$ is the loss for the binary prediction. $L_{B}$ is defined as: $L_{B}(\mathbf {x} _{i},y_{i},\mathbf {w} )={\mathcal {l}}\left(1-y_{i}\max _{\mathbf {h} \in {\mathcal {H}}(\mathbf {x} )}(\mathbf {w} ^{T}\Phi _{B}(\mathbf {x} _{i},\mathbf {h} ))\right)$ . $\Phi _{B}$ is simply the feature vector normalized according to the input.