Stacked Sequential Learning

This is a meta-learning method that deals with the mismatch between training and testing data for sequential models, proposed in Cohen and Carvalho, 2005. It stacks two stages of prediction, where the second stage makes use of the results of the first stage.

Motivation

Consider the general form of sequential prediction, in which we need to predict the label sequence $\mathbf {y} =(y_{1},\ldots ,y_{n})$ given the observation sequence $\mathbf {x} =(x_{1},\ldots ,x_{n})$ . The prediction of one label $y_{i}$ will depend on neighboring labels, typically $y_{i-1}$ and $y_{i+1}$ . During training, we have the true neighboring labels; but during testing, $y_{i}$ will be predicted based on the predicted neighboring labels. Due to reasons such as assumptions made by the model that do not exactly match the reality, there will be a mismatch between the distribution of the true and predicted neighboring labels, and this mismatch can result in degraded performance.

The solution is a two-stage approach: in the first stage, we train a base classifier using predicted labels instead of true labels; in the second stage, we train another classifier that learns from the mistakes made by the first classifier. The predicted labels for the training data are obtained with cross validation.

Algorithm

Graphical Model Representation

The spirit of stacked sequential learning is best visualized with a graphical model representation. The following figure shows a stacked maxent model. In the first stage, a predicted label ${\hat {y}}_{i}$ is generated for each observation $x_{i}$ separately. In the second stage, the final label $y_{i}$ is generated from the predicted label ${\hat {y}}_{i}$ for that observation as well as neighboring predicted labels ${\hat {y}}_{i-1}$ and ${\hat {y}}_{i+1}$ . Note that the base classifiers (maxent classifiers) are not even sequential; the "sequentialness" of the stacked model is provided by the links from ${\hat {y}}_{i-1}$ and ${\hat {y}}_{i+1}$ to $y_{i}$ .

Variations

Any type of base classifier $A$ can be used. A stacked maxent classifier is shown above, and below is a stacked CRF.

Although in the algorithm the classifiers in the two stages $f$ and $f'$ are trained with the same algorithm $A$ , they do not have to be the same. For example, the figure below shows a stacked model where the first stage is a CRF and the second stage is a maxent classifier.

The window $[-W_{h},W_{f}]$ can be made arbitrarily large. For example, below is a stacked maxent classifier with $W_{h}=W_{f}=2$ .

The prediction for ${\hat {y}}_{i}$ and $y_{i}$ can also depend on multiple observations. For example, in Krishnan and Manning, 2006, both ${\hat {y}}_{i}$ and $y_{i}$ depend on three observations: $x_{i-1}$ , $x_{i}$ , and $x_{i+1}$ .

An aggregate layer $z_{i}$ can be inserted between ${\hat {y}}_{i}$ and $y_{i}$ . The features $z_{i}$ are calculated across all ${\hat {y}}_{i}$ and are used by the second-stage classifier.

It is possible to stack three stages of classifiers, or even more.

Time Complexity

During training, we need to train $K+1$ base classifiers for the first stage ( $K$ for cross validation and one trained with the entire corpus), and one base classifier for the second stage. This is $O(K)$ times the training time of one base classifier. In cases when training a first-stage classifier is really expensive, we can minimize the overhead by choosing $K=2$ .

During testing, we only need to run the two classifiers once each. This is really efficient.

Applications

Cohen and Carvalho, 2005 applies stacked non-sequential maxent classifiers and stacked CRFs are to a Sequence Partitioning problem (identifying the signature section of emails) and a Sequence Classification problem (classifying music as happy or sad). Results show that stacked classifiers outperform their non-stacked counterparts, and stacked maxent classifiers even outperform non-stacked CRFs.

Krishnan and Manning, 2006 uses stacked CRFs to model non-local dependencies for Named Entity Recognition. This paper uses the variation with an aggregate layer. This layer calculates the most frequent predicted label for each entity, which is used as an input feature for the second stage. The second stage then uses this information to encourage difference occurrences of the same entity to have identical labels.

Kou and Cohen, 2007 generalizes stacked sequential learning to stacked graphical learning, in which the observations and labels do not need to form a sequence but can have arbitrary graphical relations. Stack graphical learning is applied to three problems: text region detection in images, Document Classification, and Named Entity Recognition.

Stacked Sequential Learning

Contents

Motivation

Algorithm

Graphical Model Representation

Variations

Time Complexity

Applications

Navigation menu

Page actions

Page actions

Personal tools

Navigation

Search

Tools