# Stacked Sequential Learning

Stacked sequential learning is a meta-learning method, proposed in Cohen and Carvalho, 2005, that addresses the mismatch between training and testing data in sequential models. It stacks two stages of prediction, where the second stage makes use of the results of the first stage.

## Motivation

Consider the general form of sequential prediction, in which we need to predict the label sequence ${\displaystyle \mathbf {y} =(y_{1},\ldots ,y_{n})}$ given the observation sequence ${\displaystyle \mathbf {x} =(x_{1},\ldots ,x_{n})}$. The prediction of one label ${\displaystyle y_{i}}$ will depend on neighboring labels, typically ${\displaystyle y_{i-1}}$ and ${\displaystyle y_{i+1}}$. During training, we have the true neighboring labels; but during testing, ${\displaystyle y_{i}}$ must be predicted from the *predicted* neighboring labels. Because the model's assumptions never match reality exactly, the distribution of the predicted neighboring labels differs from that of the true labels, and this mismatch can degrade performance.

The solution is a two-stage approach: in the first stage, we train a base classifier on the observations alone; in the second stage, we train another classifier whose input is augmented with the first stage's predicted labels, so that it can learn to correct the first classifier's mistakes. The predicted labels for the training data are obtained with cross-validation, so that each training example receives a prediction from a classifier that was not trained on it.
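The two-stage procedure can be sketched as follows. This is a minimal toy implementation, not the authors' code: `MajorityBase` is a stand-in base learner (any classifier with `fit`/`predict` could be substituted), and the names `extend`, `train_stacked`, and `predict_stacked` are hypothetical.

```python
# Toy sketch of stacked sequential learning (Cohen and Carvalho, 2005).
from collections import Counter, defaultdict

class MajorityBase:
    """Stand-in base learner: predicts the most frequent label seen
    for each observation value, falling back to the global majority."""
    def fit(self, X, y):
        votes = defaultdict(Counter)
        for xi, yi in zip(X, y):
            votes[xi][yi] += 1
        self.table = {xi: c.most_common(1)[0][0] for xi, c in votes.items()}
        self.default = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.table.get(xi, self.default) for xi in X]

def extend(X, y_hat, w_h=1, w_f=1):
    """Augment each observation with a window of predicted labels
    (w_h positions of history, w_f positions of lookahead)."""
    n = len(X)
    ext = []
    for i in range(n):
        window = tuple(y_hat[j] if 0 <= j < n else None
                       for j in range(i - w_h, i + w_f + 1))
        ext.append((X[i],) + window)
    return ext

def train_stacked(X, y, K=5, w_h=1, w_f=1):
    n = len(X)
    # Stage 1a: cross-validation predictions on the training data itself,
    # so each example is predicted by a model not trained on it.
    y_hat = [None] * n
    for k in range(K):
        test_idx = [i for i in range(n) if i % K == k]
        train_idx = [i for i in range(n) if i % K != k]
        fold = MajorityBase().fit([X[i] for i in train_idx],
                                  [y[i] for i in train_idx])
        for i, p in zip(test_idx, fold.predict([X[i] for i in test_idx])):
            y_hat[i] = p
    # Stage 1b: first-stage classifier trained on the entire corpus.
    f1 = MajorityBase().fit(X, y)
    # Stage 2: classifier trained on examples extended with predictions.
    f2 = MajorityBase().fit(extend(X, y_hat, w_h, w_f), y)
    return f1, f2

def predict_stacked(f1, f2, X, w_h=1, w_f=1):
    y_hat = f1.predict(X)                     # first-stage predictions
    return f2.predict(extend(X, y_hat, w_h, w_f))
```

At test time, `predict_stacked` runs each classifier exactly once: the first stage over the raw observations, the second over the extended examples.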

## Graphical Model Representation

The spirit of stacked sequential learning is best visualized with a graphical model representation. The following figure shows a stacked maxent model. In the first stage, a predicted label ${\displaystyle {\hat {y}}_{i}}$ is generated for each observation ${\displaystyle x_{i}}$ separately. In the second stage, the final label ${\displaystyle y_{i}}$ is generated from the predicted label ${\displaystyle {\hat {y}}_{i}}$ for that observation as well as neighboring predicted labels ${\displaystyle {\hat {y}}_{i-1}}$ and ${\displaystyle {\hat {y}}_{i+1}}$. Note that the base classifiers (maxent classifiers) are not even sequential; the "sequentialness" of the stacked model is provided by the links from ${\displaystyle {\hat {y}}_{i-1}}$ and ${\displaystyle {\hat {y}}_{i+1}}$ to ${\displaystyle y_{i}}$.

## Variations

• Any type of base classifier ${\displaystyle A}$ can be used. A stacked maxent classifier is shown above, and below is a stacked CRF.
• Although in the algorithm above both stages use the same base classifier ${\displaystyle A}$, they need not. For example, the figure below shows a stacked model where the first stage is a CRF and the second stage is a maxent classifier.

• The window ${\displaystyle [-W_{h},W_{f}]}$ can be made arbitrarily large. For example, below is a stacked maxent classifier with ${\displaystyle W_{h}=W_{f}=2}$.

• The prediction for ${\displaystyle {\hat {y}}_{i}}$ and ${\displaystyle y_{i}}$ can also depend on multiple observations. For example, in Krishnan and Manning, 2006, both ${\displaystyle {\hat {y}}_{i}}$ and ${\displaystyle y_{i}}$ depend on three observations: ${\displaystyle x_{i-1}}$, ${\displaystyle x_{i}}$, and ${\displaystyle x_{i+1}}$.

• An aggregate layer ${\displaystyle z_{i}}$ can be inserted between ${\displaystyle {\hat {y}}_{i}}$ and ${\displaystyle y_{i}}$. The features ${\displaystyle z_{i}}$ are calculated across all ${\displaystyle {\hat {y}}_{i}}$ and are used by the second-stage classifier. This is also used in Krishnan and Manning, 2006.
• It is possible to stack three stages of classifiers, or even more.
• The idea of stacking is not restricted to sequential learning, but can also be applied to general graphical learning, as in Kou and Cohen, 2007.
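The aggregate-layer variation can be illustrated with a minimal sketch. Here the assumption (for illustration only) is that identical tokens denote the same entity, and `aggregate_features` is a hypothetical name: for each token, it computes the majority label that the first stage assigned to that token anywhere in the sequence, which the second stage can consume as an extra feature.

```python
# Sketch of an aggregate layer z_i computed over all first-stage
# predictions y_hat (in the spirit of Krishnan and Manning, 2006).
from collections import Counter

def aggregate_features(tokens, y_hat):
    """For each token, the most frequent label the first stage
    assigned to that token anywhere in the sequence."""
    votes = {}
    for t, p in zip(tokens, y_hat):
        votes.setdefault(t, Counter())[p] += 1
    return [votes[t].most_common(1)[0][0] for t in tokens]
```

Feeding this majority vote to the second stage encourages different occurrences of the same token to receive the same final label, even when they are far apart in the sequence.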

## Time Complexity

During training, we need to train ${\displaystyle K+1}$ base classifiers for the first stage (${\displaystyle K}$ for cross-validation and one trained on the entire corpus), plus one base classifier for the second stage. This is ${\displaystyle O(K)}$ times the training time of a single base classifier. When training a first-stage classifier is very expensive, we can minimize the overhead by choosing ${\displaystyle K=2}$.
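The count above can be made concrete with a one-line helper (a hypothetical name, just to spell out the arithmetic):

```python
def num_base_trainings(K):
    """Base-classifier training runs in stacked sequential learning:
    K cross-validation folds, plus one first-stage model on the full
    corpus, plus one second-stage model."""
    return K + 2
```

So ${\displaystyle K=5}$ costs 7 base-classifier trainings, while ${\displaystyle K=2}$ brings that down to 4.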

During testing, we only need to run each of the two classifiers once, so the overhead relative to a single base classifier is negligible.

## Applications

• Cohen and Carvalho, 2005 apply stacked non-sequential maxent classifiers and stacked CRFs to a Sequence Partitioning problem (identifying the signature section of emails) and a Sequence Classification problem (classifying music as happy or sad). Results show that stacked classifiers outperform their non-stacked counterparts, and stacked maxent classifiers even outperform non-stacked CRFs.
• Krishnan and Manning, 2006 use stacked CRFs to model non-local dependencies for Named Entity Recognition. This paper uses the variation with an aggregate layer. This layer calculates the most frequent predicted label for each entity, which is used as an input feature for the second stage. The second stage then uses this information to encourage different occurrences of the same entity to receive identical labels.