Stacked Sequential Learning

From Cohen Courses

This is a meta-learning method that deals with the mismatch between training and testing data for sequential models, proposed in Cohen and Carvalho, 2005. It stacks two stages of prediction, where the second stage makes use of the results of the first stage.


Consider the general form of sequential prediction, in which we need to predict the label sequence $\mathbf{y} = (y_1, \ldots, y_n)$ given the observation sequence $\mathbf{x} = (x_1, \ldots, x_n)$. The prediction of one label $y_i$ typically depends on its neighboring labels, $y_{i-1}$ and $y_{i+1}$. During training, the true neighboring labels are available; during testing, $y_i$ must be predicted from the predicted neighboring labels. Because the model's assumptions never match reality exactly, the distribution of the predicted neighboring labels differs from that of the true ones, and this mismatch can degrade performance.

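The solution is a two-stage approach. In the first stage, a base classifier predicts a label for each observation. In the second stage, another classifier is trained on the observations extended with the first stage's predicted neighboring labels instead of the true labels, so that it can learn from the mistakes made by the first classifier and sees the same kind of input during training and testing. The predicted labels for the training data are obtained with cross-validation, so that each training example is labeled by a classifier that was not trained on it.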


Stack Algorithm.png
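The two-stage procedure can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `VoteClassifier` is a hypothetical toy base learner over discrete features, and the names `fit_stacked`, `predict_stacked`, `extend`, the parameters `k` (number of cross-validation folds) and `window`, and the `"<pad>"` symbol are all made up for this example. It assumes at least `k` training sequences.

```python
from collections import Counter, defaultdict


class VoteClassifier:
    """Hypothetical toy base learner: each (position, value) feature votes for labels."""

    def fit(self, X, y):
        self.default = Counter(y).most_common(1)[0][0]
        self.votes = defaultdict(Counter)
        for feats, label in zip(X, y):
            for j, v in enumerate(feats):
                self.votes[(j, v)][label] += 1
        return self

    def predict(self, X):
        out = []
        for feats in X:
            tally = Counter()
            for j, v in enumerate(feats):
                tally.update(self.votes.get((j, v), {}))
            out.append(tally.most_common(1)[0][0] if tally else self.default)
        return out


def flatten(seqs):
    return [item for seq in seqs for item in seq]


def extend(xs, preds, window=1, pad="<pad>"):
    """Append the predicted labels at offsets -window..+window to each observation."""
    n = len(xs)
    extended = []
    for i in range(n):
        ctx = tuple(preds[i + d] if 0 <= i + d < n else pad
                    for d in range(-window, window + 1))
        extended.append(tuple(xs[i]) + ctx)
    return extended


def fit_stacked(seqs_x, seqs_y, make_learner, k=2, window=1):
    n = len(seqs_x)
    # Step 1: cross-validated predictions, so every training sequence is
    # labeled by a classifier that never saw it.
    cv_preds = [None] * n
    for f in range(k):
        held_out = set(range(f, n, k))
        train = [i for i in range(n) if i not in held_out]
        clf = make_learner().fit(flatten(seqs_x[i] for i in train),
                                 flatten(seqs_y[i] for i in train))
        for i in held_out:
            cv_preds[i] = clf.predict(seqs_x[i])
    # Step 2: first-stage classifier trained on the entire corpus (used at test time).
    first = make_learner().fit(flatten(seqs_x), flatten(seqs_y))
    # Step 3: second-stage classifier trained on the extended instances.
    X2 = flatten(extend(xs, p, window) for xs, p in zip(seqs_x, cv_preds))
    second = make_learner().fit(X2, flatten(seqs_y))
    return first, second


def predict_stacked(first, second, xs, window=1):
    # Each classifier is run exactly once at test time.
    return second.predict(extend(xs, first.predict(xs), window))
```

The second stage is trained on cross-validated predictions rather than on predictions of a classifier that has already seen the same data, which is what keeps its training inputs similar to what it will see at test time.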

Graphical Model Representation

The spirit of stacked sequential learning is best visualized with a graphical model representation. The following figure shows a stacked maxent model. In the first stage, a predicted label $\hat{y}_i$ is generated for each observation $x_i$ separately. In the second stage, the final label $y_i$ is generated from the predicted label $\hat{y}_i$ for that observation as well as the neighboring predicted labels $\hat{y}_{i-1}$ and $\hat{y}_{i+1}$. Note that the base classifiers (maxent classifiers) are not even sequential; the "sequentialness" of the stacked model is provided by the links from $\hat{y}_{i-1}$ and $\hat{y}_{i+1}$ to $y_i$.

Stack MaxEnt MaxEnt.png


  • Any type of base classifier can be used. A stacked maxent classifier is shown above, and below is a stacked MEMM.

Stack MEMM MEMM.png

  • Although the two stages use the same base classifier in the algorithm above, they do not have to be the same. The graphical model can also be either directed or undirected. For example, the figure below shows a stacked model where the first stage is a CRF (undirected) and the second stage is a maxent classifier (directed).

Stack CRF MaxEnt.png

  • The window of neighboring predicted labels can be made arbitrarily large. For example, below is a stacked maxent classifier with a wider window.

Stack MaxEnt MaxEnt 2.png

  • The predictions for $\hat{y}_i$ and $y_i$ can also depend on multiple observations. For example, in Krishnan and Manning, 2006, both $\hat{y}_i$ and $y_i$ depend on three observations: $x_{i-1}$, $x_i$, and $x_{i+1}$.

Stack Multiple X.png

  • An aggregate layer can be inserted between $\hat{y}$ and $y$. Its features are calculated across all $\hat{y}_i$ and are used by the second-stage classifier. This is also used in Krishnan and Manning, 2006.

Stack Aggregate Layer.png

  • It is possible to stack three stages of classifiers, or even more.
  • The idea of stacking is not restricted to sequential learning, but can also be applied to general graphical learning, as in Kou and Cohen, 2007.

Time Complexity

During training, we need to train $K+1$ base classifiers for the first stage ($K$ for cross-validation and one trained on the entire corpus), and one base classifier for the second stage. This is roughly $K+2$ times the training time of one base classifier. When training a first-stage classifier is very expensive, we can minimize the overhead by choosing $K=2$.

During testing, we only need to run each of the two classifiers once, so prediction costs about as much as running two base classifiers.


  • Cohen and Carvalho, 2005 applies stacked non-sequential maxent classifiers and stacked CRFs to a Sequence Partitioning problem (identifying the signature section of emails) and a Sequence Classification problem (classifying music as happy or sad). Results show that stacked classifiers outperform their non-stacked counterparts, and stacked maxent classifiers even outperform non-stacked CRFs.
  • Krishnan and Manning, 2006 uses stacked CRFs to model non-local dependencies for Named Entity Recognition. This paper uses the variation with an aggregate layer. This layer calculates the most frequent predicted label for each entity, which is used as an input feature for the second stage. The second stage then uses this information to encourage different occurrences of the same entity to have identical labels.
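
The aggregate layer described in the last item can be illustrated with a short Python sketch. It is a simplified assumption-based example, not the feature set of Krishnan and Manning, 2006: it treats each distinct token string as an "entity", and the function name `majority_label_features` is hypothetical.

```python
from collections import Counter, defaultdict


def majority_label_features(tokens, first_stage_labels):
    """For each token, return the most frequent first-stage label assigned
    to that same token string anywhere in the document."""
    by_token = defaultdict(Counter)
    for tok, lab in zip(tokens, first_stage_labels):
        by_token[tok][lab] += 1
    return [by_token[tok].most_common(1)[0][0] for tok in tokens]


tokens = ["Paris", "visited", "Paris", "in", "Paris"]
first_stage = ["LOC", "O", "PER", "O", "LOC"]
print(majority_label_features(tokens, first_stage))
```

Here the second occurrence of "Paris" was labeled PER by the first stage, but its aggregate feature is LOC, the majority label for that token. A second-stage classifier given this feature has evidence to relabel the outlier consistently with the other occurrences.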