# Cohen and Carvalho, 2005

## Citation

Cohen, W. W., & Carvalho, V. (2005). Stacked sequential learning. In Proceedings of the international joint conference on artificial intelligence (IJCAI).


## Summary

This paper introduces stacked sequential learning, a meta-learning technique that takes an arbitrary base learner and improves its performance by making it aware of the labels of nearby examples. In experiments, the authors show that the stacked sequential technique improves the performance of both sequential learners (e.g., conditional random fields) and non-sequential learners. The technique can be applied to any base learner and imposes only a constant overhead in training time.

The technique is evaluated on the problem of recognizing the signature section of an email. The dataset contains 617 emails, each represented as a sequence of feature vectors, one per line of the email. A line is labeled positive if it is part of the email signature and negative otherwise. Hand-crafted features (e.g., "line is blank") represent each line. The dataset contains 33,013 lines in total, of which about 10% are labeled positive. The authors also test the technique on additional segmentation tasks, such as classifying lines from FAQ documents and video segmentation.

Given a sample $S$, the algorithm first partitions $S$ into $K$ equal-sized disjoint subsets $S_1, \dots, S_K$ and learns $K$ functions $f_1, \dots, f_K$, where each $f_i$ is trained on all the data except the $i$-th subset $S_i$. It then constructs the set $S' = \{(x_t, y'_t) : y'_t = f_j(x_t) \text{ and } x_t \in S_j\}$ of cross-validated predictions; this is the basis of the sequential stacking algorithm. $S'$ is then used to create a new dataset of extended instances. In the simplest case, an extended instance is the pair $(x_i, y'_{i-1})$, where $y'_{i-1}$ is the $(i-1)$-th predicted label in $y'$.
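The two steps above can be sketched in Python. This is a minimal illustration, not the authors' implementation: the base learner is a toy majority-class classifier introduced here for demonstration, and the extended instances use an assumed symmetric window of predicted labels around each position (the paper allows separate history and future window sizes).

```python
import numpy as np

def cross_validation_predictions(base_learner, X, y, K=5):
    """Partition the data into K disjoint folds; for each fold, train the
    base learner on the other K-1 folds and predict the held-out labels.
    Returns y', the vector of cross-validated predictions."""
    n = len(X)
    folds = np.array_split(np.arange(n), K)
    y_pred = np.empty(n, dtype=y.dtype)
    for fold in folds:
        train = np.setdiff1d(np.arange(n), fold)
        f = base_learner(X[train], y[train])      # f_i trained without S_i
        y_pred[fold] = f(X[fold])                 # y'_t = f_i(x_t), x_t in S_i
    return y_pred

def extend_instances(X, y_pred, w=1):
    """Build extended instances: each x_t is augmented with the predicted
    labels y'_{t-w}, ..., y'_{t+w} of its neighbours in the sequence
    (boundaries padded with 0)."""
    n = len(X)
    padded = np.concatenate([np.zeros(w), y_pred, np.zeros(w)])
    extra = np.stack([padded[t:t + 2 * w + 1] for t in range(n)])
    return np.hstack([X, extra])

def majority_learner(X_train, y_train):
    """Toy base learner for illustration: always predicts the majority class."""
    label = int(round(y_train.mean()))
    return lambda X: np.full(len(X), label)

# Small synthetic demo: 20 "lines", 3 features each, binary labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = (X[:, 0] > 0).astype(int)

y_pred = cross_validation_predictions(majority_learner, X, y, K=5)
X_ext = extend_instances(X, y_pred, w=1)
print(X_ext.shape)  # 3 original features + 3 neighbouring predicted labels
```

The extended dataset `X_ext` would then be used to train a second-stage learner, which sees both the raw features and the (noisy, cross-validated) predictions for nearby examples.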

In the initial results, the authors show that stacked sequential learning reduces the error of the maximum-entropy (ME) learner from 3.20% to 2.63%. They further achieve an error rate of 0.71% with the stacked s-ME technique using a window size of 20. Compared to the CRF's error rate of 1.17%, this is a statistically significant improvement.