Nschneid writeup of Cohen 2005
This is Nschneid's review of Cohen_2005_stacked_sequential_learning
Stacked learning for sequence-labeling tasks. (The same framework is used for trees in Martins et al. 2008, "Stacking Dependency Parsers".) Experiments demonstrate its effectiveness for partitioning tasks (segmentation where segments are generally long, so labels are "sticky"). The authors argue this is due neither to label bias nor to observation bias (cf. Klein & Manning 2002), but to a mismatch between training data, where nearby labels are known gold labels, and test data, where nearby labels are (possibly incorrect) predictions.
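To make the recipe concrete, here is a minimal sketch of the cross-validation step that produces those "noisy" neighboring labels for the second-level learner. The `train_fn`/`.predict` interface is a placeholder assumption, not the paper's code:

```python
def crossval_predictions(sequences, gold_labels, train_fn, K=5):
    """For each training sequence, obtain labels predicted by a base model
    that never saw that sequence, mimicking test-time conditions.

    train_fn(train_sequences, train_labels) is assumed to return a model
    with a .predict(sequence) method (a placeholder interface).
    """
    n = len(sequences)
    preds = [None] * n
    for k in range(K):
        held_out = [i for i in range(n) if i % K == k]
        kept = [i for i in range(n) if i % K != k]
        # Train the base learner (e.g. an MEMM) on everything but this fold.
        model = train_fn([sequences[i] for i in kept],
                         [gold_labels[i] for i in kept])
        # Label the held-out fold with a model that never saw it.
        for i in held_out:
            preds[i] = model.predict(sequences[i])
    return preds
```

The second-level learner is then trained on the original features augmented with these predictions (still paired with the gold labels), so the neighboring labels it conditions on during training are as noisy as the ones it will see at test time.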
- Some initial experiments on a named entity recognition problem suggest that sequential stacking does not improve performance on non-partitioning problems; however, in future work, we plan to explore this issue with more detailed experimentation.
Well, does it? :)
In the experiments in the paper, stacking seems to help a lot for MEMMs, and somewhat for CRFs as well. This is because stacking allows features over many labels that would be nonlocal in the original model: 20 labels in some experiments, versus the 2 allowed by the model's independence assumptions (I take it these are bigram CRFs). It would have been nice to see a discussion of why directly estimating a higher-order CRF is intractable, or likely to make smoothing difficult. All in all, though, stacking is a pretty clever trick.
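For concreteness, here is one way the extended examples could be built, with each token's features augmented by the predicted labels in a window around it. The window size and feature names are illustrative, not taken from the paper:

```python
def extend_with_predictions(token_features, predicted_labels, window=5):
    """Augment each token's feature dict with the (cross-validated) predicted
    labels of its neighbors, from -window to +window positions away."""
    extended = []
    for i, feats in enumerate(token_features):
        new_feats = dict(feats)
        for offset in range(-window, window + 1):
            j = i + offset
            label = predicted_labels[j] if 0 <= j < len(predicted_labels) else "<PAD>"
            new_feats[f"pred[{offset:+d}]"] = label
        extended.append(new_feats)
    return extended
```

At test time, the base model's own predictions fill the `predicted_labels` argument, so the stacked model gets to condition on 2*window+1 predicted labels per token, far more context than the label bigram the base model can see.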
- If you used stacked sequential learning with naive Bayes as the base learner and K=10, how many times slower would it be than just running the base learner?
- Let N be the size of the training data. As stated in the paper, stacked learning with K cross-validation folds for the base model requires training K+2 models (the K fold models, the full base model, and the stacked model), so a first estimate is that it takes about K+2 times as long as the non-stacked model, assuming each training run takes roughly the same time. Recall, however, that each cross-validation fold is trained with a Kth of the data held out; so if the base learner's training time is some constant T times the size of its training data, the fold models together take K*(T*N*(K-1)/K) = T*N*(K-1), fully training the base model takes T*N, and training the stacked model takes another T*N. The total is therefore (K-1)+2 = K+1 times the time required without stacking; for K=10, that is 11 times slower. (This does not count the time required to decode with the cross-validation models so as to produce the extended dataset for training the second-level model.)
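A quick sanity check of that arithmetic (T and N cancel, so only the multiplier matters); the function name is just for illustration:

```python
def relative_training_cost(K):
    """Training cost of stacked sequential learning relative to the base
    learner alone, assuming training time is linear in the amount of data."""
    fold_models = K * (K - 1) / K  # K fold models, each on (K-1)/K of the data
    base_model = 1                 # base model retrained on all the data
    stacked_model = 1              # second-level model on the extended data
    return fold_models + base_model + stacked_model

print(relative_training_cost(10))  # 11.0, i.e. K + 1 times the base learner
```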