Sgardine writesup Cohen 2005 stacked sequential learning

This is a review of Cohen_2005_stacked_sequential_learning by user:sgardine.

Summary

MEMMs are surprisingly unsuccessful at sequential partitioning tasks (paradigmatically, signature detection in emails), at least in part because the training data contains an extremely strong signal for a run of tags to continue once it has begun; the test data, in contrast, may yield low-confidence transitions which then become part of the highly weighted history. Stacking first trains K learners on a K-fold cross-validation partition of the training data. An overall learner is then trained on an extended dataset formed by combining each training instance with the predictions of the learner whose fold held that instance out; how much influence the predictions have on the extended instance is configurable via the size of the window of neighboring predictions included. Stacked ME learners outperformed unstacked ME, MEMMs, and (unstacked) CRFs on sequential tasks; stacked CRFs also outperformed the unstacked algorithms. Large windows helped the stacked learners considerably.
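
To make the procedure concrete, here is a minimal sketch of that recipe in scikit-learn-style Python (not Cohen's implementation). It assumes the instances are numpy arrays forming a single sequence with numeric labels; the names stacked_fit, window_features, and stacked_predict are my own.

  import numpy as np
  from sklearn.base import clone
  from sklearn.model_selection import KFold

  def window_features(pred, window):
      # Stack shifted copies of the prediction sequence as extra columns,
      # zeroing positions that would wrap around the sequence boundary.
      cols = []
      for offset in range(-window, window + 1):
          shifted = np.roll(pred, offset)
          if offset > 0:
              shifted[:offset] = 0
          elif offset < 0:
              shifted[offset:] = 0
          cols.append(shifted)
      return np.column_stack(cols)

  def stacked_fit(base_learner, meta_learner, X, y, K=5, window=2):
      # Step 1: K-fold cross-validation predictions, so every instance is
      # labeled by a learner that never saw it during training.
      cv_pred = np.zeros(len(y))
      for train_idx, test_idx in KFold(n_splits=K).split(X):
          fold_clf = clone(base_learner).fit(X[train_idx], y[train_idx])
          cv_pred[test_idx] = fold_clf.predict(X[test_idx])
      # Step 2: extend each instance with a window of neighboring predictions.
      X_ext = np.hstack([X, window_features(cv_pred, window)])
      # Step 3: f is trained on all of S; f' is trained on the extended S'.
      f = clone(base_learner).fit(X, y)
      f_prime = clone(meta_learner).fit(X_ext, y)
      return f, f_prime

  def stacked_predict(f, f_prime, X, window=2):
      # Inference: one pass with f, then one pass with f' over the extended instances.
      pred = f.predict(X)
      return f_prime.predict(np.hstack([X, window_features(pred, window)]))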

Commentary

The classification threshold θ of the MEMM is discussed, and a lower bound on MEMM error is found by using the threshold θ* which minimizes classification error on the test set (a kind of oracle). The suboptimal choice of θ used otherwise is not discussed; I especially noticed this while wondering why the suboptimality of the MEMM in the noise-added scenario dropped to zero.

I liked the discussion of the training/test mismatch issue and its particular importance in NL partitioning tasks. That discussion provides a much more natural motivation for stacking as the specific means of addressing the shortcomings of MEMMs.

The description of stacking does not require that the learning algorithm A used in the f = A(S) step be the same algorithm used in the f' = A(S') step. It could be interesting to compare different learning algorithms placed "atop" a fixed learning algorithm (or even a fixed learned classifier) and its corresponding K cross-fold classifiers. It occurs to me that, as a special case, an interesting evaluation measure for stacking could be calculated by setting A atop an oracle classifier for the context.
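
For instance, with the hypothetical stacked_fit/stacked_predict sketch above, mixing learners is just a matter of passing different estimators for the two roles; X_train, y_train, and X_test are assumed to exist:

  from sklearn.naive_bayes import MultinomialNB
  from sklearn.linear_model import LogisticRegression

  # A maxent-style logistic regression placed "atop" K cross-fold Naive Bayes
  # learners (illustrative; assumes non-negative features for MultinomialNB).
  f, f_prime = stacked_fit(base_learner=MultinomialNB(),
                           meta_learner=LogisticRegression(max_iter=1000),
                           X=X_train, y=y_train, K=10, window=2)
  y_hat = stacked_predict(f, f_prime, X_test, window=2)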

Discussion of the case where A is Naive Bayes and K=10

I would expect training a stacked Naive Bayes learner to take about K+2 = 12 times as long as training a single Naive Bayes learner: we have to train K=10 individual classifiers, plus f (+1) on the entire dataset (the same learner we would have learned in the single case) and f' (+1) on the extended dataset. We could optimize slightly by starting with one of our K classifiers and simply adding in the counts from its left-out partition. This tweak would still leave us with 11+ times the training time, so it probably wouldn't be worth doing.
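
A small sketch of the count-sharing optimization mentioned above, assuming a multinomial Naive Bayes whose sufficient statistics are plain count matrices (my own illustration, not from the paper):

  import numpy as np

  def nb_counts(X, y, n_classes):
      # Naive Bayes sufficient statistics: per-class feature counts and class counts.
      feat_counts = np.array([X[y == c].sum(axis=0) for c in range(n_classes)])
      class_counts = np.array([(y == c).sum() for c in range(n_classes)])
      return feat_counts, class_counts

  def combine(stats_a, stats_b):
      # Counts over disjoint subsets of the data simply add, so the full-data
      # classifier f can be assembled from one fold classifier's counts plus the
      # counts of its held-out partition, instead of being retrained from scratch.
      return stats_a[0] + stats_b[0], stats_a[1] + stats_b[1]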

At inference time, I would expect the stacked-NB classifier to take twice as long as the single NB classifier -- once to classify the sequence with f, once to classify with f' based on the predictions of f.