Sgardine writeup of Freund and Schapire

This is a review of Freund_1999_large_margin_classification by user:sgardine.

Summary

This paper presents a relatively simple algorithm which provably learns linearly separable boundaries with bounded error. The voted perceptron trains a prediction vector over one or more passes through the training data, storing each intermediate learned vector along with a weight equal to the number of examples it classified correctly before its next mistake, and then classifies new instances by the weighted majority vote of the stored vectors. Some theorems show that the voted-perceptron algorithm has error bounded by an expression which depends on the training set size, the size of the margin, and the radius of the ball (centered at the origin) containing the training set, but, crucially, not on the dimensionality of the space. For data which are not linearly separable, a kernel (e.g. the polynomial expansion kernel used here) may be used to map the instances implicitly into a higher-dimensional space where they may become linearly separable. The perceptron was trained on digit-recognition data with different values of d for the polynomial expansion and with several ways of aggregating the perceptron predictions: voting, averaging the prediction vectors, using the last vector (i.e. the single-perceptron case), and using a random vector (chosen i.i.d. from a distribution weighted like the voting weights). Voting and averaging performed best, especially with d>1 providing a high enough dimension to make the data linearly separable; all variants underperformed SVM, albeit at lower implementation cost.
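
For my own reference, here is a minimal sketch of the voted perceptron as I understand it from the paper. This is only an illustration, not the authors' code: the variable names are mine, labels are assumed to be in {-1, +1}, and I include the averaged variant alongside the vote.

```python
import numpy as np

def train_voted_perceptron(X, y, T=1):
    """One or more passes over (X, y); labels in {-1, +1}.
    Returns a list of (prediction_vector, survival_count) pairs."""
    n, d = X.shape
    v = np.zeros(d)      # current prediction vector
    c = 0                # number of examples it has survived
    perceptrons = []     # retired (vector, weight) pairs
    for _ in range(T):
        for i in range(n):
            if y[i] * np.dot(v, X[i]) <= 0:        # mistake
                perceptrons.append((v.copy(), c))  # retire current vector with its count
                v = v + y[i] * X[i]                # standard perceptron update
                c = 1
            else:
                c += 1
    perceptrons.append((v.copy(), c))              # keep the final vector too
    return perceptrons

def predict_voted(perceptrons, x):
    """Weighted majority vote of the stored prediction vectors."""
    s = sum(c * np.sign(np.dot(v, x)) for v, c in perceptrons)
    return 1 if s >= 0 else -1

def predict_averaged(perceptrons, x):
    """The 'average' variant: sign of the weighted sum of the unthresholded
    outputs, equivalent to voting with the weighted sum of the vectors."""
    s = sum(c * np.dot(v, x) for v, c in perceptrons)
    return 1 if s >= 0 else -1
```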

Commentary

I don't quite understand the authors' surprise (stated several times) that the voted-perceptron algorithm continues to improve for T>1. It seems to me they have an upper bound for T→∞ (Theorem 4) and an upper bound for T=1 (Corollary 2), both of which depend on k; furthermore, as they observe, k is smaller for T=1, so the upper bound is in fact tighter in that special case. But it is entirely consistent with this theory to observe that either or both classifiers perform much better than their upper bounds in practice (it's as if an oracle told you that Joe's age was under 150 and his mother's under 120, and you concluded that Joe was likely older than his mother). In fact, common sense suggests that the perceptron would continue to improve on subsequent passes through the data before the overfitting at the convergence point (which the voting setup was designed to counteract) overpowers other contributions. I would expect that some relatively small T would suffice to get a good model which does not overfit the training data, but I am unsurprised that T=1 is too small. Their data seem to support this, since the voted perceptron appears to stop improving by (at most) T=10 or so.

I must also be missing something in the conclusion that the number of mistakes does not depend on the dimensionality of the problem: if x∈ℜn and z∈ℜm with m>n, and all components are equally likely to be nonzero (i.e. without sparsity constraints), then on average ||z||>||x||, and thus the radius of the ball enclosing the datapoints (R in Theorem 1 onward) will increase in the higher-dimensional space. Put another way, the longest diagonal of the unit hypercube grows as the square root of the dimensionality. Nor do the results of Vapnik alluded to (page 9) address this, since the components of a higher-dimensional vector still have more opportunities to disagree with the optimal vector. I'm guessing I'm missing something.
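
For concreteness, the bound I have in mind is the classical one (Theorem 1 in the paper, the Block–Novikoff bound); if I am reading it correctly it has the form

\[
k \;\le\; \left(\frac{R}{\gamma}\right)^{2},
\qquad R = \max_i \|x_i\|,
\qquad \gamma = \min_i \, y_i\,(u \cdot x_i) \ \text{for the maximizing unit vector } u,
\]

so the dimensionality never appears explicitly, but (as argued above) both R and γ can certainly change when the instances are mapped into a higher-dimensional space.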

I liked the very clear discussion of the use of kernels, and the note that failing to use one (i.e. d=1) causes the algorithm to proceed very slowly because the data are not linearly separable.
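
As a note to myself, the kernel trick here amounts to replacing every inner product with k(x, z) = (1 + x·z)^d, so the perceptron never has to form the expanded vectors explicitly. Below is a rough sketch of the basic (last-vector) kernelized perceptron, with my own names rather than the authors' code; the voting/averaging bookkeeping from the earlier sketch would carry over by attaching survival counts to the stored mistakes.

```python
import numpy as np

def poly_kernel(x, z, d=4):
    """Polynomial expansion kernel (1 + x.z)^d, as used in the paper."""
    return (1.0 + np.dot(x, z)) ** d

def train_kernel_perceptron(X, y, T=1, d=4):
    """Kernelized perceptron: instead of an explicit weight vector in the
    expanded space, store the indices of the examples that caused mistakes."""
    mistakes = []
    for _ in range(T):
        for i in range(len(X)):
            s = sum(y[j] * poly_kernel(X[j], X[i], d) for j in mistakes)
            if y[i] * s <= 0:          # mistake in the implicit expanded space
                mistakes.append(i)
    return mistakes

def predict_kernel(mistakes, X, y, x, d=4):
    """Classify x using the stored mistake examples."""
    s = sum(y[j] * poly_kernel(X[j], x, d) for j in mistakes)
    return 1 if s >= 0 else -1
```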