Liuy writeup of Collins 2002
This is a review of Collins 2002 by user:Liuy.
This paper proposes a perceptron-style training algorithm based on Viterbi decoding of the training examples: the highest-scoring candidate under the current parameters is found by Viterbi search, and the update rule is a simple additive one. The paper also suggests the voted perceptron, where every intermediate parameter setting gets a vote for the output and the majority wins. For the suggested algorithm, the main results in the paper are derived by a modification of the perceptron mistake-bound proof. The main modification is the introduction of a term D_{U, \delta}: a measure of how close U is to separating the training data with margin \delta. When the training data is close to being separable with margin \delta, the mistake bound is very close to the perceptron's.
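To make the update concrete, here is a minimal sketch of the training loop as I understand it; viterbi_decode and phi are hypothetical placeholders for the paper's search over GEN(x) and the feature map \Phi(x, y), and the code is my own reconstruction rather than the authors' implementation:

    import numpy as np

    def train_structured_perceptron(examples, phi, viterbi_decode, dim, epochs=5):
        # examples: list of (x, y) pairs; phi(x, y) returns a length-dim feature vector.
        w = np.zeros(dim)
        for _ in range(epochs):
            for x, y in examples:
                z = viterbi_decode(x, w)       # best candidate under the current parameters
                if z != y:                     # on a mistake, take a simple additive step
                    w += phi(x, y) - phi(x, z)
        return w

The averaged variant used in the experiments would additionally keep a running sum of w over all updates and return the average at the end.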
I like the work for the elegant modifications made to an existing algorithm and for the associated theoretical justification. One of the most meaningful findings of the paper is that the number of mistakes is independent of the number of candidates and depends only on the separability of the training data. NLP tasks often have a number of candidates that is exponential in the size of the input, so if the number of mistakes depends solely on separability rather than on the number of candidates, a lot of computation is saved. The mistake bounds in the non-realizable case (for example, with bounded noise or uniform noise) and the generalization error bound are likewise modifications of the perceptron's, obtained by introducing the term D_{U, \delta}.
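For reference, the quantity driving these bounds, as I read the paper, is built from per-example slacks: for a unit-norm U and margin \delta > 0,

    \epsilon_i = \max\{0, \delta - (U \cdot \Phi(x_i, y_i) - \max_{z \in GEN(x_i), z \neq y_i} U \cdot \Phi(x_i, z))\},
    D_{U, \delta} = \sqrt{\sum_i \epsilon_i^2},

and, if I recall the theorem correctly, the number of mistakes is bounded by \min_{U, \delta} (R + D_{U, \delta})^2 / \delta^2, where R bounds the norm of the feature differences \Phi(x_i, y_i) - \Phi(x_i, z). When D_{U, \delta} = 0 (the separable case), this recovers the usual perceptron bound R^2 / \delta^2.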
However, I am concerned about the following limitations of the paper: 1. The i.i.d. assumption (that training and test examples are drawn from an unknown distribution over the set X \times Y) is unrealistic in practice. There is no reason to assume the training data are drawn from the same distribution as the test points, and this assumption is critical in obtaining the convergence results; for instance, k-NN does not have uniform convergence results in the non-i.i.d. case.
2. The evaluation compares a maximum-entropy model and the perceptron algorithm (with averaged and unaveraged parameter vectors) on POS tagging and base noun phrase chunking. I do not think the combinations of different parameter settings have been fully explored: although some optimization of the free parameters is done, the choice of parameter ranges is based on heuristics.
3. How the features in Figure 3 are obtained from the raw data is not clear to me.