Nlao writeup of Collins 2002

This is a review of Collins 2002 by user:Nlao.

Let x, y, and y_hat be the observation sequence, the true label sequence, and the predicted label sequence, respectively. The proposed algorithm maximizes the following discriminative objective function:

O(w) = log P(y|x; w) - log P(y_hat|x; w)

Since y_hat = argmax_{y'} log P(y'|x; w), the above function is not differentiable everywhere.
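
To connect the objective to the update rule (a step the writeup leaves implicit), suppose P(y|x; w) is log-linear over a feature map Phi(x, y); the feature map Phi is a standard assumption here, not notation from the writeup. The log partition function then cancels in O(w):

O(w) = w·Phi(x, y) - w·Phi(x, y_hat)

and, holding y_hat fixed between updates, the gradient is

∇O(w) = Phi(x, y) - Phi(x, y_hat)

which is exactly the perceptron update Collins uses.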

Here, Collins optimizes O(w) with simple perceptron updates and parameter averaging; the method comes with a mistake-bound convergence guarantee in the separable case, and the averaging acts as implicit regularization. The stochastic updates are cheap and well suited to online learning. However, a second-order method can be preferable when O(w) has very different curvature along different axes. A minimal sketch of the averaged update appears below.
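
The following sketch shows perceptron updates with parameter averaging; all names in it (phi, viterbi_decode, averaged_perceptron) are illustrative assumptions, not Collins' actual code.

 import numpy as np
 
 def averaged_perceptron(data, phi, viterbi_decode, n_features, epochs=5):
     """Structured perceptron with parameter averaging, in the style of Collins 2002.
 
     data           : list of (x, y) pairs (observation sequence, true label sequence)
     phi            : feature map phi(x, y) -> np.ndarray of length n_features
     viterbi_decode : returns argmax_{y'} w . phi(x, y') (assumed available)
     """
     w = np.zeros(n_features)      # current weights
     w_sum = np.zeros(n_features)  # running sum of weights for averaging
     n_steps = 0
     for _ in range(epochs):
         for x, y in data:
             y_hat = viterbi_decode(x, w)  # current best prediction
             if y_hat != y:
                 # Perceptron step: move toward true features, away from predicted ones
                 w += phi(x, y) - phi(x, y_hat)
             w_sum += w
             n_steps += 1
     return w_sum / n_steps  # averaged parameters

Note that averaging returns the mean of all intermediate weight vectors rather than the final w, which reduces the variance of the stochastic updates.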

With a more complex implementation, one can use second-order optimization methods with explicit regularization, as in M3N (Taskar et al., 2003; Jun & Xing, 2009).

[minor points]