Liuliu writeup of Collins

This is a write-up of Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms, Collins, EMNLP 2002, by user:Liuliu

This paper extends the original HMM in two ways: it incorporates richer features (feature vectors, as in maximum-entropy models) and it uses a discriminative training method (the perceptron algorithm applied during Viterbi decoding). It is closely related to the perceptron algorithm paper [Freund and Schapire 1999] that we discussed this Monday: the training algorithm, the update rule, and the three theorems are all very similar. I'd like to summarize this paper by comparing it with [Freund and Schapire 1999]. I will also compare both with boosting, since they have something in common, and it is interesting that Robert Schapire and Yoav Freund are the same people who proposed the original boosting algorithm.
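To make the training procedure concrete, here is a minimal sketch of the perceptron training loop the paper describes. The helpers viterbi_decode(x, w) and phi(x, y) (returning feature counts as a dictionary) are placeholder names I am assuming, not code from the paper:

 from collections import defaultdict

 def train_structured_perceptron(data, phi, viterbi_decode, epochs=5):
     # w maps each feature to its weight; all weights start at zero
     w = defaultdict(float)
     for _ in range(epochs):
         for x, y_gold in data:             # x: sentence, y_gold: gold tag sequence
             y_pred = viterbi_decode(x, w)  # best tag sequence under current weights
             if y_pred != y_gold:
                 # additive update: gold feature counts minus predicted feature counts
                 for f, c in phi(x, y_gold).items():
                     w[f] += c
                 for f, c in phi(x, y_pred).items():
                     w[f] -= c
     return w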

Differences

Work on different problems

  • Freund and Schapire 1999: a binary classification problem; the output is either positive or negative.
  • Collins 2002: POS tagging and NP chunking. He also proposes a general framework for using the perceptron algorithm in tagging and parsing.
  • Boosting: classification and regression.

What to update during training

  • Freund and Schapire 1999: the prediction vector, which is used to predict the sign of new instances. On a mistake, the prediction vector is adjusted by the current training point, adding y_i*x_i (i.e. +x_i or -x_i depending on the label); see the toy sketch after this list.
  • Collins 2002: the parameter vector of a maximum-entropy-style tagger. The parameters are adjusted by the difference between the gold feature-vector counts and the feature-vector counts of the predicted tag sequence. Here, a training instance has a richer representation: a feature vector.
  • Boosting: the weights of the training examples rather than the parameters of the weak learners.
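
For contrast with the structured update sketched earlier, the binary perceptron update in Freund and Schapire 1999 looks roughly like this (a toy sketch; the variable names are mine, not theirs):

 import numpy as np

 def binary_perceptron_update(w, x_i, y_i):
     # y_i is +1 or -1, x_i is the instance's feature vector
     if y_i * np.dot(w, x_i) <= 0:      # mistake: wrong sign (or zero margin)
         w = w + y_i * x_i              # move the prediction vector toward the point
     return w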

Similarities

All three algorithms learn their parameters in an iterative way.

Other points

Both Freund and Schapire 1999 and Collins 2002 prove that, in the separable case, the number of mistakes depends only on the margin (and a bound on the norm of the feature vectors), which might explain why this method is so efficient compared with the traditional estimation method used by the maximum-entropy tagger. I also have some questions about the experimental results: why does the ME model suffer when all features are included? A deeper exploration of how ME uses GIS to estimate its parameters might help. Overall, I like this paper, and especially the way it transfers the perceptron algorithm to NLP problems.
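
For reference, the bound both papers rely on has roughly the following form in the separable case (I am stating it from the classical perceptron convergence theorem, so see the papers for the exact conditions): if some unit-norm weight vector separates the training data with margin \delta, and R bounds the norm of the relevant feature vectors, then

 \[
   \text{number of mistakes} \;\le\; \frac{R^2}{\delta^2}
 \]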