Nschneid writeup of McCallum et al. 2000
The MEMM paper. MEMMs are a precursor to linear-chain CRFs: each next-state distribution is a locally normalized maxent model conditioned on the current observation and the previous state, whereas a CRF defines a single globally normalized distribution over the whole label sequence. Does a nice job of arguing for the advantages of conditioning on observations and of allowing overlapping, non-independent features. Clear presentation of the modified forward-backward algorithm, training with GIS, and some variants (Baum-Welch EM for the semi-supervised case, and a reinforcement learning variant). Does not address the label bias problem, which was identified later (Lafferty et al., 2001) as a motivation for CRFs.
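To make the local-normalization point concrete, here is a minimal sketch of an MEMM with hand-set weights and Viterbi decoding. The states, features, and weights are invented for illustration (the paper's FAQ task used line-based states and binary line features, but the exact values here are not from the paper); a real model would learn the weights with GIS.

```python
import math

# Toy MEMM: for each previous state s, a maxent (softmax) model gives
# P(s' | s, o). Each row is normalized locally, per previous state --
# this is the property that distinguishes MEMMs from globally
# normalized CRFs. States, features, and weights are illustrative.

STATES = ["head", "body"]
FEATURES = ["has_colon", "is_indented"]

# weights[prev_state][next_state][feature] -- hand-set, not learned.
weights = {
    "head": {"head": {"has_colon": 2.0, "is_indented": -1.0},
             "body": {"has_colon": -2.0, "is_indented": 1.0}},
    "body": {"head": {"has_colon": 1.5, "is_indented": -1.5},
             "body": {"has_colon": -1.5, "is_indented": 1.5}},
}

def transition_probs(prev, obs):
    """Locally normalized P(s' | prev, obs): a softmax per previous state."""
    scores = {s: sum(weights[prev][s][f] * obs.get(f, 0.0) for f in FEATURES)
              for s in STATES}
    z = sum(math.exp(v) for v in scores.values())
    return {s: math.exp(v) / z for s, v in scores.items()}

def viterbi(observations, start="head"):
    """Most likely state sequence under the MEMM (log-space Viterbi)."""
    delta = {s: math.log(transition_probs(start, observations[0])[s])
             for s in STATES}
    backpointers = []
    for obs in observations[1:]:
        new_delta, ptr = {}, {}
        for s in STATES:
            best_prev = max(
                STATES,
                key=lambda p: delta[p] + math.log(transition_probs(p, obs)[s]))
            new_delta[s] = (delta[best_prev]
                            + math.log(transition_probs(best_prev, obs)[s]))
            ptr[s] = best_prev
        delta = new_delta
        backpointers.append(ptr)
    # Backtrace from the best final state.
    last = max(STATES, key=lambda s: delta[s])
    path = [last]
    for ptr in reversed(backpointers):
        path.append(ptr[path[-1]])
    return list(reversed(path))
```

Because each `transition_probs` row sums to one regardless of how well the observation fits, probability mass is only redistributed among successors, never withheld; that is exactly the structure behind the label bias problem the writeup notes the paper does not address.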
Experiments on a FAQ segmentation-classification task with a small corpus show that the MEMM fares better than several HMM variants. However, I would have liked to see more experiments, such as on POS tagging and NER.