Krishnan 2006 an effective two stage model for exploiting non local dependencies in named entity recognition
Krishnan, V. and Manning, C. D. 2006. An Effective Two-Stage Model for Exploiting Non-Local Dependencies in Named Entity Recognition. In ACL-COLING’06: Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics.
This paper presents a simple and efficient two-stage Stacked Sequential Learning approach that captures non-local dependencies in Named Entity Recognition (NER). The non-local dependencies the authors target are that identical or similar tokens (or token sequences) are more likely to share the same label. Modeling such dependencies directly is difficult because it makes inference harder. The proposed method therefore captures them indirectly, in two stages. In the first stage, a conventional sequence CRF is used to approximate aggregate statistics of labels. A second CRF then uses functions of those approximate statistics as features. For a given token or entity (as labeled by the first CRF), the second CRF is encouraged to assign the majority label given to (1) the same token, (2) the same entity, and (3) entities whose token sequence includes the current token sequence, either (a) in the same document or (b) in the whole corpus. Compared with previous models that tried to capture non-local dependencies directly, this method achieved a higher relative error reduction.
Brief description of the method
There are two Conditional Random Fields (CRFs) in this method. The first is a conventional linear-chain CRF. The local context it uses is the tokens before and after the current one. The authors use features known to be effective in NER, such as the current, previous, and next words, character n-grams of the current word, etc. For the full list of features, see the appendix.
The second CRF uses features that are functions of aggregate statistics of the labels obtained from the first CRF. Specifically, there are the following three types:
1. Token-majority features: these features refer to the majority label assigned to the particular token in the document/corpus. These capture dependencies between similar token sequences, especially token sequences that have common words between them.
2. Entity-majority features: these features refer to the majority label assigned to the particular entity (as labeled by the first CRF) in the document/corpus. If the first CRF did not label the token as part of a named entity, the feature returns the majority label assigned to single-token named entities consisting of the current token. These capture dependencies between identical token sequences.
3. Superentity-majority features: these features refer to the majority label assigned to the supersequences of the particular entity in the document/corpus. If the first CRF did not label the token as part of a named entity, the feature returns the majority label assigned to all entities containing the token. These capture dependencies between a superentity and its subentities. It makes sense to use only superentities as features because a longer token sequence provides more contextual cues.
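The three feature types can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`token_majority`, `entity_majorities`, etc.) are hypothetical, labels are assumed to be plain types such as `PER` or `O` without BIO prefixes (so adjacent entities of the same type are merged), and the special handling of tokens not labeled as entities is omitted.

```python
from collections import Counter, defaultdict

def majority(counter):
    """Most frequent label in a Counter, or None if it is empty."""
    return counter.most_common(1)[0][0] if counter else None

def token_majority(docs):
    """docs: list of documents, each a list of (token, first_stage_label).
    Returns per-document and corpus-wide token -> majority label maps."""
    corpus_counts = defaultdict(Counter)
    doc_level = []
    for doc in docs:
        counts = defaultdict(Counter)
        for tok, lab in doc:
            counts[tok][lab] += 1
            corpus_counts[tok][lab] += 1
        doc_level.append({t: majority(c) for t, c in counts.items()})
    corpus_level = {t: majority(c) for t, c in corpus_counts.items()}
    return doc_level, corpus_level

def entities(doc):
    """Yield (token_tuple, label) for each maximal run of one non-'O' label."""
    run, run_label = [], None
    for tok, lab in list(doc) + [(None, "O")]:  # sentinel flushes the last run
        if lab != "O" and lab == run_label:
            run.append(tok)
        else:
            if run:
                yield tuple(run), run_label
            run, run_label = ([tok], lab) if lab != "O" else ([], None)

def strictly_contains(longer, shorter):
    """True if `shorter` is a contiguous proper subsequence of `longer`."""
    n = len(shorter)
    return n < len(longer) and any(
        longer[i:i + n] == shorter for i in range(len(longer) - n + 1))

def entity_majorities(doc):
    """Return (entity -> majority label, entity -> majority superentity label)."""
    ents = list(entities(doc))
    ent_counts, super_counts = defaultdict(Counter), defaultdict(Counter)
    for seq, lab in ents:
        ent_counts[seq][lab] += 1
    for seq in ent_counts:
        for other, lab in ents:
            if strictly_contains(other, seq):
                super_counts[seq][lab] += 1
    return ({s: majority(c) for s, c in ent_counts.items()},
            {s: majority(c) for s, c in super_counts.items()})

# Toy first-stage output in which "Clinton" alone was mislabeled as LOC.
doc = [("Bill", "PER"), ("Clinton", "PER"), ("said", "O"),
       ("Clinton", "LOC"), ("and", "O"),
       ("Bill", "PER"), ("Clinton", "PER"), ("agreed", "O")]
ent_major, super_major = entity_majorities(doc)
# super_major maps ("Clinton",) to "PER": the longer entity "Bill Clinton"
# votes for PER, which is the signal the second CRF can use as a feature.
```

In this toy document the entity-majority label for the lone "Clinton" is still `LOC`, while the token-majority and superentity-majority labels are both `PER`, which shows why the three feature types carry different evidence.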
One thing to note is that this model also captures non-local dependencies at the corpus level. Previous methods could not do this because they model non-local dependencies directly, which makes inference intractable once corpus-level dependencies are added. Also, while the method proposed in Finkel et al. (ACL 2005) increased the running time by a factor of 30 over the sequential CRF, this method needs only the time required to run two sequential CRFs.
When training this model, we need the predictions of the first CRF on the training data. To prevent overfitting, these predictions are obtained by 10-fold cross-validation. At test time, however, the first CRF is trained on all of the training data.
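The cross-validation step can be sketched as follows. This is an illustrative outline only: `cross_validated_predictions`, `train_fn`, and `predict_fn` are hypothetical names, and the round-robin fold assignment is an assumption (the source does not say how the folds were formed).

```python
def cross_validated_predictions(examples, train_fn, predict_fn, k=10):
    """Predict every training example with a model that never saw it.

    examples   : list of (x, y) pairs (e.g. one pair per sentence or document)
    train_fn   : callable, list of (x, y) pairs -> model
    predict_fn : callable, (model, x) -> prediction
    """
    n = len(examples)
    preds = [None] * n
    for fold in range(k):
        held_out = set(range(fold, n, k))          # round-robin fold assignment
        train = [ex for i, ex in enumerate(examples) if i not in held_out]
        model = train_fn(train)                    # first-stage model for this fold
        for i in held_out:
            preds[i] = predict_fn(model, examples[i][0])
    return preds

# Toy check: the "model" is just the set of inputs it was trained on,
# and "prediction" reports whether the input was seen during training.
examples = [(i, i % 2) for i in range(25)]
seen = cross_validated_predictions(
    examples,
    train_fn=lambda train: {x for x, _ in train},
    predict_fn=lambda model, x: x in model,
)
# Every entry of `seen` is False: no example is predicted by a model trained on it.
```

The predictions returned this way would feed the aggregate statistics used to train the second CRF, so that it sees first-stage errors of roughly the kind it will encounter at test time.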
The authors tested this method on the CoNLL'03 English named entity recognition dataset. Their baseline CRF already achieved a competitive F-measure of 85.29. Adding document-level non-local dependencies achieved a 12.6% relative error reduction over the baseline. Also incorporating non-local dependencies across documents (at the corpus level) achieved a 13.3% relative error reduction. Moreover, despite starting from a higher baseline than the methods of Bunescu and Mooney (ACL 2004) and Finkel et al. (ACL 2005), the proposed method obtained a higher relative error reduction. For more detailed results, check the table below.
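To make the relative error reduction figures concrete, the short calculation below converts them back into approximate F-measures. It assumes the usual convention that the error is 100 minus the F-measure. Because the percentages are rounded, the values reported in the paper may differ slightly in the last digit.

```python
def f1_after_reduction(baseline_f1, relative_error_reduction):
    """F-measure implied by a relative error reduction over a baseline.

    Assumes error = 100 - F1, so the new error is the baseline error
    scaled by (1 - relative_error_reduction).
    """
    baseline_error = 100.0 - baseline_f1
    return 100.0 - baseline_error * (1.0 - relative_error_reduction)

baseline = 85.29                                    # baseline CRF F-measure
doc_level = f1_after_reduction(baseline, 0.126)     # roughly 87.1
corpus_level = f1_after_reduction(baseline, 0.133)  # roughly 87.2
print(round(doc_level, 2), round(corpus_level, 2))
```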
Full list of features for the baseline CRF: the current, previous, and next words; character n-grams of the current word; the part-of-speech tags of the current word and surrounding words; the shallow parse chunk of the current word; the shape of the current word; the surrounding word shape sequence; the presence of a word in a left window of size 5 around the current word; and the presence of a word in a right window of size 5 around the current word.
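A few of these local features can be sketched in Python as below. The sketch is illustrative only: the shape mapping (`X` for uppercase, `x` for lowercase, `d` for digits) is one common convention rather than necessarily the authors', the n-gram lengths and feature names are assumptions, and the part-of-speech and chunk features are left out because they need an external tagger.

```python
def word_shape(word):
    """Map characters to classes: uppercase -> X, lowercase -> x, digit -> d."""
    out = []
    for ch in word:
        if ch.isupper():
            out.append("X")
        elif ch.islower():
            out.append("x")
        elif ch.isdigit():
            out.append("d")
        else:
            out.append(ch)  # keep punctuation as-is
    return "".join(out)

def char_ngrams(word, n_min=2, n_max=3):
    """Character n-grams of the word padded with boundary markers."""
    padded = "<" + word + ">"
    return [padded[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(padded) - n + 1)]

def token_features(tokens, i, window=5):
    """Feature dict for tokens[i]; boundary words use <S> and </S>."""
    feats = {
        "word": tokens[i],
        "prev_word": tokens[i - 1] if i > 0 else "<S>",
        "next_word": tokens[i + 1] if i + 1 < len(tokens) else "</S>",
        "shape": word_shape(tokens[i]),
    }
    for gram in char_ngrams(tokens[i]):
        feats["ngram=" + gram] = True
    for w in tokens[max(0, i - window):i]:       # up to 5 words to the left
        feats["left_window=" + w] = True
    for w in tokens[i + 1:i + 1 + window]:       # up to 5 words to the right
        feats["right_window=" + w] = True
    return feats

feats = token_features(["Bill", "Clinton", "visited", "IL-2", "labs"], 3)
# feats["shape"] is "XX-d" and feats["prev_word"] is "visited".
```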