Kuksa and Qi, SIAM 2010

Citation

Kuksa, P. and Qi, Y. Semi-supervised bio-named entity recognition with word-codebook learning. Proceedings of the Tenth SIAM International Conference on Data Mining (SDM 2010). 2010.

Online Version

https://zeno.siam.org/proceedings/datamining/2010/dm10_003_kuksap.pdf

Summary

This paper describes an approach for generating new features in an unsupervised or semi-supervised manner to improve named entity recognition for biomedical texts. Biomedical research texts tend to contain many rare entities that are absent from the training set, which makes dictionary features less useful. The technique described here, word-codebook learning (WCL), leverages unlabeled text to determine which words are likely to appear in which contexts, i.e., which phrases are natural. Words occurring in similar contexts are then essentially clustered, and each cluster is represented by a single vector, the word code. These word codes constitute a codebook that is used as an additional input feature to a classification model.

One WCL approach, the Language Model (LM) approach, is completely unsupervised and focuses on grouping words by similar contexts. A multi-layer neural network is trained that first maps words into a vector space (the embedding step). A sliding window of words is then fed through the layers of the network to produce a scalar score. The network is trained to minimize an objective that penalizes scoring a real phrase (the window of words with the key word in the middle) similarly to a corrupted phrase in which the key word is replaced by a random one. These real and corrupted windows provide the positive and negative examples for training the network.
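A rough sketch of this ranking objective is given below, in the style of Collobert and Weston; the window length, layer sizes, and margin are illustrative assumptions, not the paper's reported settings.

 import torch
 import torch.nn as nn

 vocab_size, embed_dim, window, hidden = 10000, 50, 5, 100

 embed = nn.Embedding(vocab_size, embed_dim)       # embedding step: word id -> vector
 scorer = nn.Sequential(                           # window of vectors -> scalar score
     nn.Linear(window * embed_dim, hidden),
     nn.Tanh(),
     nn.Linear(hidden, 1),
 )

 def score(window_ids):
     # window_ids: (batch, window) tensor of word indices
     vecs = embed(window_ids).view(window_ids.size(0), -1)
     return scorer(vecs).squeeze(-1)

 def ranking_loss(true_windows, corrupt_windows):
     # A real phrase should score at least 1 higher than the same phrase
     # with its middle word replaced by a random word (hinge loss).
     return torch.clamp(1.0 - score(true_windows) + score(corrupt_windows), min=0).mean()

 # Toy usage with random indices standing in for windows of unlabeled text.
 true = torch.randint(0, vocab_size, (32, window))
 corrupt = true.clone()
 corrupt[:, window // 2] = torch.randint(0, vocab_size, (32,))   # corrupt the middle word
 loss = ranking_loss(true, corrupt)
 loss.backward()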

The other approach, Self Learned Label Patterns (SLLP), is semi-supervised and estimates the probability of a word having a certain label, given that it is the middle word of a sliding window. This approach is more task-oriented than the fully unsupervised one. The probabilities are estimated on unlabeled data, with the class labels guessed by a classifier trained on a separate labeled set. Word identity alone is insufficient for rare words, so context words are also included; these can be very informative, as in "high expression of p51a". To include them, the model is augmented either by looking for the presence of significant neighboring words (the boundary model) or by looking at the labels around the word in question (the n-gram model). The boundary model considers only one neighbor word at a time, whereas the n-gram model considers several. As with the LM model above, the result is a feature vector for each word, in this case a vector of class probabilities.
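The core of the SLLP idea can be sketched as turning labels guessed by a base tagger on unlabeled text into a per-word label-probability vector. The label set and the toy tagger below are placeholders for illustration, not the paper's actual models.

 from collections import Counter, defaultdict

 LABELS = ["B-GENE", "I-GENE", "O"]          # assumed label set, for illustration

 def word_label_distributions(sentences, base_tagger):
     # sentences: lists of tokens from unlabeled text
     # base_tagger: maps a token list to a list of guessed labels
     counts = defaultdict(Counter)
     for tokens in sentences:
         for tok, lab in zip(tokens, base_tagger(tokens)):
             counts[tok.lower()][lab] += 1
     # Normalize the counts into a probability vector per word.
     return {w: [c[l] / sum(c.values()) for l in LABELS] for w, c in counts.items()}

 # Toy usage: a trivial "tagger" that marks tokens containing digits as gene names.
 toy_tagger = lambda toks: ["B-GENE" if any(ch.isdigit() for ch in t) else "O" for t in toks]
 dists = word_label_distributions([["high", "expression", "of", "p51a"]], toy_tagger)
 print(dists["p51a"])   # [1.0, 0.0, 0.0]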

The feature vectors for all words are then clustered, and each word is assigned the label of the cluster it falls into. This codeword is provided as an additional input feature to an NER classifier.
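A minimal sketch of this codebook step follows, using k-means as a stand-in for whichever clustering method the authors applied; the number of codes is an arbitrary choice here.

 import numpy as np
 from sklearn.cluster import KMeans

 def build_codebook(word_vectors, n_codes=50):
     # word_vectors: dict mapping word -> feature vector (embedding or label distribution)
     words = list(word_vectors)
     X = np.array([word_vectors[w] for w in words])
     km = KMeans(n_clusters=n_codes, n_init=10, random_state=0).fit(X)
     # Each word's codeword is the id of the cluster it falls into.
     return {w: int(c) for w, c in zip(words, km.labels_)}

 # codebook = build_codebook(dists)   # word -> codeword id, emitted as an extra feature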

The method was tested on the BioCreative II corpus. The extra features were generated from PubMed abstracts and were then input, along with fairly minimal morphological and contextual features (no dictionaries or part-of-speech tags), into a CRF classifier. The method outperformed a baseline CRF that used self-training rather than WCL, and it performed comparably to other competition entrants that used much more complex feature sets and model hierarchies. The greatest improvements are seen for short gene names (which are likely rarer) and when combining multiple codebooks (the LM and SLLP models together).
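To give a sense of how the codeword enters the classifier, here is an illustrative token feature map of the kind typically fed to a CRF toolkit; the feature names are assumptions, not the authors' exact feature set.

 def token_features(tokens, i, codebook):
     # Minimal morphological/contextual features plus the WCL codeword feature.
     w = tokens[i]
     feats = {
         "lower": w.lower(),
         "suffix3": w[-3:],
         "has_digit": any(ch.isdigit() for ch in w),
         "is_capitalized": w[:1].isupper(),
         "codeword": codebook.get(w.lower(), -1),   # -1 for words outside the codebook
     }
     if i > 0:
         feats["prev_lower"] = tokens[i - 1].lower()
     if i + 1 < len(tokens):
         feats["next_lower"] = tokens[i + 1].lower()
     return feats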

Related Papers

Collobert, ICML 2008 explores the advantages of deep artificial neural networks over shallow classifiers such as SVMs for handling various NLP tasks in a unified framework. That framework was adapted for this paper's approach.

Mann, ACL 2008 also explores using semi-supervised learning to extract useful information from features (as opposed to samples), but does so in a generalized expectation criteria framework.