Kuksa and Qi, SIAM 2010
Latest revision as of 18:56, 31 October 2010
Citation
Kuksa, P. and Qi, Y. Semi-supervised bio-named entity recognition with word-codebook learning. Proceedings of the Tenth SIAM International Conference on Data Mining. 2010.
Online Version
https://zeno.siam.org/proceedings/datamining/2010/dm10_003_kuksap.pdf
Summary
This paper describes an approach for generating new features in an unsupervised or semi-supervised manner to improve named entity recognition (NER) for biomedical texts. Biomedical research texts contain many rare entities that are absent from the training set, which makes dictionary features less useful and calls for very careful feature selection and more complex models. The technique described here, word-codebook learning (WCL), leverages unlabeled text to learn from the typical contexts of a given word or target class. For instance, in the unsupervised approach, the method determines which words are likely to appear in which contexts, i.e., which phrases are natural. Words occurring in similar contexts are then clustered, and each cluster is represented by a single vector, the codeword. The codewords constitute a codebook, which is used as an additional input feature to a classification model. WCL enables the use of much simpler models while maintaining similar accuracy.
One WCL approach, the Language Model (LM) approach, is completely unsupervised and groups words by the similarity of their contexts. A multi-layer neural network is trained that first maps words into a vector space (the embedding step). A sliding window of words is then fed through the layers of the network to produce a scalar score. The network is trained to minimize an objective function that penalizes scoring a phrase (the window of words with the key word in the middle) similarly to a modified phrase in which the key word is replaced by a random one. The original and modified phrases provide the positive and negative examples for training the artificial neural network (ANN).
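The training signal of the Language Model approach can be sketched as follows. This is a minimal illustration, not the paper's implementation: the hinge form with a margin of 1 is an assumption borrowed from common practice for this kind of ranking objective, and the names corrupt_window and ranking_loss, the toy vocabulary, and the example scores are all hypothetical.

```python
import random


def corrupt_window(window, vocabulary, rng):
    """Return a copy of `window` whose middle word is replaced by a
    randomly chosen different word (the negative example)."""
    mid = len(window) // 2
    candidates = [w for w in vocabulary if w != window[mid]]
    corrupted = list(window)
    corrupted[mid] = rng.choice(candidates)
    return corrupted


def ranking_loss(score_real, score_corrupt, margin=1.0):
    """Assumed hinge-style pairwise loss: it is zero once the real window
    outscores the corrupted window by at least `margin`."""
    return max(0.0, margin - score_real + score_corrupt)


rng = random.Random(0)
window = ["high", "expression", "of", "p51a", "in"]
vocabulary = ["high", "expression", "of", "p51a", "in", "table", "seven"]

# Positive example: the observed window. Negative example: same window
# with the key (middle) word swapped for a random one.
negative = corrupt_window(window, vocabulary, rng)

# In the real method the scores come from the neural network; here they
# are placeholder numbers that only illustrate how the loss behaves.
loss_when_separated = ranking_loss(score_real=2.0, score_corrupt=0.5)
loss_when_similar = ranking_loss(score_real=0.2, score_corrupt=0.1)
```

The loss is positive only when the two windows receive similar scores, which is the behavior the summary attributes to the objective.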
The other approach, Self Learned Label Patterns (SLLP), is semi-supervised and tries to estimate the probability of a word having a certain label, given that it is the middle word in a sliding window. This approach is more task-oriented than the fully unsupervised approach. The probabilities are estimated from unlabeled data, with the class labels guessed by a classifier trained on a separate labeled set. Word-level estimates alone are insufficient for rare words, so context words are also included. These context words can be very informative, such as in "high --expression-- of --p51a--". To include them, the model is augmented either by looking for the presence of significant nearby words (boundary model) or by looking at the labels around the word in question (n-gram model). The boundary model looks at only one neighboring word at a time, whereas the n-gram model looks at several. As in the LM approach above, the result is a feature vector for each word, in this case a vector of class probabilities.
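The word-level part of SLLP can be illustrated with simple relative-frequency counts. This is only a sketch under assumptions: the function name estimate_label_probs, the GENE/O label set, and the toy auto-tagged sentences are hypothetical, and the boundary and n-gram context models are not shown.

```python
from collections import Counter, defaultdict


def estimate_label_probs(tagged_sentences, labels):
    """Estimate P(label | word) by relative frequency over text whose
    labels were guessed by a previously trained classifier."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, label in sentence:
            counts[word][label] += 1
    probs = {}
    for word, label_counts in counts.items():
        total = sum(label_counts.values())
        # One probability per label, in the fixed order given by `labels`.
        probs[word] = [label_counts[label] / total for label in labels]
    return probs


# Toy stand-in for unlabeled text after automatic tagging (hypothetical).
auto_tagged = [
    [("high", "O"), ("expression", "O"), ("of", "O"), ("p51a", "GENE")],
    [("p51a", "GENE"), ("binds", "O"), ("dna", "O")],
    [("the", "O"), ("p51a", "O"), ("band", "O")],
]

label_probs = estimate_label_probs(auto_tagged, ["GENE", "O"])
```

Each word ends up with a vector of class probabilities, which plays the same role as the embedding vector in the Language Model approach.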
The feature vectors for all words are then clustered, and each word is assigned the label of its cluster. This codeword is provided as an additional input feature for an NER classifier.
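The clustering step can be sketched with plain k-means. Note that this summary does not say which clustering algorithm the paper uses, so k-means is an assumption here, and build_codebook, the toy vectors, and the number of codewords are hypothetical.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(vector, centroids):
    """Index of the centroid closest to `vector`."""
    return min(range(len(centroids)),
               key=lambda k: squared_distance(vector, centroids[k]))


def build_codebook(word_vectors, num_codewords, iterations=20, seed=0):
    """Cluster word vectors with k-means (an assumed choice) and return
    the centroids (codewords) plus each word's codeword index."""
    rng = random.Random(seed)
    words = sorted(word_vectors)
    centroids = [list(word_vectors[w])
                 for w in rng.sample(words, num_codewords)]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for w in words:
            clusters[nearest(word_vectors[w], centroids)].append(word_vectors[w])
        for k, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    codeword_of = {w: nearest(word_vectors[w], centroids) for w in words}
    return centroids, codeword_of


# Toy 2-dimensional feature vectors (hypothetical values).
word_vectors = {
    "p51a": [0.9, 0.1],
    "brca1": [0.8, 0.2],
    "of": [0.0, 1.0],
    "the": [0.1, 0.9],
}
codebook, codeword_of = build_codebook(word_vectors, num_codewords=2)
```

Words with similar vectors share a codeword index, and that index is what gets passed to the NER classifier as a discrete feature.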
The method was tested on the BioCreative II corpus. The extra features were generated from PubMed abstracts and were then input, along with fairly minimal morphological and contextual features (no dictionaries or POS tags), into a CRF classifier. The method outperformed a baseline CRF model that used self-training rather than WCL, and it performed comparably to other entrants in the competition that used much more complex feature sets and hierarchical models. The greatest improvements are shown for short gene names (which are likely rarer) and when multiple codebooks are combined (the LM and SLLP models together).
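To make the feature setup concrete, the following sketch builds a per-token feature dictionary that combines a few simple morphological and contextual features with the codeword. The specific features, their names, and the token_features helper are hypothetical; the paper's exact feature set is not listed in this summary.

```python
def token_features(tokens, i, codeword_of):
    """Build a feature dict for tokens[i]: minimal morphological and
    contextual features plus the word's codeword (illustrative only)."""
    word = tokens[i]
    lower = word.lower()
    return {
        # Morphological features
        "lower": lower,
        "suffix3": lower[-3:],
        "has_digit": any(ch.isdigit() for ch in word),
        "is_capitalized": word[:1].isupper(),
        # Contextual features: neighboring words, with boundary markers
        "prev": tokens[i - 1].lower() if i > 0 else "<s>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "</s>",
        # Codebook feature: cluster index of the word, or UNK if unseen
        "codeword": str(codeword_of.get(lower, "UNK")),
    }


sentence = ["High", "expression", "of", "p51a"]
example_codebook = {"p51a": 7, "of": 2}  # hypothetical codeword indices
features = [token_features(sentence, i, example_codebook)
            for i in range(len(sentence))]
```

One such dictionary per token would then be passed to the CRF, so the codeword acts as one more discrete feature alongside the others.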
Related Papers
Collobert, ICML 2008 explores the advantages of using deep artificial neural networks rather than shallow classifiers such as SVMs to handle various NLP tasks in a unified framework. That framework was adapted for this paper's approach.
Mann, ACL 2008 also explores using semi-supervised learning to extract useful information from features (as opposed to samples), but does so in a generalized expectation framework.