Clinical IE Project F10
Proposal Summary
Here I propose to apply semi-supervised learning to named entity recognition (NER) in medical text, specifically for diseases and treatments. Publicly available annotated medical text is relatively scarce due to high expert annotation costs and confidentiality restrictions, so it makes sense to combine a smaller labeled data set with a larger unlabeled one in an attempt to improve NER performance.
In the biomedical domain, there is much freely available annotation for bioscience terms (genes, proteins), less for medical terms (diseases, medications, symptoms, etc.) in medical research papers, and even less for medical terms in clinical texts such as patient history reports. Due to data availability, this project will train exclusively on medical terms in medical research papers. However, it will then explore augmenting the training set with unlabeled data, drawn either from similar research paper samples or from medical records. Thus, I will also explore cross-domain semi-supervised learning. Though both data sources mention diseases, their syntax and language features tend to differ, with clinical texts being much more informal and inconsistent. This presents a major challenge and may adversely affect performance, though it is not clear to what degree. The work could be useful both for opening up additional data sets for clinical NER applications and as an exploration of semi-supervised learning in a new domain.
The annotated medical research paper data comes from two sources: the Berkeley BioText Project [1] and the recently released Arizona Disease Corpus [2]. The Berkeley data set contains 3500 samples with entities labeled for diseases and treatments, along with labeled relations. The Arizona set includes 2800 sentences labeled for disease mentions. Additional unannotated research reports may be pulled from MEDLINE/PubMed. Unannotated clinical data comes from the 2007 Computational Medicine Center Challenge [3] and potentially also from the i2b2 NLP shared tasks [4]. Those clinical data sets do carry annotations, but for document classification tasks rather than in a form directly usable for NER.
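To make these mixed corpora usable by a single tagger, each source would be converted to a common token-level representation. Below is a minimal Python sketch of BIO tagging for disease and treatment mentions; the conversion function and the example sentence are illustrative assumptions, not the actual release formats of [1] or [2].

```python
# A minimal sketch of the token-level representation the NER models would use.
# Corpus file formats are assumed for illustration; the BioText and Arizona
# Disease Corpus releases each use their own annotation formats.

def spans_to_bio(tokens, entity_spans):
    """Convert (start, end, label) token-index spans into BIO tags.

    entity_spans: list of (start, end, label) with end exclusive,
    e.g. [(2, 5, "DISEASE")] for a three-token disease mention.
    """
    tags = ["O"] * len(tokens)
    for start, end, label in entity_spans:
        tags[start] = "B-" + label
        for i in range(start + 1, end):
            tags[i] = "I-" + label
    return tags


if __name__ == "__main__":
    tokens = "Patients with type 2 diabetes received metformin .".split()
    spans = [(2, 5, "DISEASE"), (6, 7, "TREATMENT")]
    for tok, tag in zip(tokens, spans_to_bio(tokens, spans)):
        print(f"{tok}\t{tag}")
```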
Semi-supervised learning will be explored using both generative models such as HMMs and conditional models such as CRFs and MEMMs. The generative models fit naturally into a semi-supervised setting but are less suited to modeling complex language features, whereas conditional models handle rich features well but are more difficult to extend beyond strictly supervised learning. Relevant methods have been explored in [5,6,7,8,9]. Evaluation will come from testing on held-out data from the two labeled sets [1,2].
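As one concrete illustration of the conditional-model side, the sketch below shows a simple self-training loop around a CRF tagger, in the spirit of [8]: train on the labeled set, tag the unlabeled pool, and fold back only sentences tagged with high confidence. The library choice (sklearn-crfsuite), the feature set, and the confidence threshold are assumptions for illustration rather than settled design decisions.

```python
import sklearn_crfsuite


def token_features(tokens, i):
    """Tiny feature set; a real run would add affixes, POS tags, wider context."""
    tok = tokens[i]
    return {
        "lower": tok.lower(),
        "is_title": tok.istitle(),
        "is_digit": tok.isdigit(),
        "suffix3": tok[-3:],
        "prev": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }


def featurize(sentences):
    return [[token_features(toks, i) for i in range(len(toks))] for toks in sentences]


def self_train(labeled_sents, labeled_tags, unlabeled_sents, rounds=3, threshold=0.9):
    """Self-training: repeatedly retrain a CRF, absorbing confidently tagged
    sentences from the unlabeled pool (parameter values are illustrative)."""
    X, y = featurize(labeled_sents), [list(tags) for tags in labeled_tags]
    pool = list(unlabeled_sents)
    crf = None
    for _ in range(rounds):
        crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1,
                                   max_iterations=100)
        crf.fit(X, y)
        if not pool:
            break
        X_pool = featurize(pool)
        preds = crf.predict(X_pool)                 # most likely tag sequences
        marginals = crf.predict_marginals(X_pool)   # per-token label distributions
        remaining = []
        for sent, feats, tags, margs in zip(pool, X_pool, preds, marginals):
            # Accept a sentence only if every token's predicted tag is confident.
            if min(m[t] for m, t in zip(margs, tags)) >= threshold:
                X.append(feats)
                y.append(tags)
            else:
                remaining.append(sent)
        pool = remaining
    return crf
```

Gating on the least confident token in a sentence, rather than its average confidence, is one conservative way to keep noisy self-labels out of the training set; a generative HMM variant would instead re-estimate its parameters directly on the unlabeled data (e.g., via Baum-Welch) starting from a supervised initialization.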
Team Members
James Dang
super powers: super sleepy
Sources
- [1] Rosario, B. and Hearst, M. Classifying semantic relations in bioscience text. Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL 2004), Barcelona, July 2004.
- [2] Leaman, R., Miller, C., and Gonzalez, G. Enabling recognition of diseases in biomedical text with machine learning: corpus and benchmark. Proceedings of the 3rd International Symposium on Languages in Biology and Medicine (2009), 82–89.
- [3] Pestian, J. et al. A Shared task involving multi-label classification of clinical free text. BioNLP 2007: Biological, translational, and clinical language processing, 97–104.
- [4] Uzuner, O., Goldstein, I., Luo, Y., and Kohane, I. Identifying patient smoking status from medical discharge records. J Am Med Inform Assoc. 2008; 15(1):15-24.
- [5] Jiao, F., Wang, S., Lee, C.H., Greiner, R., and Schuurmans, D. Semi-supervised conditional random fields for improved sequence segmentation and labeling. Proceedings of the 21st International Conference on Computational Linguistics. (2006) 209-216.
- [6] Chapelle, O., Schölkopf, B., and Zien, A. Semi-Supervised Learning. MIT Press, 2006.
- [7] Kuksa, P. and Qi, Y. Semi-supervised bio-named entity recognition with word-codebook learning. Proceedings of the Tenth SIAM International Conference on Data Mining (2010).
- [8] Liao, W. and Veeramachaneni, S. A simple semi-supervised algorithm for named entity recognition. Proceedings of the NAACL HLT Workshop on Semi-supervised Learning for Natural Language Processing (2009), 58–65.
- [9] Nadeau, D. Semi-supervised named entity recognition: learning to recognize 100 entity types with little supervision. Ph.D. thesis, University of Ottawa, November 2007.