Difference between revisions of "Clinical IE Project F10"

 
== Proposal Summary ==
 
Electronic health records are emerging as an economically crucial domain with a variety of information extraction tasks. Unlike biomedical text IE, a well-studied problem, clinical records are less clean and use more informal syntax. Despite the push for standardization, free-text records remain an essential form of communication, conveying nuances that may be excluded or biased when forced into standard forms. Furthermore, data is scarce: annotation requires valuable expert time, and patient confidentiality presents additional hurdles. Recently, data has been made available in various open competitions, and we propose to analyze one such set related to classification of radiology reports with symptom/disease tags.
  
The 2007 Computational Medicine Center challenge provided a set of 1954 clinical records (976 held out for test) for the task of labeling each with one or several of 45 insurance classification codes (ICD-9-CM codes) [1]. Each sample typically contains 2-5 phrases or sentences and was manually labeled by expert consensus. An example record is of the form:
  
* clinical history: History of hydronephrosis of the left kidney with bilateral vesicoureteral reflux.  
* radiology impression: Stable moderately severe left-sided hydronephrosis and hydroureter.
* codes: 591, 593.5, 593.70
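
For concreteness, here is a minimal sketch of how such a record and its code assignments might be represented for a multi-label classifier. The field names and the truncated code inventory are illustrative only, not the actual CMC distribution format.

<pre>
# Illustrative sketch: representing a CMC-style record and its ICD-9-CM labels.
# Field names and the code inventory are invented for this example; the real
# inventory would contain all 45 candidate codes.

CODE_INVENTORY = ["591", "593.5", "593.70"]

record = {
    "clinical_history": "History of hydronephrosis of the left kidney "
                        "with bilateral vesicoureteral reflux.",
    "radiology_impression": "Stable moderately severe left-sided "
                            "hydronephrosis and hydroureter.",
    "codes": {"591", "593.5", "593.70"},
}

def binarize_codes(codes, inventory=CODE_INVENTORY):
    """Turn the set of assigned codes into a 0/1 indicator vector
    (one position per code), the usual multi-label target encoding."""
    return [1 if c in codes else 0 for c in inventory]

print(binarize_codes(record["codes"]))  # -> [1, 1, 1]
</pre>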
 
  
The entrants applied a variety of approaches, including hand-coded rules and learning-based classifiers. On average they achieved an F1 score of 0.77, and the best submission scored 0.89.
  
One of the most fundamental challenges is capturing the breadth of the expert knowledge involved in deciding labels. The ICD-9 codes come with classification guidelines which, along with common medical dictionaries, can be leveraged to obtain synonyms for normalizing tokens and constructing features. However, rare terms might still be missed.
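
As a rough illustration of this normalization step, the sketch below maps variant tokens onto canonical forms via a lookup table. The toy synonym table is invented for illustration; a real one would be built from the ICD-9-CM guidelines and medical dictionaries.

<pre>
# Toy sketch of dictionary-based synonym normalization. The table below is
# invented; in practice it would be derived from ICD-9-CM descriptions and
# medical dictionaries, and rare or unlisted terms would still be missed.

SYNONYMS = {
    "renal": "kidney",
    "nephric": "kidney",
    "pneumonitis": "pneumonia",
    "ureteral": "ureter",
}

def normalize(tokens):
    """Map each lowercased token to its canonical synonym if one is known."""
    return [SYNONYMS.get(t.lower(), t.lower()) for t in tokens]

print(normalize(["Renal", "pelvis", "pneumonitis"]))
# -> ['kidney', 'pelvis', 'pneumonia']
</pre>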
  
Another important IE task is to extract modifiers from the context of the keywords. Negation plays a crucial role, as it can completely reverse the decision. Some negations are obvious (“No pneumonia”) whereas others are less so (“Right middle and probable right lower lobe pneumonia” – “probable” negates “right lower” but not “right middle”). Ambiguity and conjecture would typically have to be detected and excluded (“Findings most consistent with right lower lobe round pneumonia. Follow up x-ray to assess resolution is recommended.”).
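
A much-simplified sketch in the spirit of the regex approach of [2] is shown below, assuming a small illustrative trigger list and a fixed token window; the real algorithm uses a far larger trigger list plus pseudo-negation and scope-termination rules.

<pre>
import re

# Highly simplified NegEx-style sketch: a finding is treated as negated if a
# negation trigger occurs within a few tokens before it. The trigger list and
# window size are illustrative only.

NEGATION_TRIGGERS = r"\b(no|denies|without|negative for|ruled out)\b"
WINDOW = 5  # max tokens between trigger and finding

def is_negated(text, finding):
    text = text.lower()
    for m in re.finditer(NEGATION_TRIGGERS, text):
        following = text[m.end():].split()
        if finding.lower() in " ".join(following[:WINDOW]):
            return True
    return False

print(is_negated("No focal pneumonia identified.", "pneumonia"))  # True
print(is_negated("Right middle lobe pneumonia.", "pneumonia"))    # False
</pre>

A rule this simple would of course mislabel the harder hedged examples above, which is part of the motivation for comparing against sequential models.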
  
This project will therefore involve a series of tasks. The input data will have to be parsed, stemmed, and tokenized, perhaps with stop words removed. We will need to build up medical term synonym dictionaries and apply them. A negation detection algorithm will be developed (perhaps using sequential models such as MEMMs) and compared to the predominant regex algorithm in use [2]. It will be difficult to train parameters for some features, due to the large space of possible label-token combinations and the somewhat limited data set; this may necessitate some sort of background model for smoothing, or innovative features that span multiple classes. We may explore a method like that of [3], where examining false negatives led to automatic feature creation. These IE challenges, along with the emerging importance of clinical informatics, should provide ample motivation for this project.
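
A minimal sketch of the preprocessing portion of this pipeline, assuming NLTK is available (the tokenizer, stemmer, and stop-word list are one reasonable choice among several, not a fixed design decision):

<pre>
# Preprocessing sketch using NLTK: tokenize, lowercase, drop stop words and
# punctuation, and stem. Requires the 'punkt' and 'stopwords' data packages,
# e.g. nltk.download('punkt'); nltk.download('stopwords').

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

STEMMER = PorterStemmer()
STOPWORDS = set(stopwords.words("english"))

def preprocess(text):
    tokens = nltk.word_tokenize(text.lower())
    return [STEMMER.stem(t) for t in tokens
            if t not in STOPWORDS and any(c.isalpha() for c in t)]

print(preprocess("Stable moderately severe left-sided hydronephrosis and hydroureter."))
</pre>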
  
== Comments ==
  
No project partner yet, but looking for one!
  
== Sources ==
  
* [1] Pestian, J. et al. A Shared Task Involving Multi-label Classification of Clinical Free Text. BioNLP 2007: Biological, translational, and clinical language processing, pages 97–104, Prague, June 2007.
* [2] Chapman WW, Bridewell W, Hanbury P, Cooper GF, Buchanan BG. A simple algorithm for identifying negated findings and diseases in discharge summaries. J Biomed Inform 2001;34(5):301-10.
* [3] Farkas, R. and Szarvas, G. Automatic construction of rule-based ICD-9-CM coding systems. BMC Bioinformatics (2008), 9 (Suppl 3):S10.

Revision as of 12:47, 8 October 2010

== Proposal Summary ==

Here I propose to apply semi-supervised learning to named entity recognition in medical text, specifically for diseases and treatments. Publicly available annotated medical text is relatively scarce due to high expert annotation costs and confidentiality restrictions. Thus it makes sense to leverage a smaller labeled dataset together with a larger unlabeled one to try to improve NER performance.

In the biomedical domain, there is much freely available annotation for bioscience terms (genes, proteins), less for medical terms (diseases, medications, symptoms, etc.) in medical research papers, and even less for medical terms in clinical texts such as patient history reports. Due to data availability, this project will train exclusively on medical terms in medical research papers. However, it will then explore augmenting the training set with unlabeled data, either from similar research paper samples or from medical records. Thus, I will also explore cross-domain semi-supervised learning. Though both data sources mention diseases, the syntax and language features tend to differ, with clinical texts being much more informal and inconsistent. This presents a major challenge and may adversely affect performance, though it is not clear how badly. The work could help open up extra datasets for clinical NER applications and serve as an exploration of semi-supervised learning in a new domain.

The annotated medical research paper data comes from two sources: the Berkeley Biotext Project and the recently released Arizona Disease Corpus (ADC). The Berkeley dataset contains 3500 samples with entities labeled for diseases and treatments, along with labeled relations. The ADC set includes 2800 sentences labeled for disease mentions. Additional unannotated research reports may be pulled from MEDLINE/PubMed. Unannotated clinical data comes from the 2007 Computational Medicine Center Challenge and potentially also from the i2b2 NLP shared tasks. Those clinical datasets were annotated, but for classification tasks rather than in a form directly useful for NER.
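
For concreteness, the sketch below shows one way annotated sentences with character-offset disease mentions (the format used by corpora such as the ADC) could be converted into token/BIO-label pairs for a sequence tagger; the example sentence and offsets are invented.

<pre>
# Sketch: converting a sentence with character-offset disease annotations into
# token/BIO-label pairs for NER training. Example sentence and offsets are
# invented for illustration; real corpora differ in tokenization details.

def to_bio(sentence, spans):
    """spans: list of (start, end) character offsets of disease mentions."""
    pairs, pos = [], 0
    for tok in sentence.split():
        start = sentence.index(tok, pos)
        end = start + len(tok)
        pos = end
        label = "O"
        for s, e in spans:
            if start >= s and end <= e:
                label = "B-DISEASE" if start == s else "I-DISEASE"
        pairs.append((tok, label))
    return pairs

sent = "Patients with type 2 diabetes were excluded ."
print(to_bio(sent, [(14, 29)]))
# -> [..., ('type', 'B-DISEASE'), ('2', 'I-DISEASE'), ('diabetes', 'I-DISEASE'), ...]
</pre>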

Semi-supervised learning will be explored using both generative models such as HMMs and conditional models such as CRFs and MEMMs. Generative models fit naturally into a semi-supervised setting but are less well suited to modeling complex language features, whereas conditional models are more difficult to extend away from strictly supervised learning. Relevant methods have been explored in several of the works cited under Sources.
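
One concrete possibility is a simple self-training loop, sketched below; the tagger interface is a hypothetical stand-in for whichever HMM/MEMM/CRF implementation ends up being used, not any particular library's API.

<pre>
# Self-training sketch for semi-supervised NER. The tagger is assumed to
# expose fit(), plus a hypothetical predict_with_confidence() returning a
# label sequence and a per-sentence confidence; this interface is a stand-in,
# not a specific library's API.

def self_train(tagger, labeled, unlabeled, rounds=5, threshold=0.9):
    """labeled: list of (tokens, labels) pairs; unlabeled: list of token lists."""
    train = list(labeled)
    pool = list(unlabeled)
    for _ in range(rounds):
        tagger.fit([x for x, _ in train], [y for _, y in train])
        added, remaining = [], []
        for sent in pool:
            labels, conf = tagger.predict_with_confidence(sent)
            if conf >= threshold:
                added.append((sent, labels))  # trust only confident predictions
            else:
                remaining.append(sent)
        if not added:
            break
        train.extend(added)
        pool = remaining
    return tagger
</pre>

For the cross-domain setting, the confidence threshold would likely need to be more conservative, since clinical sentences differ substantially from the research-paper training data.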

== Team Members ==

James Dang

== Sources ==

* Rosario, B. and Hearst, M. Classifying semantic relations in bioscience text. Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL 2004), Barcelona, July 2004.
* Leaman, R., Miller, C., and Gonzalez, G. Enabling recognition of diseases in biomedical text with machine learning: corpus and benchmark. Proceedings of the 3rd International Symposium on Languages in Biology and Medicine (2009), 82–89.
* Pestian, J. et al. A shared task involving multi-label classification of clinical free text. BioNLP 2007: Biological, translational, and clinical language processing, 97–104.
* Uzuner, O., Goldstein, I., Luo, Y., and Kohane, I. Identifying patient smoking status from medical discharge records. J Am Med Inform Assoc. 2008; 15(1):15–24.
* Jiao, F., Wang, S., Lee, C.H., Greiner, R., and Schuurmans, D. Semi-supervised conditional random fields for improved sequence segmentation and labeling. Proceedings of the 21st International Conference on Computational Linguistics (2006), 209–216.
* Chapelle, O., Schölkopf, B., and Zien, A. Semi-Supervised Learning. MIT Press, 2006.
* Kuksa, P. and Qi, Y. Semi-supervised bio-named entity recognition with word-codebook learning. Proceedings of the Tenth SIAM International Conference on Data Mining (2010).
* Liao, W. and Veeramachaneni, S. A simple semi-supervised algorithm for named entity recognition. Proceedings of the NAACL HLT Workshop on Semi-supervised Learning for Natural Language Processing (2009), 58–65.
* Nadeau, D. Semi-supervised named entity recognition: learning to recognize 100 entity types with little supervision. Ph.D. thesis, University of Ottawa, November 2007.