Clinical IE Project F10

From Cohen Courses
Revision as of 14:08, 29 September 2010 by PastStudents

Electronic health records are emerging as an economically crucial domain with a variety of information extraction (IE) tasks. Unlike biomedical text, for which IE is a well-studied problem, clinical records are less clean and use more informal syntax. Despite the push for standardization, free-text records remain an essential form of communication, conveying nuances that may be excluded or biased when forced into standard forms. Furthermore, data is scarce: annotation requires valuable expert time, and patient confidentiality presents additional hurdles. Recently, data has been made available through various open competitions, and we propose to analyze one such set, in which radiology reports are classified with symptom/disease tags.

The 2007 Computational Medicine Center challenge provided a set of 1954 clinical records (976 held out for test) for the task of labeling each record with one or more of 45 insurance classification codes (ICD-9-CM codes) [1]. Each sample typically contains 2-5 phrases or sentences. The records were manually labeled by expert consensus. An example record is of the form:

clinical history: History of hydronephrosis of the left kidney with bilateral vesicoureteral reflux. radiology impression: Stable moderately severe left - sided hydronephrosis and hydroureter. codes: 591, 593.5, 593.70
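A minimal sketch of how a record in the form above might be split into its fields. It assumes that the labels "clinical history:", "radiology impression:" and "codes:" always appear in this order and act as delimiters, which is an assumption about the data layout rather than something the challenge specifies here; `parse_record` and `FIELD_PATTERN` are hypothetical names.

```python
import re

# Field labels are copied from the example record. Treating them as fixed,
# ordered delimiters is an assumption about the data format.
FIELD_PATTERN = re.compile(
    r"clinical history:\s*(?P<history>.*?)\s*"
    r"radiology impression:\s*(?P<impression>.*?)\s*"
    r"codes:\s*(?P<codes>.*)$",
    re.IGNORECASE | re.DOTALL,
)


def parse_record(text):
    """Split one record into history text, impression text, and a code list."""
    match = FIELD_PATTERN.search(text.strip())
    if match is None:
        raise ValueError("record does not match the assumed layout")
    # Codes are kept as strings so that "593.70" is not collapsed to 593.7.
    codes = [c.strip() for c in match.group("codes").split(",") if c.strip()]
    return {
        "history": match.group("history"),
        "impression": match.group("impression"),
        "codes": codes,
    }


EXAMPLE = (
    "clinical history: History of hydronephrosis of the left kidney with "
    "bilateral vesicoureteral reflux. radiology impression: Stable moderately "
    "severe left - sided hydronephrosis and hydroureter. "
    "codes: 591, 593.5, 593.70"
)
record = parse_record(EXAMPLE)
print(record["codes"])
```

Keeping the codes as strings matters because trailing zeros in ICD-9-CM codes are significant.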

The entrants applied a variety of approaches, including hand-coded rules and learning-based classifiers. The average F1 score was 0.77, and the best submission scored 0.89.
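For reference, here is a sketch of how an F1 score can be computed for a multi-label task like this one. The text above does not say which averaging scheme the scores use; micro-averaging (pooling true positives, false positives, and false negatives over all records) is assumed here, and the gold and predicted label sets are made-up toy data.

```python
def micro_f1(gold, predicted):
    """Micro-averaged F1 over parallel lists of per-record code sets."""
    tp = fp = fn = 0
    for gold_codes, pred_codes in zip(gold, predicted):
        gold_set, pred_set = set(gold_codes), set(pred_codes)
        tp += len(gold_set & pred_set)   # codes correctly predicted
        fp += len(pred_set - gold_set)   # codes predicted but not in gold
        fn += len(gold_set - pred_set)   # gold codes that were missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data for illustration only, not taken from the challenge set.
gold = [{"591", "593.5", "593.70"}, {"486"}]
predicted = [{"591", "593.70"}, {"486", "786.2"}]
score = micro_f1(gold, predicted)
print(score)
```

In this toy case one gold code is missed and one spurious code is predicted, so precision and recall are both 3/4.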

One of the most fundamental challenges is capturing the breadth of the expert knowledge involved in deciding labels. The ICD-9 codes provide guidelines for classification which, along with some common medical dictionaries, can be leveraged to obtain synonyms for normalizing tokens and constructing features. However, rare terms might be missed.
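The synonym normalization described above could look roughly like the following sketch. The `SYNONYMS` table is a hypothetical hand-written stand-in; in the project it would be derived from the ICD-9 guidelines and medical dictionaries. Terms that are absent from the table pass through unchanged, which is how rare terms end up being missed.

```python
import re

# Hypothetical synonym table: surface variant -> canonical term.
SYNONYMS = {
    "vur": "vesicoureteral reflux",
    "vesico-ureteral reflux": "vesicoureteral reflux",
    "uti": "urinary tract infection",
}


def build_pattern(synonyms):
    """Compile one alternation over all variants, longest variant first."""
    variants = sorted(synonyms, key=len, reverse=True)
    alternation = "|".join(re.escape(v) for v in variants)
    return re.compile(r"\b(?:" + alternation + r")\b")


def normalize(text, synonyms=SYNONYMS):
    """Lowercase the text and replace known variants with canonical terms."""
    pattern = build_pattern(synonyms)
    return pattern.sub(lambda m: synonyms[m.group(0)], text.lower())


normalized = normalize("History of VUR and UTI.")
print(normalized)
```

Sorting the variants longest-first keeps a short abbreviation from matching before a longer multi-word variant that starts with the same characters.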

Another important IE task is extracting modifiers from the context of key words. Negation plays a crucial role because it can completely reverse the decision. Some negations are obvious (“No pneumonia”), whereas others are subtler (“Right middle and probable right lower lobe pneumonia”, where “probable” negates “right lower” but not “right middle”). Ambiguity and conjecture would typically have to be detected and excluded as well (“Findings most consistent with right lower lobe round pneumonia. Follow up x-ray to assess resolution is recommended.”).
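To make the obvious case concrete, here is a deliberately simplified trigger-and-window negation check. It is only a sketch in the spirit of regex-based approaches, not a reimplementation of any published algorithm: the trigger list, the five-token window, and the function name `is_negated` are all illustrative assumptions. It does not handle hedges such as “probable”, and it does not respect sentence boundaries.

```python
import re

# Illustrative trigger phrases only; a real system would use a curated list.
NEGATION_TRIGGERS = ["no", "without", "denies", "negative for", "no evidence of"]


def is_negated(sentence, finding, window=5):
    """Return True if a trigger occurs at most `window` tokens before `finding`."""
    triggers = "|".join(
        re.escape(t) for t in sorted(NEGATION_TRIGGERS, key=len, reverse=True)
    )
    pattern = re.compile(
        r"\b(?:" + triggers + r")\b"              # a negation trigger
        + r"(?:\W+\w+){0," + str(window) + r"}?"  # up to `window` tokens between
        + r"\W+" + re.escape(finding.lower()) + r"\b"  # then the finding itself
    )
    return pattern.search(sentence.lower()) is not None


print(is_negated("No pneumonia", "pneumonia"))
print(is_negated("Right middle and probable right lower lobe pneumonia", "pneumonia"))
```

The second call returns False because the sketch has no notion of uncertainty, which is exactly the kind of case a learned sequential model would be expected to handle better.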

This project will therefore involve a series of tasks. The input data will have to be parsed, tokenized, and stemmed, perhaps with stop words removed. We will need to build medical-term synonym dictionaries and apply them. A negation detection algorithm will be developed (perhaps using sequential models such as MEMMs) and compared to the predominant regex-based algorithm in use [2]. Training parameters for some features will be difficult because the space of possible label-token combinations is large and the data set is somewhat limited. This may necessitate some sort of background model for smoothing, or innovative features that span multiple classes. We may also explore a method like that of [3], in which examining false negatives led to automatic feature creation. These IE challenges, along with the emerging importance of clinical informatics, should provide ample motivation for this project.
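One possible reading of "background model for smoothing" is linear interpolation between a per-label token distribution and a corpus-wide one, so that a token never seen with a label still gets nonzero probability. The sketch below assumes that interpretation; the interpolation weight `lam`, the function name, and the toy documents are hypothetical and not part of the proposal.

```python
from collections import Counter


def smoothed_token_probs(docs, lam=0.5):
    """Build p(token | label) interpolated with a background distribution.

    docs: list of (tokens, labels) pairs.
    lam:  weight given to the background (corpus-wide) model.
    """
    background = Counter()
    per_label = {}
    for tokens, labels in docs:
        background.update(tokens)
        for label in labels:
            per_label.setdefault(label, Counter()).update(tokens)
    total = sum(background.values())

    def prob(token, label):
        label_counts = per_label.get(label, Counter())
        label_total = sum(label_counts.values())
        p_label = label_counts[token] / label_total if label_total else 0.0
        p_background = background[token] / total if total else 0.0
        return (1 - lam) * p_label + lam * p_background

    return prob


# Toy documents for illustration only.
docs = [
    (["cough", "fever"], ["786.2", "780.6"]),
    (["cough", "pneumonia"], ["486"]),
]
prob = smoothed_token_probs(docs, lam=0.5)
print(prob("fever", "486"), prob("pneumonia", "486"))
```

Here "fever" never co-occurs with code 486, yet it receives probability 0.5 × 1/4 from the background model instead of zero.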

Notes: no project partner yet, but looking for one!

[1] Pestian, J. et al. A Shared Task Involving Multi-label Classification of Clinical Free Text. BioNLP 2007: Biological, translational, and clinical language processing, pages 97–104, Prague, June 2007.

[2] Chapman WW, Bridewell W, Hanbury P, Cooper GF, Buchanan BG. A simple algorithm for identifying negated findings and diseases in discharge summaries. J Biomed Inform 2001;34(5):301-10.

[3] Farkas, R. and Szarvas, G. Automatic construction of rule-based ICD-9-CM coding systems. BMC Bioinformatics (2008), 9 (Suppl 3):S10.