Identifying Abbreviations in Biomedical Text

Idea

Abbreviations, acronyms and synonyms are heavily used in biomedical literature to name genes, diseases, biological processes and more. Recognizing short or alternative name forms and mapping them to their full (long) forms is important for fully understanding scientific text. In information extraction tasks, recognizing abbreviated forms can substantially increase recall. The task is especially challenging because abbreviations are often reused (for example, names of genes and systems are shared across species) and because researchers often do not adhere to standard naming conventions. In this project we wish to provide a model that links abbreviated or short-form biomedical terms to their full terms and recognizes abbreviations that may refer to more than one entity.

Currently used approaches only recognize abbreviations within a single sentence that contains both the short and long form. The goal here is to suggest the most probable long-form even when it appears elsewhere in the document (or in related documents).

Approach

Recognize candidate <short-form, long-form> pairs from text
Extract possible long-form versions for each of the abbreviated short-forms
Suggest the most probable long-form of each abbreviation in a set of documents. The base assumption is that within a single document an abbreviation refers to only one long-form, even if it has other possible long-forms in other, even closely related, contexts.
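The third step can be sketched as a simple counting scheme. This is only an illustration under stated assumptions, not the model proposed here: the function name `resolve_long_forms`, the tuple layouts, and the fallback to corpus-wide counts are all hypothetical choices. It applies the one-long-form-per-document assumption by preferring long-forms seen in the same document and only then falling back to the rest of the document set.

```python
from collections import Counter, defaultdict


def resolve_long_forms(candidate_pairs, mentions):
    """Pick one long-form per (document, short-form) mention.

    candidate_pairs: iterable of (doc_id, short_form, long_form) tuples,
        e.g. the output of a pair extractor (hypothetical input format).
    mentions: iterable of (doc_id, short_form) tuples to resolve.
    Returns a dict mapping (doc_id, short_form) to a long-form or None.
    """
    per_doc = defaultdict(Counter)   # counts within a single document
    corpus = defaultdict(Counter)    # counts over the whole document set
    for doc_id, short, long_form in candidate_pairs:
        key = long_form.lower()      # merge case variants of a long-form
        per_doc[(doc_id, short)][key] += 1
        corpus[short][key] += 1

    resolved = {}
    for doc_id, short in mentions:
        # Prefer evidence from the same document; otherwise fall back to
        # the most frequent long-form seen anywhere in the corpus.
        counts = per_doc.get((doc_id, short)) or corpus.get(short)
        resolved[(doc_id, short)] = (
            counts.most_common(1)[0][0] if counts else None
        )
    return resolved
```

Raw frequency is the simplest possible score; a real model would likely also weigh how closely related the other documents are.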

Team

Dana Movshovitz-Attias

Data

Abbreviations Dataset

MEDSTRACT is a collection of automatically extracted acronym pairs from MEDLINE databases. The data includes:

Gold Standard Data: 400 abstracts including abbreviations.
I wasn't able to find the gold standard labeling of abbreviation pairs so I annotated the data myself. The annotation includes pairs of <abbreviation, full form name> that appear in each abstract.

Corpus

Full length documents taken from:

PubMed Central open access archive documents

So far I have found no labeled data for the long-forms of abbreviations in full documents, so I may have to label some manually. Alternatively, I can treat the MEDSTRACT gold standard list as a "complete" list of known abbreviations and ignore all others in the corpus.

Baseline

A simple algorithm for identifying abbreviation definitions in biomedical text by A. S. Schwartz and M. A. Hearst

I implemented the baseline method and tested it on the Medstract dataset.
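For reference, the core of the Schwartz and Hearst method can be sketched roughly as follows. This is a simplified Python rendering written for illustration, not the implementation that was tested here: it handles only the "long form (SF)" pattern, and the names `find_best_long_form` and `extract_pairs` as well as the exact short-form validity checks are assumptions of this sketch.

```python
import re


def find_best_long_form(short_form, candidate):
    """Match short-form characters right to left against the candidate text.

    Returns the shortest suffix of `candidate` that contains the
    short-form characters in order, with the first short-form character
    at the start of a word, or None if there is no such match.
    """
    s = len(short_form) - 1
    l = len(candidate) - 1
    while s >= 0:
        ch = short_form[s].lower()
        if not ch.isalnum():          # skip punctuation in the short form
            s -= 1
            continue
        # Move left until the character matches; the first short-form
        # character must additionally sit at the beginning of a word.
        while (l >= 0 and candidate[l].lower() != ch) or (
            s == 0 and l > 0 and candidate[l - 1].isalnum()
        ):
            l -= 1
        if l < 0:
            return None
        l -= 1
        s -= 1
    start = candidate.rfind(" ", 0, l + 1) + 1
    return candidate[start:]


def extract_pairs(sentence):
    """Return (short_form, long_form) pairs for 'long form (SF)' patterns."""
    pairs = []
    for match in re.finditer(r"\(([^()]+)\)", sentence):
        short = match.group(1).strip()
        # Assumed validity checks: 2-10 characters, at most two words,
        # starts with a letter or digit, contains at least one letter.
        if not 2 <= len(short) <= 10 or len(short.split()) > 2:
            continue
        if not short[0].isalnum() or not any(c.isalpha() for c in short):
            continue
        # Search window: at most min(|A| + 5, |A| * 2) preceding words.
        words = sentence[: match.start()].split()
        max_words = min(len(short) + 5, len(short) * 2)
        candidate = " ".join(words[-max_words:])
        long_form = find_best_long_form(short, candidate)
        if long_form:
            pairs.append((short, long_form))
    return pairs


print(extract_pairs("Recent studies of the heat shock protein (HSP) family"))
# prints [('HSP', 'heat shock protein')]
```

Because the sketch only looks inside the sentence that contains the parenthesized short form, it illustrates the single-sentence limitation described in the Idea section.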

Results on Medstract Data

precision 88%
recall 87%
f1 87%
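Scores of this kind are commonly computed by exact matching of extracted pairs against the annotated pairs. The sketch below shows that calculation; the function name `score_pairs` and the exact-match criterion are assumptions, and the page does not state how matches were counted for the numbers above.

```python
def score_pairs(predicted, gold):
    """Precision, recall and F1 over sets of (short_form, long_form) pairs."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)   # pairs found in both sets
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

With exact matching, a long-form that differs from the annotation by a single word counts as both a false positive and a false negative.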

Related Work

Baseline method: A simple algorithm for identifying abbreviation definitions in biomedical text by A. S. Schwartz and M. A. Hearst
Hybrid text mining for finding abbreviations and their definitions by Youngja Park and Roy J. Byrd
An Automatic Identification and Resolution System for Protein-Related Abbreviations in Scientific Papers by Paolo Atzeni, Fabio Polticelli and Daniele Toti
Mapping Abbreviations to Full Forms in Biomedical Articles by Hong Yu, George Hripcsak and Carol Friedman

Comments from William

Nice to see this coming along! --Wcohen 20:53, 22 September 2011 (UTC)
