Difference between revisions of "Cohen Courses:Dmovshov abbreviations"

From Cohen Courses
Line 4: Line 4:
  
 
== Idea ==
Abbreviations, synonyms and acronyms are heavily used in biomedical literature, for describing names of genes, diseases, biological processes and more. Recognizing short or alternative name forms and mapping them to the full form is important to the full understanding of the scientific text. In the context of information extraction tasks, recognizing abbreviated forms can lead to a great increase in recall. This task is especially challenging since abbreviations are often reused, for example, names of genes and systems are shared across species, and since researchers often do not adhere to standard naming conventions. In this project we wish to provide a model for linking an abbreviated or short form biomedical terms to full terms as well as recognize abbreviations that may relate to more than a single entity.
+
Abbreviations, synonyms and acronyms are heavily used in biomedical literature, for describing names of genes, diseases, biological processes and more. Recognizing short or alternative name forms and mapping them to the full (long) form is important to the full understanding of scientific text. In the context of information extraction tasks, recognizing abbreviated forms can lead to a great increase in recall. This task is especially challenging since abbreviations are often reused, for example, names of genes and systems are shared across species, and since researchers often do not adhere to standard naming conventions. In this project we wish to provide a model for linking an abbreviated or short form biomedical terms to full terms as well as recognize abbreviations that may relate to more than a single entity.
 +
 
 +
=== Challenges ===
 +
* Recognize candidate ''<short-form, long-form>'' pairs from text
 +
* Extract possible long-form versions for each of the abbreviated short-forms
 +
* Suggest the most probable long-form of an abbreviation based on the content of a given text document (the base assumption will be that in a single document an abbreviation may only refer to a single long-form - even if it may have many more possible long-forms in other, even closely related, context).
  
 
== Team ==
Line 13: Line 18:
 
The data includes:
:* [http://medstract.com/index.php?f=gold-standard Gold Standard Data]: Sentences including abbreviations.
:* [http://medstract.com/index.php?f=gold-result Gold Standard Results]: Pairs of abbreviation and full form name, that appear in the data.
+
:* [http://medstract.com/index.php?f=gold-result Gold Standard Results]: Pairs of <abbreviation, full form name>, that appear in the data.
  
 
== Related Work ==
:* Possible Baseline: [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.87.7304&rep=rep1&type=pdf A simple algorithm for identifying abbreviation definitions in biomedical text] by A. S. Schwartz and M. A. Hearst
:* Candidate pairs extraction algorithm: [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.28.8821&rep=rep1&type=pdf Hybrid text mining for finding abbreviations and their definitions] by Youngja Park and Roy J. Byrd
+
:* Candidate-pairs extraction algorithm: [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.28.8821&rep=rep1&type=pdf Hybrid text mining for finding abbreviations and their definitions] by Youngja Park and Roy J. Byrd
 
:* [http://www.springerlink.com/content/8x017850p6473r82/ An Automatic Identification and Resolution System for Protein-Related Abbreviations in Scientific Papers] by Paolo Atzeni, Fabio Polticelli and Daniele Toti
:* [http://171.67.114.118/content/9/3/262.abstract Mapping Abbreviations to Full Forms in Biomedical Articles] by Hong Yu, George Hripcsak and Carol Friedman

Revision as of 10:55, 12 September 2011

Course Page

Identifying Abbreviations in Biomedical Text

Idea

Abbreviations, synonyms and acronyms are heavily used in biomedical literature to name genes, diseases, biological processes and more. Recognizing short or alternative name forms and mapping them to the full (long) form is important for fully understanding scientific text. In information extraction tasks, recognizing abbreviated forms can greatly increase recall. The task is especially challenging because abbreviations are often reused (for example, names of genes and systems are shared across species) and because researchers often do not adhere to standard naming conventions. In this project we wish to provide a model that links abbreviated or short-form biomedical terms to their full terms and recognizes abbreviations that may relate to more than one entity.

Challenges

  • Recognize candidate <short-form, long-form> pairs from text
  • Extract possible long-form versions for each of the abbreviated short-forms
  • Suggest the most probable long-form of an abbreviation based on the content of a given text document. The base assumption is that within a single document an abbreviation refers to only one long-form, even if it has many more possible long-forms in other, even closely related, contexts.
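
The three challenges above can be sketched in a few lines of Python. This is only an illustrative sketch, not the project's model: the function names (candidate_pairs, build_inventory, resolve) are hypothetical, the pair extractor uses a deliberately crude initial-letter heuristic that is much simpler than the published algorithms, and the word-overlap scoring used to choose a long form is an assumption made for the example.

```python
import re
from collections import Counter, defaultdict

# A parenthesised token of 2-10 characters starting with a letter, e.g. "(HSF)".
SHORT_FORM_RE = re.compile(r"\(([A-Za-z][A-Za-z0-9]{1,9})\)")
SENTENCE_RE = re.compile(r"(?<=[.!?])\s+")


def tokens(text):
    """Lower-cased alphanumeric tokens of a text."""
    return re.findall(r"[a-z0-9]+", text.lower())


def candidate_pairs(sentence):
    """Challenge 1: return (short_form, long_form) pairs found in a sentence.

    Assumed heuristic: the long form is the run of words right before the
    parenthesis whose initials spell the short form (hyphens are dropped).
    """
    pairs = []
    for match in SHORT_FORM_RE.finditer(sentence):
        short = match.group(1)
        preceding = re.findall(r"[A-Za-z0-9]+", sentence[:match.start()])
        window = preceding[-len(short):]
        initials = "".join(word[0] for word in window)
        if len(window) == len(short) and initials.lower() == short.lower():
            pairs.append((short, " ".join(window)))
    return pairs


def build_inventory(documents):
    """Challenge 2: map short form -> long form -> words of defining documents."""
    inventory = defaultdict(lambda: defaultdict(Counter))
    for text in documents:
        words = tokens(text)
        for sentence in SENTENCE_RE.split(text):
            for short, long_form in candidate_pairs(sentence):
                inventory[short][long_form.lower()].update(words)
    return inventory


def resolve(short, document, inventory):
    """Challenge 3: pick a single long form for `short` in `document`.

    A definition inside the document wins (one long form per document);
    otherwise pick the known long form whose defining documents share the
    most words with this document.
    """
    local = Counter()
    for sentence in SENTENCE_RE.split(document):
        for found_short, long_form in candidate_pairs(sentence):
            if found_short == short:
                local[long_form.lower()] += 1
    if local:
        return local.most_common(1)[0][0]
    candidates = inventory.get(short)
    if not candidates:
        return None
    doc_words = set(tokens(document))
    return max(candidates,
               key=lambda lf: sum(candidates[lf][w] for w in doc_words))
```

A possible use: build the inventory from documents that define "AR" in different ways, then call resolve("AR", text, inventory) on a document that uses "AR" without defining it. A realistic system would replace the initial-letter heuristic with a stronger pair extractor and filter stop words before scoring.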

Team

Dana Movshovitz-Attias

Dataset

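MEDSTRACT is a collection of automatically extracted acronym pairs from MEDLINE databases. The data includes:

  • Gold Standard Data: Sentences including abbreviations.
  • Gold Standard Results: Pairs of <abbreviation, full form name> that appear in the data.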

Related Work