Difference between revisions of "Cohen Courses:Dmovshov abbreviations"

From Cohen Courses
 
(12 intermediate revisions by 2 users not shown)
Line 6: Line 6:
 
Abbreviations, synonyms and acronyms are heavily used in biomedical literature to name genes, diseases, biological processes and more. Recognizing short or alternative name forms and mapping them to the full (long) form is important for fully understanding scientific text. In information extraction tasks, recognizing abbreviated forms can lead to a great increase in recall. The task is especially challenging because abbreviations are often reused (for example, names of genes and systems are shared across species) and because researchers often do not adhere to standard naming conventions. In this project we wish to provide a model for linking abbreviated or short-form biomedical terms to their full terms, as well as for recognizing abbreviations that may relate to more than a single entity.
  
Currently used approaches only recognize abbreviations within a single sentence that contains both the short and long form. The goal here is to suggest the most probable long-form even when it appears elsewhere in the document (or in related documents).
+
The current baseline by Schwartz and Hearst extracts abbreviation pairs that are mentioned in close proximity in text. The goal of this project is to provide a more robust extraction model, as well as a model that can estimate the likelihood that a pair of short-form and long-form mentions is an abbreviation pair. Such a model can be used to evaluate abbreviation pairs even when they are not mentioned in close proximity in a document, and to select the most likely pair when given several probable options.
  
=== Approach ===
+
<!--Currently used approaches only recognize abbreviations within a single sentence that contains both the short and long form. The goal here is to suggest the most probable long-form even when it appears elsewhere in the document (or in related documents).-->
* Recognize candidate ''<short-form, long-form>'' pairs from text
+
 
* Extract possible long-form versions for each of the abbreviated short-forms
+
== Approach ==
* Suggest the most probable long-form of each abbreviation in a set of documents (the base assumption will be that in a single document an abbreviation may only refer to a single long-form, even if it may have many more possible long-forms in other, even closely related, context).
+
The abbreviation extraction process is done in two steps:
 +
# Recognize candidate ''<short-form, long-form>'' pairs from text.
 +
# Extract possible long-form versions for each of the abbreviated short-forms.
 +
 
 +
In this project I will try to improve both the recognition and extraction steps. The main emphasis will be on developing an extraction model. Given a candidate long form and short form pair, the model will provide the most likely alignment of the two, attempting to match (align) short form letters to their long form equivalent positions. I will use some type of sequential model, with one of the main considerations being that in some cases it is desirable for the model to relate to an entire word but in others it is better to relate to a single letter/symbol at a time.
  
 
== Team ==
 
Line 21: Line 25:
 
[http://medstract.com/ MEDSTRACT] is a collection of automatically extracted acronym pairs from MEDLINE databases.
 
 
The data includes:
 
:* [http://medstract.com/index.php?f=gold-standard Gold Standard Data]: Sentences including abbreviations.
+
:* [http://medstract.com/index.php?f=gold-standard Gold Standard Data]: 400 abstracts including abbreviations.
:* [http://medstract.com/index.php?f=gold-result Gold Standard Results]: Pairs of ''<abbreviation, full form name>'' that appear in the data.
+
:* I wasn't able to find the gold standard labeling of abbreviation pairs so I annotated the data myself. The annotation includes pairs of ''<abbreviation, full form name>'' that appear in each abstract.
:*:* The data does not seem completely coherent and may need to be "cleaned".
+
<!--:* [http://medstract.com/index.php?f=gold-result Gold Standard Results]: Pairs of ''<abbreviation, full form name>'' that appear in the data.-->
 +
 
  
 
=== Corpus ===  
 
Line 30: Line 35:
  
 
So far found no data for long-forms of abbreviations in full documents - may have to manually label some. Alternatively, can use the MEDSTRACT gold standard list as a "complete" list of known abbreviations and ignore all others in the corpus.
 
 +
 +
== Baseline==
 +
[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.87.7304&rep=rep1&type=pdf A simple algorithm for identifying abbreviation definitions in biomedical text] by A. S. Schwartz and M. A. Hearst
 +
 +
I implemented the baseline method and tested it on the Medstract dataset.
 +
 +
===Results on Medstract Data===
 +
*Precision: 88%
 +
*Recall: 87%
 +
* F1: 87%
  
 
== Related Work ==
 
:* Possible Baseline (only for sentences that contain both the short and long form): [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.87.7304&rep=rep1&type=pdf A simple algorithm for identifying abbreviation definitions in biomedical text] by A. S. Schwartz and M. A. Hearst
+
 
 +
=== Recognizing abbreviations===
 +
:* Baseline method: [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.87.7304&rep=rep1&type=pdf A simple algorithm for identifying abbreviation definitions in biomedical text] by A. S. Schwartz and M. A. Hearst
 
:* [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.28.8821&rep=rep1&type=pdf Hybrid text mining for finding abbreviations and their definitions] by Youngja Park and Roy J. Byrd
 
 
:* [http://www.springerlink.com/content/8x017850p6473r82/ An Automatic Identification and Resolution System for Protein-Related Abbreviations in Scientific Papers] by Paolo Atzeni, Fabio Polticelli and Daniele Toti
 
 
:* [http://171.67.114.118/content/9/3/262.abstract Mapping Abbreviations to Full Forms in Biomedical Articles] by Hong Yu, George Hripcsak and Carol Friedman
 
 +
 +
=== String alignment===
 +
* [[RelatedPaper::Ristad_and_Yianilos_1997_Learning_String_Edit_Distance | "Learning String Edit Distance" by Ristad and Yianilos, 1997]]
 +
* [[RelatedPaper::Bilenko_and_Mooney_2003_Adaptive_duplicate_detection_using_learnable_string_similarity_measures | "Adaptive duplicate detection using learnable string similarity measures" by Bilenko and Mooney, ACM SIGKDD, 2003]]
 +
* Bellare and Pereira - "A CRF for discriminatively-trained finite-state string edit distance"
 +
 +
== Comments from William ==
 +
 +
Nice to see this coming along!  --[[User:Wcohen|Wcohen]] 20:53, 22 September 2011 (UTC)
 +
 +
== More Comments from William ==
 +
 +
Nice to see the baseline results already here! --[[User:Wcohen|Wcohen]] 14:38, 11 October 2011 (UTC)

Latest revision as of 09:38, 11 October 2011

Course Page

Identifying Abbreviations in Biomedical Text

Idea

Abbreviations, synonyms and acronyms are heavily used in biomedical literature to name genes, diseases, biological processes and more. Recognizing short or alternative name forms and mapping them to the full (long) form is important for fully understanding scientific text. In information extraction tasks, recognizing abbreviated forms can lead to a great increase in recall. The task is especially challenging because abbreviations are often reused (for example, names of genes and systems are shared across species) and because researchers often do not adhere to standard naming conventions. In this project we wish to provide a model for linking abbreviated or short-form biomedical terms to their full terms, as well as for recognizing abbreviations that may relate to more than a single entity.

The current baseline by Schwartz and Hearst extracts abbreviation pairs that are mentioned in close proximity in text. The goal of this project is to provide a more robust extraction model, as well as a model that can estimate the likelihood that a pair of short-form and long-form mentions is an abbreviation pair. Such a model can be used to evaluate abbreviation pairs even when they are not mentioned in close proximity in a document, and to select the most likely pair when given several probable options.


Approach

The abbreviation extraction process is done in two steps:

  1. Recognize candidate <short-form, long-form> pairs from text.
  2. Extract possible long-form versions for each of the abbreviated short-forms.

In this project I will try to improve both the recognition and extraction steps. The main emphasis will be on developing an extraction model. Given a candidate long form and short form pair, the model will provide the most likely alignment of the two, attempting to match (align) short form letters to their long form equivalent positions. I will use some type of sequential model, with one of the main considerations being that in some cases it is desirable for the model to relate to an entire word but in others it is better to relate to a single letter/symbol at a time.
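To make the idea of aligning short-form letters to long-form positions concrete, here is a minimal sketch. It is not the sequential model proposed above; it uses a simple right-to-left character matching heuristic in the spirit of the Schwartz and Hearst baseline, and the function name `align_short_to_long` is a hypothetical one chosen for illustration.

```python
from typing import List, Optional


def align_short_to_long(short_form: str, long_form: str) -> Optional[List[int]]:
    """Map each alphanumeric short-form character to a long-form index.

    Illustrative heuristic only (an assumption, not the project's model):
    scan both strings from right to left and match characters greedily.
    Returns None when some short-form character cannot be matched.
    """
    positions: List[int] = []
    l_index = len(long_form) - 1
    for s_index in range(len(short_form) - 1, -1, -1):
        ch = short_form[s_index].lower()
        if not ch.isalnum():
            continue  # skip punctuation such as hyphens in the short form
        # Move left until the characters match. The first short-form
        # character must additionally sit at the start of a long-form word.
        while l_index >= 0 and (
            long_form[l_index].lower() != ch
            or (s_index == 0 and l_index > 0 and long_form[l_index - 1].isalnum())
        ):
            l_index -= 1
        if l_index < 0:
            return None
        positions.append(l_index)
        l_index -= 1
    positions.reverse()
    return positions


print(align_short_to_long("HMM", "hidden Markov model"))
```

A greedy matcher like this always commits to the rightmost matching character, which is one reason a learned alignment model that can choose between word-level and character-level matches may do better.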

Team

Dana Movshovitz-Attias

Data

Abbreviations Dataset

MEDSTRACT is a collection of automatically extracted acronym pairs from MEDLINE databases. The data includes:

  • Gold Standard Data: 400 abstracts including abbreviations.
  • I wasn't able to find the gold standard labeling of abbreviation pairs so I annotated the data myself. The annotation includes pairs of <abbreviation, full form name> that appear in each abstract.


Corpus

Full length documents taken from:

So far found no data for long-forms of abbreviations in full documents - may have to manually label some. Alternatively, can use the MEDSTRACT gold standard list as a "complete" list of known abbreviations and ignore all others in the corpus.

Baseline

A simple algorithm for identifying abbreviation definitions in biomedical text by A. S. Schwartz and M. A. Hearst

I implemented the baseline method and tested it on the Medstract dataset.
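As a rough illustration of the candidate-recognition stage of such a baseline (not the implementation that produced the results on this page), the sketch below collects parenthesized short forms and a window of preceding words as the candidate long form. The function name `candidate_pairs` is hypothetical, and the short-form filters and the window size of min(|A| + 5, 2|A|) words follow my reading of the Schwartz and Hearst heuristics, so treat them as assumptions.

```python
import re
from typing import List, Tuple


def candidate_pairs(sentence: str) -> List[Tuple[str, str]]:
    """Return (short form, candidate long-form window) pairs from a sentence.

    Only the "long form (SF)" pattern is handled in this sketch.
    """
    pairs: List[Tuple[str, str]] = []
    for match in re.finditer(r"\(([^()]+)\)", sentence):
        short_form = match.group(1).strip()
        # Assumed plausibility filters for a short form: 2-10 characters,
        # at most two words, starts with a letter or digit, has a letter.
        if not (2 <= len(short_form) <= 10):
            continue
        if len(short_form.split()) > 2 or not short_form[0].isalnum():
            continue
        if not any(c.isalpha() for c in short_form):
            continue
        words = sentence[: match.start()].split()
        # Assumed search window: min(|A| + 5, 2 * |A|) words before "(".
        window = min(len(short_form) + 5, len(short_form) * 2)
        if words:
            pairs.append((short_form, " ".join(words[-window:])))
    return pairs


print(candidate_pairs("We trained a hidden Markov model (HMM) on the data."))
```

The window deliberately over-generates; a later step is expected to trim it down to the actual long form.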

Results on Medstract Data

  • Precision: 88%
  • Recall: 87%
  • F1: 87%
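
For reference, scores of this kind are typically computed over sets of extracted pairs. The sketch below assumes exact matching of <abbreviation, full form name> pairs against the annotation; the matching criterion actually used for the numbers above is not stated on this page, and the function name `precision_recall_f1` is hypothetical.

```python
from typing import Set, Tuple

Pair = Tuple[str, str]


def precision_recall_f1(predicted: Set[Pair], gold: Set[Pair]) -> Tuple[float, float, float]:
    """Compute precision, recall and F1 under exact pair matching (assumed)."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy data for illustration only; these are not pairs from the Medstract set.
gold = {("HMM", "hidden Markov model"), ("PCR", "polymerase chain reaction")}
predicted = {("HMM", "hidden Markov model"), ("PCR", "polymerase chain")}
print(precision_recall_f1(predicted, gold))
```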

Related Work

Recognizing abbreviations

String alignment

Comments from William

Nice to see this coming along! --Wcohen 20:53, 22 September 2011 (UTC)

More Comments from William

Nice to see the baseline results already here! --Wcohen 14:38, 11 October 2011 (UTC)