Difference between revisions of "Cohen and Hersh Briefings in Bioinformatics 2005"

Revision as of 23:49, 30 September 2010

Citation

Aaron M. Cohen and William R. Hersh. 2005. A Survey of Current Work in Biomedical Text Mining. Briefings in Bioinformatics. Vol 6. No 1. 57-71.

Online version

oxfordjournals

Summary

This paper surveys the state of biomedical text mining as of 2005.

The authors describe the state of the art at that time for each of the text-mining tasks listed below.

Named entity recognition
- Problems
  - No complete dictionary for most types of biological named entities
  - Ambiguous words and phrases
  - Multiple names for the same entity
- Approaches
  - Lexicon-based
  - Rule-based
    - AbGene system of Tanabe and Wilbur
    - GAPSCORE system
  - Statistically based
- Performance
  - Overall, gene and protein NER systems achieve F-scores between 75 and 85 percent.
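
To make the lexicon-based approach and the F-score figure concrete, here is a minimal sketch in Python. It is not taken from the survey or from any of the systems named above: the toy lexicon, the example sentence, the gold annotations, and the function names `tag_genes` and `f_score` are all assumptions made for illustration. Scoring uses exact span matches, which is one common convention but not the only one.

```python
import re

# Hypothetical toy lexicon (an assumption for this sketch); real gene and
# protein dictionaries are much larger and still incomplete.
GENE_LEXICON = {"BRCA1", "TP53", "PTEN"}

def tag_genes(text, lexicon=GENE_LEXICON):
    """Return the set of (start, end, token) spans whose token is in the lexicon."""
    spans = set()
    for match in re.finditer(r"[A-Za-z0-9]+", text):
        if match.group() in lexicon:
            spans.add((match.start(), match.end(), match.group()))
    return spans

def f_score(predicted, gold):
    """Balanced F-score: harmonic mean of precision and recall over exact spans."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)

# Hypothetical sentence and hand-made gold annotations.
text = "Mutations in BRCA1 and MMAC1 were observed alongside TP53."
predicted = tag_genes(text)
gold = {(13, 18, "BRCA1"), (23, 28, "MMAC1"), (53, 57, "TP53")}
print(f_score(predicted, gold))
```

In this toy case the tagger finds BRCA1 and TP53 but misses MMAC1, an alternative name for PTEN that is absent from the lexicon, so precision is 1.0, recall is 2/3, and the F-score comes out at about 0.8. This shows in miniature how incomplete dictionaries and multiple names limit a purely lexicon-based tagger.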

Text classification

Synonym and abbreviation extraction

Relationship extraction

Hypothesis generation

Integration frameworks

Related papers

The widely cited Pang et al. EMNLP 2002 paper was influenced by Turney's work, but considers supervised learning techniques. The choice of movie reviews as the domain was suggested by the (relatively) poor performance of Turney's method on movies.

An interesting follow-up paper is Turney and Littman, TOIS 2003, which focuses on evaluating the use of pointwise mutual information (PMI) to predict the semantic orientation of words.

@@ Line 8: / Line 8: @@
 == Summary ==
+This is a survey paper about biomedical text mining in 2005.
-The paper presents a MEDical Information Extraction (MedIE) system, which extracts patient information from free-text clinical records.
+They describe the state of the art in 2005 for each distinct type of text-mining task below.
-They divided their extraction job into three tasks below.
+* Named entity recognition
-* extraction of medical terms
+** Problems
-* relation extraction
+*** No complete dictionary for most types of biological named entities
-** extraction of associated medical concepts
+*** ambiguous words and phrases
-** e.g. Blood pressure & 144/90 in the sentence, "Blood pressure is 144/90"
+*** multi names
-* text classification
+** approaches
-** e.g. a patient can be classified as a former smoker, a current smoker, or a non-smoker
+*** lexicon based
+*** rule based
+**** AbGene system of Tanabe and Wilbur
+**** GAPSCORE system
+*** statistically based
+** performance
+*** overall, the performance of gene and protein NER systems is F-scores between 75 and 85 percent.
-Their approaches are:
+* Text classification
-* An ontology-based approach for extracting medical terms of interest
-** they used Unified Medical Language System (UMLS)
-** About terms that are not defined in UMLS, they predicted categories of some terms using sentence structures.
-* A graph-based approach which uses the parsing result of link-grammar parser for relation-extraction
-** They included the processing of negation.
-** When the parser fails, they used a pattern-based approach.
-** Because the parser did not process multi-word terms, they replaced the terms with placeholders.
-* an NLP-based feature extraction method coupled with an ID3-based decision tree for text classification
-This approach was fairly successful mostly showing over 80% of precision and recall. However, the system was tested on the data written by only a clinician, which means that the style of free-text records was consistent. Nevertheless, the research is worth in that they applied various IE techniques to the free-text clinical records, explain about the problems they encountered.
+* Synonym and abbreviation extraction
+* Relationship extraction
+* Hypothesis generation
+* Integration frameworks
 == Related papers ==
