Cohen and Hersh Briefings in Bioinformatics 2005

Latest revision as of 00:23, 1 October 2010

Citation

Aaron M. Cohen and William R. Hersh. 2005. A Survey of Current Work in Biomedical Text Mining. Briefings in Bioinformatics. Vol 6. No 1. 57-71.

Online version

oxfordjournals

Summary

This is a survey paper about biomedical text mining in 2005.

  • Named entity recognition
    • Problems
      • No complete dictionary exists for most types of biological named entities
      • Ambiguous words and phrases
      • Multiple names for the same entity
    • Approaches fall into three main categories
      • lexicon based
      • rule based
      • statistically based
    • performance
      • Overall, gene and protein NER systems achieve F-scores between 75 and 85 percent.
  • Text classification
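
As a rough illustration of the lexicon-based approach and of the F-score used to report NER performance above, the following Python sketch matches a sentence against a tiny gene-name dictionary and scores the result. The lexicon, the sentence, the gold annotations, and the function names are all hypothetical; this is not one of the systems covered by the survey.

```python
import re

# Hypothetical toy lexicon; real dictionaries are incomplete, as noted above.
GENE_LEXICON = {"brca1", "tp53", "egfr"}

def lexicon_ner(text, lexicon=GENE_LEXICON):
    """Return the set of tokens in `text` found in the lexicon (case-insensitive)."""
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    return {tok for tok in tokens if tok.lower() in lexicon}

def f_score(predicted, gold):
    """Balanced F-score (harmonic mean of precision and recall) over two sets."""
    true_pos = len(predicted & gold)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(predicted)
    recall = true_pos / len(gold)
    return 2 * precision * recall / (precision + recall)

# Hypothetical sentence and gold-standard annotations.
sentence = "Mutations in BRCA1 and TP53 were observed, but KRAS was unaffected."
gold = {"BRCA1", "TP53", "KRAS"}

found = lexicon_ner(sentence)
score = f_score(found, gold)
print(found, round(score, 2))
```

In this toy case KRAS is missed because it is not in the lexicon, which is the incomplete-dictionary problem listed above: precision is 1.0, recall is 2/3, and the F-score is 0.8.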


  • Synonym and abbreviation extraction
    • Synonym
      • use of dictionaries
      • automatic extraction of gene name synonyms from biomedical free text
      • SVM classifier-based
      • pattern-based
    • abbreviation
      • either the full form or the abbreviation is often enclosed in parentheses.
      • a variety of alignment and scoring methods
  • Relationship extraction
    • detect occurrences of a prespecified type of relationship between a pair of entities of given types
    • manually generated template-based methods
    • automatic template methods
    • statistical methods
    • NLP-based methods

Most of this work concerns relationships between genes and proteins.
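
A manually generated template can be sketched as a regular expression that looks for a prespecified relationship between two entities. The pattern, the verb list, and the example text below are assumptions made for illustration, and entities are approximated as capitalized tokens rather than found by a real NER step.

```python
import re

# Hypothetical hand-written template: "<ENTITY> <verb> <ENTITY>".
# Treating any capitalized alphanumeric token as an entity is a
# simplifying assumption, not a real named entity recognizer.
ENTITY = r"\b([A-Z][A-Za-z0-9-]*)"
TEMPLATE = re.compile(ENTITY + r"\s+(activates|inhibits|binds)\s+" + ENTITY)

def extract_relations(text):
    """Return (entity1, relation, entity2) triples that match the template."""
    return [m.groups() for m in TEMPLATE.finditer(text)]

# Hypothetical input text.
text = "RAF1 activates MEK1. In this pathway MDM2 inhibits TP53."
relations = extract_relations(text)
print(relations)
```

A template like this is precise but brittle: any phrasing outside the listed verbs or word order is missed, which is one motivation for the automatic template, statistical, and NLP-based methods listed above.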

  • Hypothesis generation
    • uncover previously unrecognized relationships that are not stated in the text but can be inferred from other, more explicit relationships
  • Integration frameworks
    • integrated text-mining frameworks
    • still in the research and development phase.
  • The authors' suggestions
    • Access to full text is required
    • Particular applications require additional analytical methods and features
    • Researchers should consider actual users' needs. Good performance on particular metrics does not guarantee user satisfaction.
    • Shared challenge tasks should be continued
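
The idea behind hypothesis generation, inferring a relationship from other, more explicit ones, can be sketched as a simple transitive lookup: if A is linked to B and B is linked to C, but A and C are never linked directly, then (A, C) is a candidate hypothesis. The relationship pairs and the function name below are illustrative assumptions, not data or methods taken from the summary above.

```python
from collections import defaultdict

# Hypothetical explicit relationships, as if already extracted from text.
explicit = {
    ("fish oil", "blood viscosity"),
    ("blood viscosity", "Raynaud's disease"),
    ("fish oil", "platelet aggregation"),
}

def infer_hypotheses(pairs):
    """Return (A, C) pairs linked through some B but never stated directly."""
    neighbors = defaultdict(set)
    for a, b in pairs:
        neighbors[a].add(b)
    hypotheses = set()
    for a, bs in list(neighbors.items()):
        for b in bs:
            # .get avoids creating new keys in the defaultdict while iterating.
            for c in neighbors.get(b, ()):
                if c != a and (a, c) not in pairs:
                    hypotheses.add((a, c))
    return hypotheses

hypotheses = infer_hypotheses(explicit)
print(hypotheses)
```

A real system would also need to rank the candidates, since the number of indirect links grows quickly with the size of the literature.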