Selen project abstract
Relation Extraction applied to articles related to the Immune System
Information extraction methods have been applied to bioinformatics in several ways, such as gene/protein entity recognition and the identification of protein-protein and gene-disease interactions. However, to my knowledge, the interactions between immune system components have received little attention within the information extraction domain.
The most recent work along this line is that of Shen-Orr et al., who link 6 cell types and 38 cytokines (signaling proteins that are crucially important in the immune system) by automatically extracting information from PubMed abstracts. However, they do not represent further knowledge such as up/down regulation, which I believe is more important than just knowing whether a relationship exists.
My goal is to extract information from PubMed abstracts to build a directed network that is comprehensive in the sense that it captures pairwise interactions between various immune system components, such as cytokines (IL-2, IL-4, etc.) and cells (T cells, B cells, macrophages, etc.). The immune system is highly complex in its behavior, and I believe that such a network will be extremely useful for many biologists and for computer scientists working in bioinformatics.
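One possible way to hold such a directed, signed network in memory is sketched below. This is only an illustration: the class name InteractionNetwork, the effect labels "up" and "down", and the example edge are assumptions of mine, not part of any existing resource.

```python
from collections import defaultdict


class InteractionNetwork:
    """Directed network whose edges carry an 'up' or 'down' regulation label."""

    VALID_EFFECTS = ("up", "down")  # assumed label set for this sketch

    def __init__(self):
        # source component -> {target component: effect}
        self._edges = defaultdict(dict)

    def add(self, source, target, effect):
        """Record that `source` regulates `target` in the given direction."""
        if effect not in self.VALID_EFFECTS:
            raise ValueError("effect must be 'up' or 'down'")
        self._edges[source][target] = effect

    def targets(self, source):
        """Return a copy of the outgoing edges of `source`."""
        return dict(self._edges.get(source, {}))


net = InteractionNetwork()
net.add("IL-2", "T cells", "up")  # illustrative edge only
```

Because edges are stored by source, asking for the targets of "T cells" in this example returns an empty mapping, which reflects the directed nature of the network.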
However, there are two problems implicit in this task: (A) how to identify components, and (B) how to extract relations between components.
If we limit the scope of the problem by explicitly using a dictionary of the components, then the problem reduces to determining whether a relation exists between two components and, if so, what type of relation it is and which component acts on the other.
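A minimal sketch of the dictionary-based step, assuming a tiny hand-written dictionary, could look like the following. The COMPONENTS entries and the function names are hypothetical; a real dictionary would be much larger and would need to handle synonyms and spelling variants.

```python
import re
from itertools import permutations

# Hypothetical toy dictionary, for illustration only.
COMPONENTS = {
    "cytokine": ["IL-2", "IL-4"],
    "cell": ["T cells", "B cells", "macrophages"],
}


def find_components(sentence):
    """Return (start, end, surface form, type) for each dictionary hit, in text order."""
    hits = []
    for kind, names in COMPONENTS.items():
        for name in names:
            # Require non-word characters on both sides so that a short
            # name is not matched inside a longer token.
            pattern = r"(?<!\w)" + re.escape(name) + r"(?!\w)"
            for m in re.finditer(pattern, sentence, flags=re.IGNORECASE):
                hits.append((m.start(), m.end(), m.group(0), kind))
    return sorted(hits)


def candidate_pairs(sentence):
    """Return every ordered (source, target) pair of mentions in the sentence.

    Ordered pairs are used because the relation is directed; a classifier
    would then decide, for each pair, whether a relation holds and of what type.
    """
    hits = find_components(sentence)
    return [(a[2], b[2]) for a, b in permutations(hits, 2)]


pairs = candidate_pairs("IL-2 stimulates T cells and B cells.")
```

With the dictionary fixed, identifying components becomes a lookup, and the remaining work is the classification of each candidate pair.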
I plan to use one of the following methods to extract relations: support vector machines (SVMs) or conditional random fields (CRFs). CRFs can be applied to this task both to extract a relation and to identify what kind of relation it is (upregulation or downregulation). I am planning to use a hierarchical CRF, although I do not yet know the type of hierarchy implicit in the problem. SVMs can also be used with a careful choice of kernel and feature set. Although I have not yet decided exactly which features I will use, possible feature sets include proximity to an entity that contains digits, dashes, parentheses, or mixed case, as well as other domain-specific features.
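As a rough illustration of the orthographic features mentioned above, the sketch below computes them for a single token and measures how far a token is from the nearest token that has any of them. The function names and the example sentence are assumptions; the final feature set has not been decided.

```python
def orthographic_features(token):
    """Flag the surface properties that often mark gene/protein names."""
    return {
        "has_digit": any(c.isdigit() for c in token),
        "has_dash": "-" in token,
        "has_paren": "(" in token or ")" in token,
        "mixed_case": any(c.islower() for c in token)
        and any(c.isupper() for c in token),
    }


def is_marked(token):
    """True if the token has at least one of the orthographic properties."""
    return any(orthographic_features(token).values())


def nearest_marked_distance(tokens, index):
    """Distance in tokens from tokens[index] to the closest other marked token.

    Returns None when no other token in the sentence is marked.
    """
    distances = [
        abs(i - index)
        for i, tok in enumerate(tokens)
        if i != index and is_marked(tok)
    ]
    return min(distances) if distances else None


tokens = ["IL-2", "upregulates", "FoxP3", "in", "T", "cells"]
features = orthographic_features(tokens[0])
```

Values like these could be fed to either an SVM or a CRF as part of a larger feature vector.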
Training sets can be obtained from past challenges such as the LLL05 challenge. The method will be evaluated based on its F-measure. It will then be applied to PubMed abstracts that contain the keywords immune response, innate/adaptive immunity, or other related terms, and the results will be compared to those of Shen-Orr et al.
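For reference, the F-measure over extracted relations can be computed as sketched below, treating each relation as a (source, target, type) triple that counts as correct only if it matches a gold triple exactly. The exact-match criterion and the example triples are assumptions for illustration; they are not real results.

```python
def f_measure(predicted, gold, beta=1.0):
    """Return (precision, recall, F-beta) for two collections of relation triples."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    b2 = beta * beta
    f = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f


# Made-up triples, used only to exercise the function.
gold = {
    ("IL-2", "T cells", "up"),
    ("IL-4", "B cells", "up"),
    ("IL-10", "macrophages", "down"),
}
predicted = {
    ("IL-2", "T cells", "up"),
    ("IL-4", "B cells", "down"),  # right pair, wrong type: counted as an error
}
precision, recall, f1 = f_measure(predicted, gold)
```

Under exact matching, a prediction with the right pair but the wrong regulation type is penalized in both precision and recall, which fits the emphasis on up/down regulation.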
I might be working with Siddharth on this project.
Shen-Orr SS, Goldberger O, Garten Y, Rosenberg-Hasson Y, Lovelace PA, Hirschberg DL, Altman RB, Davis MM, Butte AJ. Towards a cytokine-cell interaction knowledgebase of the adaptive immune system. Pac Symp Biocomput. 2009; 439–450.