KeisukeKamataki project abstract
What you plan to do with what data
I'm planning to work on named entity recognition and its extension to relation extraction, using online biomedical journals indexed by Google Scholar. I have already collected about 460 thousand documents (17 GB) of text data.
I have some ideas for the relation extraction:
- 1. Gene ontology modeling: Model the ontology information of each paper with a CRF or HMM and calculate the ontology-level similarity of papers
- 2. Author modeling: Given a paper by an anonymous author, how likely is it that each candidate author wrote it?
- 3. Source Gene info decoding: Given a paper, what gene(s) does the paper likely talk about?
I would like to try at least one of these ideas (or some additional ones). I'll write a more concrete plan in the second proposal.
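As a rough illustration of what "ontology-level similarity" in idea 1 could mean, the sketch below compares two papers by the cosine similarity of their ontology term counts. This is only one possible choice of measure, not a settled part of the plan; the function name and the example terms are made up for illustration.

```python
import math
from collections import Counter


def cosine_similarity(terms_a, terms_b):
    """Cosine similarity between two bags of ontology terms."""
    counts_a, counts_b = Counter(terms_a), Counter(terms_b)
    # Only terms present in both papers contribute to the dot product.
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[t] * counts_b[t] for t in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a paper with no detected terms is similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical ontology terms detected in two papers.
paper_1 = ["apoptosis", "cell cycle", "apoptosis"]
paper_2 = ["apoptosis", "dna repair"]
similarity = cosine_similarity(paper_1, paper_2)
print(round(similarity, 3))
```

The input lists would come from whatever ontology detector is used (CRF, HMM, or a dictionary lookup); the similarity step itself does not depend on that choice.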
Why you think it is interesting
Getting a large amount of online biomedical journal text is becoming easier, but it may not be an easy task to organize it according to practical concepts such as content-based similarity and relations between papers. It might be worth knowing how effectively we can automate such a process with current algorithms.
Any relevant superpowers you might have
I have some programming experience and an understanding of algorithms for generative methods such as HMM and LDA. Also, there are many informative data sources on the web that can help, such as:
How you plan to evaluate your work
- 1. For ontology detection, I am thinking about using an online ontology dictionary to assign ground truth labels to each document. Then we split the data into training and test sets and measure the F-measure. Right now, I don't have a good idea for measuring the quality of the document similarity measure. We may want to use reference (citation) information or some other information that may help the evaluation.
- 2. Author modeling would be simple. We can use prediction accuracy. We may also want to consider the F-measure.
- 3. Gene info decoding would also be easy. I have a database that records which paper is actually related to which gene. So, we can use prediction accuracy and/or the F-measure.
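For reference, the F-measure mentioned above can be computed per document from the predicted and ground truth label sets. The following is a minimal sketch for the gene decoding case, assuming each paper has a set of predicted genes and a set of true genes; the function name and the gene sets are illustrative only.

```python
def precision_recall_f1(predicted, actual):
    """Precision, recall, and F1 for one document's predicted label set."""
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical example: genes predicted for a paper vs. genes in the database.
predicted_genes = {"BRCA1", "TP53", "EGFR"}
actual_genes = {"BRCA1", "TP53", "KRAS", "MYC"}
p, r, f1 = precision_recall_f1(predicted_genes, actual_genes)
print(p, r, f1)
```

Averaging these per-document scores over the test set gives one overall number; pooling the counts across documents before computing the scores is an alternative that weights documents with many labels more heavily.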
What techniques you plan to use
- CRF, MaxEnt, and/or HMM - Micro-level hidden ontology detection
- LDA or Correlated topic modeling - Macro-level topic modeling
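To make the HMM option concrete, here is a small sketch of Viterbi decoding, the standard way to find the most likely hidden tag sequence under an HMM. The two tags, the tokens, and all probabilities are hand-picked toy values for illustration; in the project they would be estimated from labeled data. The sketch assumes a non-empty token sequence and uses a small fixed probability (`unseen`) for tokens a state has never emitted.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p, unseen=1e-6):
    """Return the most likely state sequence for a non-empty observation list."""
    # Each column maps state -> (log-prob of best path ending there, previous state).
    first = observations[0]
    columns = [{
        s: (math.log(start_p[s]) + math.log(emit_p[s].get(first, unseen)), None)
        for s in states
    }]
    for obs in observations[1:]:
        column = {}
        for s in states:
            # Pick the predecessor that maximizes the path score into s.
            prev, score = max(
                ((p, columns[-1][p][0] + math.log(trans_p[p][s])) for p in states),
                key=lambda pair: pair[1],
            )
            column[s] = (score + math.log(emit_p[s].get(obs, unseen)), prev)
        columns.append(column)
    # Backtrack from the best final state.
    path = [max(states, key=lambda s: columns[-1][s][0])]
    for column in reversed(columns[1:]):
        path.append(column[path[-1]][1])
    path.reverse()
    return path


# Toy model: tag each token as a gene mention or not (all values hypothetical).
states = ["OTHER", "GENE"]
start_p = {"OTHER": 0.8, "GENE": 0.2}
trans_p = {
    "OTHER": {"OTHER": 0.8, "GENE": 0.2},
    "GENE": {"OTHER": 0.6, "GENE": 0.4},
}
emit_p = {
    "OTHER": {"the": 0.4, "mutation": 0.3, "in": 0.3},
    "GENE": {"brca1": 0.5, "tp53": 0.5},
}
tags = viterbi(["mutation", "in", "brca1"], states, start_p, trans_p, emit_p)
print(tags)
```

Working in log space avoids numerical underflow on long sentences. A CRF would use the same decoding idea but with learned feature weights in place of the transition and emission probabilities.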
What question you want to answer
- 1.1 How accurately can we detect ontologies with named entities?
- 1.2 Can we make use of the result to analyze the relationship of papers?
- 2. How accurately can we predict author names from the paper alone? Are there any interesting trends among the authorships?
- 3. How accurately can we predict which gene(s) a paper talks about?
If possible, I'd like to try several algorithms for one or each task and compare their performance.
Who you might work with
Although I'm thinking about doing the project alone, I'd be happy to work with anyone. I'm also happy to join another project if they want an additional member.