Sgopal1 Project Abstract

From Cohen Courses

I plan to work on relation extraction, specifically on identifying interactions between proteins given the abstract of a paper. Several papers and standard benchmark datasets are available for comparing different methods. Although I haven’t chosen a particular dataset yet, the options are AIMed, the LLL05 challenge task, and IEPA. LLL05 seems to be a well-parsed dataset that requires almost no pre-processing, but it has fewer than two hundred training/testing examples.

There has been work along different lines:

a. Extracting relations using explicit features. The features are drawn from lexical resources or from syntactic structure, and can also be induced from the dependency parse of the sentence.

b. Defining a kernel function between objects without enumerating the features explicitly. This includes subsequence kernels, string kernels, and tree kernels, all of which are instances of the general convolution kernel. Some kernel approaches also use the feature sets described above to compute similarities between objects.
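
To illustrate how a kernel can compare two sentences without enumerating features, here is a minimal sketch of an all-subsequences kernel computed by dynamic programming. It counts the (possibly non-contiguous) token subsequences that two sentences share, including the empty one. The function name and the toy sentences are assumptions made for illustration; this is not the gap-weighted subsequence kernel used in any particular paper.

```python
def all_subsequences_kernel(s, t):
    """Count pairs of matching (possibly non-contiguous) subsequences of s and t.

    The empty subsequence is included, so the result is always at least 1.
    The feature space (one dimension per possible subsequence) is never built.
    """
    n, m = len(s), len(t)
    # K[i][j] holds the kernel value for the prefixes s[:i] and t[:j].
    K = [[1] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Inclusion-exclusion over pairs that skip s[i-1] or skip t[j-1].
            K[i][j] = K[i - 1][j] + K[i][j - 1] - K[i - 1][j - 1]
            if s[i - 1] == t[j - 1]:
                # Pairs that end by matching s[i-1] with t[j-1].
                K[i][j] += K[i - 1][j - 1]
    return K[n][m]


# Hypothetical tokenized sentences with protein names replaced by placeholders.
a = "PROT1 binds PROT2".split()
b = "PROT1 strongly binds PROT2".split()
similarity = all_subsequences_kernel(a, b)
```

The table has one cell per pair of prefixes, so the cost is proportional to the product of the two sentence lengths, even though the implicit feature space grows exponentially.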

I think that many kernel-based approaches have been tried and evaluated, but latent semantic kernels (LSK) have been left out. It would be useful to investigate whether composite kernels that combine evidence from latent semantic kernels help. It might be useful to apply LSK before learning any other kernel. At a high level, LSK tries to collapse words that share the same meaning. One could argue that we would not need LSK if we had the perfect set of features. Unfortunately, generating features requires intuition and domain knowledge, and in general it is not possible to generate the perfect set of features indicative of a relation between proteins.
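
For reference, one common way to write a latent semantic kernel and a composite kernel is sketched below. The symbols are assumptions introduced here: D is a term-by-document matrix, U_k holds the first k left singular vectors of D, d_1 and d_2 are term vectors, K_other is any other valid kernel, and lambda is a non-negative mixing weight that would have to be tuned.

```latex
\[
D = U \Sigma V^{\top}, \qquad
K_{\mathrm{LSK}}(d_1, d_2) = (U_k^{\top} d_1)^{\top} (U_k^{\top} d_2) = d_1^{\top} U_k U_k^{\top} d_2
\]
\[
K_{\mathrm{comp}}(x, y) = K_{\mathrm{other}}(x, y) + \lambda \, K_{\mathrm{LSK}}(x, y), \qquad \lambda \ge 0
\]
```

Projecting onto the top k singular directions is what merges words that tend to co-occur, and a non-negative weighted sum of two kernels is again a valid kernel.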

On a separate note, as far as I know, no nearest neighbor approach has been applied to such problems. This may be because the defined features are unstable and do not strongly support the nearest neighbor assumption that similar points lie close to each other. I also want to think about where nearest neighbor approaches would fit in.
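
One place a nearest neighbor method could fit is on top of a kernel, since any kernel induces a distance. The sketch below is only an illustration under assumptions: it uses a toy bag-of-words kernel as a stand-in for a real relation kernel, and the function names and training sentences are hypothetical.

```python
import math
from collections import Counter


def bow_kernel(a, b):
    """Toy kernel: dot product of bag-of-words count vectors."""
    ca, cb = Counter(a), Counter(b)
    return sum(ca[w] * cb[w] for w in ca)


def kernel_distance(a, b, kernel=bow_kernel):
    """Distance induced by a kernel: sqrt(K(a,a) - 2 K(a,b) + K(b,b))."""
    squared = kernel(a, a) - 2 * kernel(a, b) + kernel(b, b)
    return math.sqrt(max(squared, 0.0))


def knn_predict(query, labeled, k=3, kernel=bow_kernel):
    """Majority vote among the k labeled examples closest to the query."""
    ranked = sorted(labeled, key=lambda ex: kernel_distance(query, ex[0], kernel))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy data: tokenized sentences labeled with whether they
# describe an interaction between the two placeholder proteins.
train = [
    ("PROT1 binds PROT2".split(), True),
    ("PROT1 interacts with PROT2".split(), True),
    ("PROT1 and PROT2 were measured".split(), False),
]
prediction = knn_predict("PROT1 binds to PROT2".split(), train, k=1)
```

Swapping `bow_kernel` for a subsequence or tree kernel would leave the rest of the sketch unchanged, which makes it easy to check whether a given kernel supports the nearest neighbor assumption.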

The evaluation for this task is fairly straightforward: either I extract the relation between the two proteins or I don’t. Standard measures such as precision, recall, and F1 can be used.
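
These measures can be computed directly from binary decisions over candidate protein pairs. The following is a minimal sketch; the function name and the example labels are made up for illustration and are not drawn from any of the datasets mentioned.

```python
def precision_recall_f1(gold, predicted):
    """Compute precision, recall, and F1 for binary interaction labels.

    gold and predicted are equal-length sequences of booleans, one entry per
    candidate protein pair (True means an interaction is present/extracted).
    """
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g and p)        # correctly extracted
    fp = sum(1 for g, p in pairs if not g and p)    # extracted but wrong
    fn = sum(1 for g, p in pairs if g and not p)    # missed interactions
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Hypothetical gold labels and system output for five candidate pairs.
gold = [True, True, False, False, True]
pred = [True, False, True, False, True]
p, r, f = precision_recall_f1(gold, pred)
```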

I might be interested in working with Selen, from whom I got to know about this area of research.