Miwa 2009 a rich feature vector for protein protein interaction extraction from multiple corpora

Citation

A Rich Feature Vector for Protein-Protein Interaction Extraction from Multiple Corpora, by M. Miwa, R. Sætre, Y. Miyao, and J. Tsujii. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2009.

Online Version

Here is the online version of the paper.

Summary

Because of the importance of protein-protein interaction (PPI) extraction from text, many corpora have been proposed with slightly differing definitions of proteins and PPI. Since no single corpus is large enough to saturate a machine learning system, it is necessary to learn from multiple corpora. In this paper the authors propose extracting PPIs from multiple different corpora. They design a rich feature vector and, as an Inductive Transfer Learning (ITL) method, apply a support vector machine modified for corpus weighting (SVM-CW) in order to evaluate the use of multiple corpora for the PPI extraction task. The authors show that the system with their feature vector is better than, or at least comparable to, the state-of-the-art PPI extraction systems on every corpus. While SVM-CW is simple, it improves the performance of the system more effectively and more efficiently than other methods previously shown to be successful on other NLP tasks.

Brief description of the method

Figure 1: Overview of PPI Extraction System

The target task of the system is sentence-based, pair-wise PPI extraction, which is formulated as a classification problem that judges whether a given pair of proteins in a sentence is interacting or not. Figure 1 shows the overview of the proposed PPI extraction system. As the classifier trained on a single corpus, the 2-norm soft-margin linear SVM (L2-SVM) is used, solved with the dual coordinate descent (DCD) method.
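
To make this formulation concrete, here is a minimal Python sketch (not taken from the authors' system; all names are illustrative) of how a sentence with annotated protein mentions becomes one binary classification instance per protein pair.

    # Minimal sketch of the sentence-based, pair-wise formulation:
    # every unordered pair of protein mentions in a sentence is one
    # binary instance (interacting vs. not interacting).
    from itertools import combinations

    def candidate_pairs(sentence_tokens, protein_positions):
        """Yield one candidate instance per unordered pair of protein mentions."""
        for p1, p2 in combinations(protein_positions, 2):
            yield {
                "tokens": sentence_tokens,   # the sentence containing the pair
                "pair": (p1, p2),            # token indices of the two proteins
                "label": None,               # +1 / -1 is supplied by the corpus
            }

    # A sentence with three protein mentions yields three candidate pairs.
    tokens = "PROT1 binds PROT2 but not PROT3".split()
    assert len(list(candidate_pairs(tokens, [0, 2, 5]))) == 3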

The feature vector contains three types of features, corresponding to three different kernels, each combined with the output of two parsers: Enju 2.3.0 and KSDEP beta 1. This feature vector was chosen because the kernels with these parsers were shown to be effective for PPI extraction by Miwa et al., 2008. Both parsers were retrained using the GENIA Treebank corpus provided by Kim et al., 2003.

Figure 2: Extraction of feature vector from target sentence

Figure 2 summarizes the way in which the feature vector is constructed. The system extracts Bag-of-Words (BOW), shortest path (SP), and graph features from the output of the two parsers. The output is grouped by feature type and parser, and each group of features is separately normalized by its L2-norm. Finally, all values are put into a single feature vector, and the whole feature vector is then also normalized by its L2-norm. The features are constructed using predicate argument structures (PAS) from Enju and dependency trees from KSDEP.
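
The grouping and normalization scheme can be illustrated with a short numpy sketch; the group contents below are dummies, and the only assumption is the one stated in the text: each (feature type, parser) group is L2-normalized separately before the concatenated vector is L2-normalized once more.

    import numpy as np

    def l2_normalize(v, eps=1e-12):
        norm = np.linalg.norm(v)
        return v / norm if norm > eps else v

    def build_feature_vector(groups):
        """groups: one raw feature vector per (feature type, parser) combination."""
        normalized = [l2_normalize(np.asarray(g, dtype=float)) for g in groups]
        full = np.concatenate(normalized)   # single vector over all groups
        return l2_normalize(full)           # normalize the whole vector again

    # e.g. BOW/SP/graph features from Enju and KSDEP -> six groups in total
    vec = build_feature_vector([[1, 2, 0], [0, 3], [4, 0, 0, 1]])
    print(np.linalg.norm(vec))  # 1.0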

Features

Bag-of-Words (BOW) Features

A BOW feature encodes the lemma form of a word, its position relative to the target pair of proteins (Before, Middle, or After), and its frequency in the target sentence. BOW features correspond to the BOW kernel in the original kernel method.
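
As a rough illustration (not the authors' code), the sketch below counts (word, position) pairs for a tokenized sentence, using lowercased surface forms in place of lemmas and the Before/Middle/After split described above.

    from collections import Counter

    def bow_features(tokens, pair):
        """Count (word, position) features relative to the target protein pair."""
        first, second = sorted(pair)
        feats = Counter()
        for i, tok in enumerate(tokens):
            if i in (first, second):
                continue                      # skip the proteins themselves
            if i < first:
                position = "Before"
            elif i < second:
                position = "Middle"
            else:
                position = "After"
            feats[(tok.lower(), position)] += 1
        return feats

    print(bow_features("PROT1 strongly binds PROT2 in vitro".split(), (0, 3)))
    # Counter({('strongly', 'Middle'): 1, ('binds', 'Middle'): 1,
    #          ('in', 'After'): 1, ('vitro', 'After'): 1})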

Shortest Path (SP) Features

SP features include vertex walks (v-walks), edge walks (e-walks), and their subsets on the shortest path between the target pair in a parse structure, and represent the connection between the pair. These features are the subsets of the tree kernels on the shortest path. A v-walk includes two lemmas and the link between them, while an e-walk includes a lemma and its two adjacent links. The links indicate predicate argument relations for PAS, and dependencies for dependency trees.
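
Assuming the shortest path is available as an alternating lemma/link sequence, the following sketch shows how v-walks and e-walks could be enumerated; in the actual system the links come from Enju's predicate argument structures or KSDEP's dependency trees, not from the toy path used here.

    def v_walks(path):
        """Two lemmas and the link connecting them."""
        return [tuple(path[i:i + 3]) for i in range(0, len(path) - 2, 2)]

    def e_walks(path):
        """A lemma and the two links around it."""
        return [tuple(path[i:i + 3]) for i in range(1, len(path) - 2, 2)]

    # lemma, link, lemma, link, lemma ... along the shortest path
    path = ["PROT1", "arg1", "bind", "arg2", "PROT2"]
    print(v_walks(path))  # [('PROT1', 'arg1', 'bind'), ('bind', 'arg2', 'PROT2')]
    print(e_walks(path))  # [('arg1', 'bind', 'arg2')]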

Graph Features

Graph features are made from the all-paths graph kernel proposed by Airola et al., 2008. The kernel represents the target pair using graph matrices based on two subgraphs, and the graph features are all the non-zero elements of those graph matrices. The two subgraphs are a parse structure subgraph (PSS) and a linear order subgraph (LOS). Each subgraph is represented by a graph matrix G as follows:

G = L^\top \Big( \sum_{n=1}^{\infty} A^n \Big) L = L^\top \big( (I - A)^{-1} - I \big) L

where L is an N × ℓ label matrix, A is an N × N edge matrix, N represents the number of vertices, and ℓ represents the number of labels. For more information on this feature and how it is computed, refer to Airola et al., 2008.
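
Under the reconstruction above, the infinite sum of edge-matrix powers has the closed form (I - A)^{-1} - I whenever the series converges (spectral radius of A below 1), so the graph matrix can be computed directly. The numpy sketch below uses a made-up toy graph purely for illustration.

    import numpy as np

    def graph_matrix(A, L_mat):
        """G = L^T ((I - A)^{-1} - I) L, accumulating edge weights over all paths."""
        n = A.shape[0]
        path_sum = np.linalg.inv(np.eye(n) - A) - np.eye(n)   # sum_{k>=1} A^k
        return L_mat.T @ path_sum @ L_mat

    # Toy graph: 3 vertices, 2 labels, edge weights below 1.
    A = np.array([[0.0, 0.5, 0.0],
                  [0.0, 0.0, 0.5],
                  [0.0, 0.0, 0.0]])
    L_mat = np.array([[1.0, 0.0],
                      [0.0, 1.0],
                      [1.0, 0.0]])
    G = graph_matrix(A, L_mat)
    print(G)  # the non-zero entries are the graph features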

Corpus Weighting for Mixing Corpora

The following corpora were used: LLL, AIMed, BioInfer, HPRD, and IEPA. In order to draw useful information from the source corpora and obtain a better model for the target corpus, the authors use SVM-CW, which has previously been used as a Domain Adaptation (DA) method. Given a set of instance-label pairs (x_i, y_i), with examples 1, ..., n_s drawn from the source corpora and examples n_s + 1, ..., n_s + n_t from the target corpus, the following optimization problem is solved:

\min_{w} \; \frac{1}{2} w^\top w \;+\; C_s \sum_{i=1}^{n_s} \ell(w; x_i, y_i) \;+\; C_t \sum_{i=n_s+1}^{n_s+n_t} \ell(w; x_i, y_i)

where w is the weight vector, ℓ is a loss function, and n_s and n_t are the numbers of source and target examples respectively. C_s and C_t are penalty parameters. A squared hinge loss is used: ℓ(w; x_i, y_i) = max(0, 1 − y_i w^T x_i)^2. The problem, excluding the second (source) term, is equal to L2-SVM on the target corpus. The problem can be solved using the DCD method. As an ITL method, SVM-CW weights each corpus and tries to benefit from the source corpora by adjusting the effect of their compatibility and incompatibility.
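
One way to approximate this objective with an off-the-shelf solver is to give source and target examples different per-example weights in an L2-regularized squared-hinge-loss SVM. The sketch below does this with scikit-learn's LinearSVC; it is an approximation of SVM-CW for illustration, not the authors' implementation, and the C_s and C_t values are placeholders to be tuned.

    import numpy as np
    from sklearn.svm import LinearSVC

    def fit_svm_cw(X_src, y_src, X_tgt, y_tgt, C_s=0.25, C_t=1.0):
        """Weight source examples by C_s/C_t relative to target examples."""
        X = np.vstack([X_src, X_tgt])
        y = np.concatenate([y_src, y_tgt])
        sample_weight = np.concatenate([
            np.full(len(y_src), C_s / C_t),   # source corpora
            np.ones(len(y_tgt)),              # target corpus
        ])
        clf = LinearSVC(C=C_t, loss="squared_hinge")
        clf.fit(X, y, sample_weight=sample_weight)
        return clf

With C set to C_t, the per-example weights rescale the loss on the source examples so that the combined objective matches the weighted sum above.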

Experimental Result

Evaluation on Single Corpus

Table 1

Using the feature vector described in the previous sections, the authors applied five different linear classifiers to extract PPI from AIMed: L2-SVM, the 1-norm soft-margin SVM (L1-SVM), logistic regression (LR), the averaged perceptron (AP), and confidence weighted linear classification (CW). Table 1 shows the performance of these classifiers on AIMed. AP and CW are worse than the other three methods, because they require a large number of examples and are unsuitable for the current task. This result indicates that all of the linear classifiers, with the exception of AP and CW, perform almost equally well when using the proposed feature vector.

Evaluation of Corpus Weighting

Table 2

The authors compare SVM-CW with three other methods: aSVM, SVD-ASO, and TrAdaBoost. For this comparison, they used their feature vector without the graph features, because SVD-ASO and TrAdaBoost require large computational resources. Table 2 shows the results of the comparison. SVM-CW improved the classification performance at least as much as all the other methods. The improvement is mainly attributed to the aggressive use of source examples while learning the model. Since aSVM transfers a model, and SVD-ASO transfers an additional feature space, neither uses the source examples while learning the model. In addition to this difference in data usage, the settings of aSVM and SVD-ASO do not match the current task. As for aSVM, the DA assumption (that the labels are the same) does not match the task. TrAdaBoost does use the source examples while learning the model, but it never increases their weight; instead it attempts to reduce their effect.

Related papers

Bunescu et al., 2005 proposed a graph-kernel-based approach for the automated extraction of PPIs from scientific literature. Miwa et al., 2008 proposed the use of multiple kernels based on multiple parsers.