
JOURNAL OF COMPUTING, VOLUME 3, ISSUE 11, NOVEMBER 2011, ISSN 2151-9617

A Hybrid Ranking Method For Constructing Negative Datasets of Protein-Protein Interactions


Jumail Taliba, Razib M. Othman, Umi K. Hassan and Rosfuzah Roslan
Abstract: The lack of negative examples in the study of computational Protein-Protein Interaction (PPI) prediction is a crucial problem, and it has led to computational methods for creating such examples. Most of these methods rely on the observation that proteins which share no common information tend not to interact. Using this observation as the basis for selecting non-PPI pairs may yield a negative dataset with high prediction accuracy, but it also introduces more bias because the selection is too restrictive. Other methods simply use random selection as a fairer alternative; however, these approaches do not guarantee prediction accuracy. This paper proposes AIDNIP, a method for constructing non-PPI datasets that is a hybrid of the two approaches. It can therefore reduce selection bias while maintaining prediction accuracy. When compared to the existing methods using a Support Vector Machine-based PPI predictor, the proposed method performs better in several of the metrics investigated in this study. The Perl code and data used in this study are publicly available at https://sites.google.com/a/fsksm.utm.my/aidnip.

Index Terms: Non-Interacting Protein Pairs, Negative Datasets, Protein-Protein Interactions Prediction.

1 INTRODUCTION
PROTEIN-PROTEIN Interactions (PPIs) play an important role in many biological systems. This draws much attention from the bioinformatics community to devise efficient computational methods for predicting such interactions. Recently, the trend has been towards machine-learning based methods, and many studies have reported that such methods give better performance [7,9,11,27,33]. However, the reliability of these methods is arguable, as they are highly dependent on the knowledge used to train their models. Negative datasets strongly affect the performance of machine-learning methods [7,33]. They are often constructed from hypothetical pairs, in which case the predictors are most probably trained with false information mixed in. Despite the debatable use of hypothetical pairs, artificial negative datasets are still widely accepted because experimentally verified non-interacting pairs are unavailable. There are two commonly used methods to construct such datasets: the random method [3,8,19,24] and the method based on non-co-associations (NCA) [13,14,22,27]. Often, the initial list of negative examples for both methods is derived by choosing hypothetical pairs that do not appear in positive databases. The list is treated as a theoretical population of negative pairs and fed into the negative sample selection process. In random selection, each negative pair is given an equal chance of being chosen, making the selection considerably fair. However, the randomization process tends to yield a uniform sample; in other words, the resulting sample is likely to contain all patterns of negative pairs present in its population. This makes the sample more complicated and leads to difficulties in classification. Furthermore, the random method has another potential drawback: the chosen list could be contaminated with positive pairs [10,12,13,19,29], mainly because the annotation of PPIs is an ongoing process, so the chosen list may contain unverified positive pairs.

An NCA-based method takes a broader view of PPIs, in which two proteins are considered related even without a physical interaction as long as they exhibit a co-association. Interacting proteins tend to be localized in the same cellular compartments [5,6,20,23] and to perform similar general functions [20,23,30]. Moreover, many proteins that belong to the same biological pathways share the same functional relationship [26,29], and proteins in the same complex are most likely to interact with each other [2,25,30]. Because the degree of co-association varies, the point at which proteins are considered not to interact is unclear. Researchers cope with this uncertainty by constraining the co-association at certain levels. Sun et al. [27] constrained the co-association at broad functional categories, considering a pair negative if the corresponding proteins were from different orthology groups at the first level. Guo et al. [11], on the other hand, paired two proteins from different sets of cellular localizations to form a negative pair, while Qi et al. [22] imposed the constraint that a pair was not interacting if its proteins were not involved in the same complex or sub-complexes. However, an extensive investigation by Ben-Hur and Noble [4] pointed out that imposing a non-co-association constraint for negative examples tends to yield biased distributions and leads to over-optimistic estimates of prediction accuracy. They demonstrated that such a constraint makes it easier for a classifier to discriminate between interacting and non-interacting protein pairs. The effect gets stronger as the constraint becomes stricter (i.e., with smaller co-association cutoffs).


This is because the distribution of the negative dataset becomes less complicated when it is taken at smaller cutoffs, which consequently makes the learning process much easier. Motivated by this, a method to construct negative datasets for use in PPI prediction, named AIDNIP, is proposed. The key idea of the proposed method is to make the random and NCA-based methods complement each other. This is achieved by utilizing two types of scores for ranking. Through this integration, the method is capable of balancing selection bias and prediction accuracy. The performance of AIDNIP is validated using a Support Vector Machine (SVM)-based PPI prediction method with the Gaussian kernel function, in conjunction with domain profiles as feature vectors. The validations demonstrate that the negative datasets created by AIDNIP perform better than the datasets obtained by existing methods.

2 METHODS

2.1 Data Sources
In this study, the focus is on PPIs in E. coli K-12. The gold standard positive dataset was downloaded from the Database of Interacting Proteins (DIP) [31]. Loop pairs (self-interactions) were excluded to avoid over-representation by certain proteins. The resulting list contained 5,561 physical PPI pairs covering 1,383 proteins. Several types of protein genomic features were obtained from the EchoBase database [21] for cellular localizations, EcoCyc [18] for protein complexes, and KEGG [16] for pathways and functional associations. From EchoBase, all available localization types were extracted: cytoplasmic, inner membrane, periplasmic, outer membrane, and extracellular. The following files were downloaded: protcplxs.col from EcoCyc, and eco_ko.list and eco_pathway.list from KEGG. For the purpose of validation, pair vectors for the SVM-based PPI predictor were constructed by concatenating the domain profiles of both proteins of the respective pairs. A profile is a binary string indicating the presence (ones) or absence (zeros) of a particular domain in a protein [1]. The domain information was gathered from the KEGG database by parsing the file eco_pfam.list and removing domains that do not occur in any of the proteins of interest, which left 1,852 domains for each protein profile.
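A pair vector of this kind is simply the concatenation of two binary domain profiles. The following is a minimal sketch in Python (the original study used Perl); the domain list and protein annotations are invented placeholders, not the actual content of eco_pfam.list.

```python
# Sketch of building SVM feature vectors from binary domain profiles.
# DOMAINS and PROTEIN_DOMAINS are hypothetical placeholders; in the study the
# profiles were derived from the KEGG file eco_pfam.list.

DOMAINS = ["PF00001", "PF00005", "PF00072"]           # hypothetical Pfam domains

PROTEIN_DOMAINS = {                                   # hypothetical annotations
    "b0001": {"PF00005"},
    "b0002": {"PF00001", "PF00072"},
}

def domain_profile(protein):
    """Binary profile: 1 if the domain occurs in the protein, else 0."""
    annotated = PROTEIN_DOMAINS.get(protein, set())
    return [1 if d in annotated else 0 for d in DOMAINS]

def pair_vector(protein_a, protein_b):
    """Feature vector of a pair: concatenation of both domain profiles."""
    return domain_profile(protein_a) + domain_profile(protein_b)

print(pair_vector("b0001", "b0002"))   # -> [0, 1, 0, 1, 0, 1]
```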

2.2 Quality Metrics
The quality of a negative dataset is assessed in terms of its prediction accuracy, selection bias, and cleanness. These metrics are described below.

Accuracy: This metric quantifies how well the negative dataset helps in predicting PPIs and is reported as an ROC score, i.e., the area under the receiver operating characteristic curve (the curve that plots the true positive rate as a function of the false positive rate).

Selection Bias: This metric measures the fairness of a method at selecting pairs for a negative dataset. A fair dataset should be chosen at random such that each pair has the same chance of being selected. The selection bias is quantified by the root mean squared deviation as follows:

Bias = \sqrt{\frac{\sum_{i=1}^{c} (s_i - p_i)^2}{c}}     (1)

where s_i and p_i are the frequencies of pairs of the sample and the population, respectively, that fall into class i, and c is the number of classes. A smaller value of this metric indicates a better quality negative dataset. Note that a class is a group of pairs whose interaction scores fall within a certain range, and the range size is the same for all classes. The theoretical population of negative pairs is estimated by pairing all proteins employed in this study and then excluding the interacting protein pairs; this yields approximately one million negative pairs.

Cleanness: This metric describes the contamination level of potential interacting pairs in a negative dataset. The fewer interacting pairs the dataset contains, the cleaner it is. The cleanness of a dataset is measured as the fraction of true negative pairs in it, given by:

Cleanness = \frac{n}{|d|}     (2)

where n is the number of true negative pairs and |d| is the size of the negative dataset.
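For concreteness, the two metrics can be computed as in the Python sketch below. The equal-width binning of interaction scores, the number of classes, and the score range are illustrative assumptions; equation (1) is the RMSD between the class frequencies of the sample and the population, and equation (2) is the fraction of true negatives in the dataset.

```python
import math

def class_frequencies(scores, n_classes, lo, hi):
    """Relative frequency of scores falling into each of n_classes
    equal-width bins spanning [lo, hi] (binning scheme assumed)."""
    counts = [0] * n_classes
    width = (hi - lo) / n_classes
    for s in scores:
        i = min(int((s - lo) / width), n_classes - 1)
        counts[i] += 1
    total = len(scores)
    return [c / total for c in counts]

def selection_bias(sample_scores, population_scores, n_classes=20, lo=0.0, hi=1.0):
    """Equation (1): RMSD between sample and population class frequencies.
    n_classes=20 is an illustrative choice, not a value stated in the paper."""
    s = class_frequencies(sample_scores, n_classes, lo, hi)
    p = class_frequencies(population_scores, n_classes, lo, hi)
    return math.sqrt(sum((si - pi) ** 2 for si, pi in zip(s, p)) / n_classes)

def cleanness(dataset, known_positives):
    """Equation (2): fraction of pairs in the negative dataset that are
    truly non-interacting (not found among known positive pairs)."""
    true_negatives = sum(1 for pair in dataset if pair not in known_positives)
    return true_negatives / len(dataset)
```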

2.3 The Proposed Method


The capabilities and weaknesses of the existing methods, i.e., random selection and NCA-based selection, inspired this study to blend them in order to achieve better performance. However, how they are to be combined presents a significant challenge. A method for sampling negative pairs named AIDNIP is developed with the aim of reducing selection bias while maintaining prediction accuracy. It is a hybrid ranking method that employs two scoring types: interaction scores and uniform scores. The former makes the proposed method behave like an NCA-based method, so as to obtain higher prediction accuracy, whereas the latter gives each pair a uniform probability of selection in order to achieve the effect of random selection. Based on these two scorings, two ranking types, termed the C and R ranks respectively, are derived. The details of the method are described in the following sections.


2.3.1 Obtaining C Ranks from Interaction Scores
The first step in the proposed method is to rank each pair on its capability of predicting PPIs. For this, each pair must be assigned a prediction score, which can be estimated from the co-association between its proteins [2,4,5,20,26]. Co-association levels can be calculated directly from genomic features shared between the respective proteins, such as functions, localizations, complexes, and pathways. In the proposed method, however, the co-associations are estimated from the levels of interaction between the proteins. This idea is based on the fact that interacting proteins tend to be co-associated [6,23,25,29]. Using probability theory, the interaction score of each pair is estimated from the physical interaction network of the DIP database [31]. Initially, the chance of each protein being involved in interacting pairs is calculated as follows:

p(P \cap X) = p(P \mid X) \, p(X)     (3)

where P is any event of selecting a positive pair, and X is any event of selecting a pair that contains protein X. Both probabilities on the right-hand side are obtained by:

p(X) = \frac{2}{N}     (4)

p(P \mid X) = \frac{P_X}{N}     (5)

where N and P_X are the number of proteins and the number of positive pairs containing protein X, respectively. Finally, the interaction score of a pair formed by proteins X_1 and X_2 is taken as the sum of the probabilities of the two proteins, expressed by:

Interaction = p(P \cap X_1) + p(P \cap X_2)     (6)

After assigning interaction scores, pairs are ranked by these scores; this ranking is termed the C rank. The higher the score, the larger the C value of the corresponding pair and hence the lower its chance of being selected for the negative dataset. The highest ranked pair is the one with the smallest Interaction score, and its C rank value is 1. Pairs in the same score class are given the same rank.
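Under the reading of equations (3) to (6) reconstructed above (with N the total number of proteins and P_X the number of positive pairs containing X), the interaction score of a pair can be computed as in this hedged sketch; the function and protein names are illustrative, not taken from the authors' code.

```python
def interaction_score(x1, x2, positive_pairs, n_proteins):
    """Equations (3)-(6), as reconstructed above: sum of p(P and X) for both
    proteins of a pair, with p(X) = 2 / N and p(P | X) = P_X / N."""
    def p_positive_and_x(x):
        p_x = sum(1 for (a, b) in positive_pairs if x in (a, b))   # P_X
        return (p_x / n_proteins) * (2.0 / n_proteins)             # p(P|X) * p(X)
    return p_positive_and_x(x1) + p_positive_and_x(x2)

# Toy interaction network with hypothetical protein names.
positives = {("b0001", "b0002"), ("b0002", "b0003")}
print(interaction_score("b0001", "b0003", positives, n_proteins=4))
```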

2.3.2 Obtaining R Ranks from Uniform Scores
The next step is to assign each pair a second rank type, whose purpose is to achieve a random-like selection. Data selected with equal probability tend to follow a uniform distribution, so the main idea is to assign each pair a score that theoretically comes from a uniform distribution. The reasoning is that giving each pair an equal chance of selection reduces selection bias. In particular, each pair is initially classified by its C rank. Then, within each score class, the pairs belonging to that class are ranked in incremental order. Finally, the uniform score of a pair is taken as the normalized value of its within-class ranking, expressed as:

Uniform = \frac{r_c}{|c|}     (7)

where r_c is the order number of the corresponding pair in class c and |c| is the size of the class. As this initial ranking only orders pairs within the same class, the pairs from all classes then need to be grouped and re-ordered. Each pair is then assigned an R value, its rank according to the Uniform score: the pair with the smallest score is assigned an R value of 1, while pairs with higher scores are assigned correspondingly higher R values.
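As a sketch of equation (7) and the subsequent pooled re-ranking, the following Python fragment shows one possible reading; the within-class ordering of pairs is assumed to follow their list order.

```python
def uniform_scores(pairs_in_class):
    """Equation (7): Uniform = r_c / |c|, where r_c is the order number of a
    pair within its interaction-score class c and |c| is the class size."""
    size = len(pairs_in_class)
    return {pair: (rank + 1) / size for rank, pair in enumerate(pairs_in_class)}

def r_ranks(uniform_by_pair):
    """Assign R values over the pooled pairs: the pair with the smallest
    Uniform score gets R = 1, larger scores get larger R values."""
    ordered = sorted(uniform_by_pair, key=uniform_by_pair.get)
    return {pair: r + 1 for r, pair in enumerate(ordered)}
```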

2.3.3 Integrating the Rankings
The final ranking score for each pair is obtained by combining the normalized values of both ranking types. The equation for the ranking score is as follows:

Rank = \frac{C}{C_{max}} + \frac{R}{R_{max}}     (8)

where C_{max} and R_{max} are the largest values of the C and R rankings, respectively. Pairs for the negative dataset are taken based on their Rank values, such that pairs with smaller values have higher chances of being selected.
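A direct transcription of equation (8) is shown below as an illustrative sketch; pairs with the smallest combined Rank values are the first candidates for the negative dataset.

```python
def combined_rank(c_rank, r_rank, c_max, r_max):
    """Equation (8): Rank = C / C_max + R / R_max (normalized ranks summed)."""
    return c_rank / c_max + r_rank / r_max

def select_negative_pairs(c_ranks, r_ranks, sample_size):
    """Pick the pairs with the smallest combined Rank values."""
    c_max = max(c_ranks.values())
    r_max = max(r_ranks.values())
    scored = {p: combined_rank(c_ranks[p], r_ranks[p], c_max, r_max) for p in c_ranks}
    return sorted(scored, key=scored.get)[:sample_size]
```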

2.3.4 The Algorithm
In order to reduce the complexity of the ranking process, the calculations of the C and R ranks are placed in the same iteration instead of being executed in separate loops, as depicted in Fig. 1. The algorithm treats C and R as auxiliary counters running interchangeably. At line 6, only R moves, because the current and previous pairs have the same Interaction score (i.e., they are in the same class of C); this step ranks pairs within the same class. Lines 7 to 11 handle the case where the current pair belongs to a different class: C moves to the next class while R resets. In other words, pairs with different Interaction scores are put into different classes. Once every pair has its C and R values, both ranks are normalized and combined to produce the final ranks (lines 13 to 14).

 1: Calculate the Interaction score for each pair
 2: Sort the pair list by Interaction score in ascending order
 3: Initialize C and R to 1
 4: for each pair in the sorted list
 5:   Assign the pair its C and R values
 6:   if the Interaction scores of the pair and its predecessor are the same, increase R by 1
 7:   otherwise, do the following:
 8:     Calculate the Uniform score with the current C
 9:     Sort the pairs by their Uniform scores
10:     Increase C by 1
11:     Reset R to 1
12: end for
13: Normalize all C and R ranks
14: Combine the normalized ranks into the Rank scores

Fig. 1. Illustration of the ranking algorithm.
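For readers who prefer running code to pseudocode, the following is one possible Python rendering of the procedure in Fig. 1 and Sections 2.3.1-2.3.3; it is an interpretation of the figure, not the authors' Perl implementation, and it groups equal-score pairs into classes rather than mimicking the counter bookkeeping literally.

```python
from itertools import groupby

def aidnip_rank(pairs, interaction):
    """Interpretive rendering of the ranking procedure sketched in Fig. 1.
    `pairs` is a list of candidate pairs; `interaction` maps a pair to its
    Interaction score (equation (6))."""
    ordered = sorted(pairs, key=lambda p: interaction[p])      # Fig. 1, line 2

    c_rank, uniform = {}, {}
    c = 0
    # Pairs with the same Interaction score form one class (the C rank).
    for _, group in groupby(ordered, key=lambda p: interaction[p]):
        c += 1
        members = list(group)
        for r, pair in enumerate(members, start=1):
            c_rank[pair] = c
            uniform[pair] = r / len(members)                   # equation (7)

    # R rank: position of each pair when all pairs are re-ordered by Uniform score.
    pooled = sorted(pairs, key=lambda p: uniform[p])
    r_rank = {pair: i for i, pair in enumerate(pooled, start=1)}

    c_max, r_max = max(c_rank.values()), max(r_rank.values())
    # Final combined Rank (equation (8)); smaller values are selected first.
    return {p: c_rank[p] / c_max + r_rank[p] / r_max for p in pairs}
```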

3 RESULTS AND DISCUSSIONS

This section describes the performance evaluation of the proposed method in creating datasets of non-PPI pairs. The method is benchmarked against the existing selection methods: the methods based on protein co-associations and a random selection method.

As for the former approach, four types of score were utilized in order to achieve more comprehensive comparisons, namely co-function, co-localization, co-complex, and co-pathway; the corresponding methods are termed NCF, NCL, NCC, and NCP, respectively. The co-association scores are calculated as the fraction of shared information between the respective proteins X_1 and X_2 for the corresponding genomic feature (i.e., functions, localizations, complexes, or pathways), as expressed by:

Co-association = \frac{|X_1 \cap X_2|}{|X_1 \cup X_2|}     (9)
Fig. 2. Comparison of selection bias between different selection methods on different co-association types. As an illustration, applying the method NCF to the Localization dataset means that NCF uses co-function scores to select the negative pairs and co-localization scores are then used to calculate the bias. Each NCA-based method is not applicable to the dataset of its own co-association type; for example, the NCF method was not employed on the Function dataset.

For each NCA-based selection method, the negative dataset was created by selecting pairs with zero co-association scores. The performance of each negative dataset was evaluated by employing it as the negative examples for PPI prediction with an SVM-based method. In particular, the software package SVMLight [15] was employed with the Gaussian kernel, and the feature vectors were constructed from protein domain profiles. Both the positive and negative example datasets were partitioned into equal-sized training and testing sets. The benchmarking focuses on three aspects: how fair the methods are in selecting non-PPI pairs for the negative dataset, how well the datasets help the prediction of PPIs, and how clean the datasets are. The following sections describe the details of each evaluation.
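Reading equation (9) as the fraction of shared annotations (intersection over union), an NCA-style negative set can be assembled as in the sketch below; the annotation dictionary and function names are made-up placeholders, and the zero-score cutoff follows the description above.

```python
def co_association(annotations_x1, annotations_x2):
    """Equation (9), read as |X1 and X2| / |X1 or X2| over the shared
    annotations (functions, localizations, complexes, or pathways)."""
    x1, x2 = set(annotations_x1), set(annotations_x2)
    if not x1 | x2:
        return 0.0
    return len(x1 & x2) / len(x1 | x2)

def nca_negative_set(candidate_pairs, annotations):
    """NCA-style selection: keep only pairs with a zero co-association score."""
    return [(a, b) for a, b in candidate_pairs
            if co_association(annotations.get(a, ()), annotations.get(b, ())) == 0.0]
```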

3.1 Selection Bias


An important aspect of any selection method is that the sampled dataset should represent its population; in other words, all items in the population should have the same chance of being selected so that the sampling is not biased. This is the basis of the evaluation measuring the fairness of each method in constructing a negative dataset. The theoretical population of non-PPI pairs was sampled with each of the selection methods benchmarked in this study, and the selection biases were then measured by comparing the distributions of the resulting samples to that of the population, using (1). Fig. 2 shows the selection biases yielded by the methods over the four co-association types. One should note that it is meaningless to compare biases between datasets, as their scales differ; the intention is to benchmark the selection methods on one dataset at a time and to repeat this over several datasets, which promotes a fairer evaluation. Random sampling has the major advantage of reducing bias, since it gives all candidates equal chances of being selected, and can therefore serve as a reference for benchmarking methods in terms of bias. As illustrated in Fig. 2, the Random method generally shows low biases in all datasets, which tallies with this expectation. Although the Random method already results in fair selections, the proposed method, AIDNIP, yields even better results. Sampling by

AIDNIP significantly reduces the biases in all datasets; in fact, there are large reductions of bias in three of the datasets (i.e., all except the Complex dataset). The improvement of AIDNIP over the Random method is achieved through the R ranking, which keeps the distribution of the selected list close to that of its population, whereas the Random method does not employ any constraint and only gives each pair an equal probability. The NCA-based methods are generally more biased, and indeed the majority of such methods investigated in this study are among the most biased. This is due to their selective behavior: they enforce non-co-association constraints for the selection of non-PPIs, so the resulting list only covers particular types of pairs (i.e., those with low co-association scores) and neglects non-PPI pairs with high scores. Of the four NCA-based methods, NCC and NCP were observed to be the worst. This finding suggests not using shared complexes or pathways between proteins as the constraint for selecting non-PPIs. The NCL method, on the other hand, is superior among the NCA-based methods and performs better than, or at least on par with, the Random method in all datasets. This observation implies that, when creating a negative dataset of PPIs with an NCA-based method, co-localization features are the more appropriate choice. Although the selection bias of the NCL method is the lowest among the NCA-based methods, it still cannot outperform AIDNIP.

3.2 Reliability Of Non-Interacting Pairs


A general assumption for non-PPIs is that their proteins are most likely to have different functional associations: lower co-association scores between proteins indicate higher reliability for negative examples. In this regard, we estimated the reliability of a dataset by calculating the average co-association score over all pairs in the dataset, based on different genomic features. As the annotations of some proteins may be incomplete, the average is taken over the pairs with complete annotations.


TABLE 1
COMPARISON OF RELIABILITY OF NEGATIVE DATASETS CREATED BY DIFFERENT METHODS

Negative    Average similarity score                       Fraction of high-       Positive database
dataset     Function   Localization   Complex   Pathway    confidence non-PPIs     STRING    IntAct
Positive    0.165      0.873          0.135     0.172      6%                      27%       1.1%
Random      0.129      0.719          0.061     0.135      22.6%                   18%       0.2%
AIDNIP      0.122      0.644          0.042     0.124      90.4%                   1.2%      0.1%
NCF         0.067      0.623          0.026     0.102      27%                     0.8%      0.4%
NCL         0.071      0.493          0.021     0.096      31%                     0.8%      0.1%
NCC         0.079      0.498          0.007     0.089      22%                     0.7%      0.4%
NCP         0.081      0.514          0.006     0.055      80%                     1.4%      0.6%

A control dataset (denoted as Positive) was created, comprising the list of interacting protein pairs from the DIP database. As expected, the Positive dataset shows the highest similarity score in all score types (Table 1). The Random dataset yields high similarity scores in all categories, implying that it selects not only negative examples but also positive ones. The negative datasets obtained by the NCA-based methods have lower similarity scores (i.e., they are more reliable); this is indeed the effect of enforcing the non-co-association constraint. The negative dataset from AIDNIP yields lower scores than the Random method in all categories, indicating that AIDNIP is more reliable. Table 1 also shows that the similarity scores of the AIDNIP dataset are on par with those of most of the NCA-based methods, further supporting the reliability of AIDNIP. The reliability benchmarking also took into account the presence of high-confidence non-PPIs in the selected list. Ideally, high-confidence non-PPIs are pairs with confirmed evidence; however, as such pairs are unavailable, a generous assumption was made that such pairs can be obtained from proteins not showing any similarity at all. Table 1 shows that the negative dataset from AIDNIP is generally more reliable than most of the other datasets (with the exception of NCL), as it contains more non-PPIs with high confidence. Further comparisons were conducted by employing two established positive databases, STRING [28] and IntAct [17], to see how many PPIs appear in the negative datasets. As can be seen in Table 1, the negative dataset from AIDNIP shows low rates in both databases, suggesting that the reliability of the dataset is justified. These rates are also on par with those of the NCA-based methods, indicating that AIDNIP is capable of selecting protein pairs with low interaction signals.

3.3 Cleanness of Negative Dataset
Negative datasets for use in the prediction of PPIs must be clean in order to ensure reliability and confidence. A dataset is ideally clean when it consists only of non-PPI pairs (i.e., there is no contamination from positive pairs at all). However, obtaining contamination-free datasets is impossible due to noise in the sampled data; thus, the level of cleanness is estimated by quantifying the fraction of interacting pairs in the respective constructed negative dataset. Because noised datasets (i.e., lists of non-PPIs containing PPIs) are unavailable, such datasets were created by injecting a fraction of positive pairs into a list of negative pairs. These datasets were treated as theoretical populations of non-PPIs for the evaluations in this section. Each evaluation was conducted by feeding an initial list of negative examples with confirmed positive pairs and then applying the corresponding negative selection method to the list; the selected list was then evaluated for its cleanness using (2). The same procedure was employed, with the same theoretical population of negative examples, for all of the methods investigated in this study and over different contamination rates ranging from 5 to 40 percent. The results show that the Random method and most of the NCA-based methods are very much dependent on the contamination rate (Fig. 3), whereas AIDNIP and the NCL method remain consistent over the different contamination rates. This implies that these two methods can filter out potential positive pairs better than the others, regardless of the contamination level.

Fig. 3. Cleanness of negative datasets obtained from different selection methods.
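The contamination experiment described above can be reproduced with a few lines of Python; the function names, the random seed, and the `select` callable standing in for whichever selection method is being evaluated are illustrative placeholders.

```python
import random

def contaminate(negative_pairs, positive_pairs, rate, seed=0):
    """Inject a fraction `rate` of known positive pairs into a negative list,
    mimicking the noised populations used in the cleanness evaluation."""
    rng = random.Random(seed)
    n_inject = int(rate * len(negative_pairs))
    injected = rng.sample(list(positive_pairs), min(n_inject, len(positive_pairs)))
    return list(negative_pairs) + injected

def cleanness_after_selection(select, negatives, positives, rate):
    """Apply a selection method to a contaminated population and report the
    fraction of selected pairs that are truly negative (equation (2))."""
    population = contaminate(negatives, positives, rate)
    chosen = select(population)
    return sum(1 for p in chosen if p not in set(positives)) / len(chosen)
```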

3.4 Prediction Efficiency


A good dataset of non-PPIs is one that aids the classification of interacting and non-interacting pairs. Accordingly, in this section the effectiveness of each selection method is benchmarked by measuring how well the negative dataset constructed by the method helps the PPI predictor. The comparisons were made using the ROC metric, which represents the prediction efficiency. Theoretically, a negative dataset selected by an NCA-based method should perform better, as such a method only selects particular sets of pairs (i.e., those with low interaction signals), which makes the classification easier. The result of this evaluation is in line with this assumption (Fig. 4).

Fig. 4. Comparison of prediction efficiency in terms of ROC over different negative datasets in predicting PPIs.

Fig. 4 shows that the negative datasets from the NCA-based methods in general yield high prediction accuracy, whereas the prediction accuracy of the Random method is the worst. The main reason for the poor performance of the Random method is that the created dataset most likely contains all possible kinds of pairs, making it more complicated and thus leading to a more difficult classification. On the other hand, the dataset obtained by AIDNIP is superior at helping the prediction of PPIs. This best performance of AIDNIP is in fact achieved through the use of the C ranking, which puts negative pairs with closely related co-association scores into similar classes and therefore leads to better classification.

4 CONCLUSION
Using computational methods for predicting PPIs is widely accepted in the bioinformatics community. However, the success of such methods depends heavily on the knowledge fed to their models. One of the major hurdles faced by these methods is obtaining knowledge of non-interacting protein pairs, as such pairs are unavailable; researchers often end up employing a computational method to create them. Many past studies concerning the prediction of PPIs did not pay much attention to the preparation of negative examples, as their focus was mainly on the PPIs themselves. They usually employed random selection in order to obtain unbiased datasets; however, the resulting negative list did not guarantee prediction accuracy and could have a higher contamination of potential positive pairs. Other studies were more careful in choosing negative examples by restricting them to pairs whose proteins were non-co-associated. Such a restrictive selection was able to obtain high-quality negative datasets in terms of prediction accuracy and cleanness; however, this approach was deemed to be more biased due to its restrictiveness. The pros and cons of the two common selection methods for negative pairs have inspired this study to make them complement each other. In this paper, a hybrid approach for selecting non-PPI pairs named AIDNIP is introduced. Hybrid denotes that it consists of two scoring types, uniform scores and co-association scores, mimicking the random selection method and the method based on protein co-association, respectively. The major advantage of AIDNIP is its capability to create unbiased samples without sacrificing prediction accuracy. AIDNIP was benchmarked against the existing methods by measuring the performance of the created negative datasets in terms of several quality metrics. Generally, AIDNIP performs better than the other compared methods. In terms of cleanness and prediction accuracy, AIDNIP performs the best. The selection bias of AIDNIP is on par with that of random selection and much lower than that of the NCA-based methods. The reliability of the negative dataset from AIDNIP is also justified, as the co-association scores between its proteins are small. Therefore, AIDNIP is useful particularly in assisting research on machine-learning based PPI prediction, where it can be utilized as a supporting tool for obtaining more reliable and concrete results.

ACKNOWLEDGMENT
This work was funded by the Ministry of Higher Education Malaysia under the Academic Program Scheme.

REFERENCES
[1] H. Alashwal, S. Deris, and R. M. Othman, "One-class support vector machines for protein-protein interactions prediction," Inter. J. of Biomedical Sciences, vol. 1, pp. 120-127, 2006.
[2] Z. Bar-Joseph, G. K. Gerber, T. I. Lee, N. J. Rinaldi, J. Y. Yoo, F. Robert, D. B. Gordon, E. Fraenkel, T. S. Jaakkola, R. A. Young, and D. K. Gifford, "Computational discovery of gene modules and regulatory networks," Nat Biotechnol, vol. 21, pp. 1337-42, Nov 2003.
[3] A. Ben-Hur and W. S. Noble, "Kernel methods for predicting protein-protein interactions," Bioinformatics, vol. 21 Suppl 1, pp. i38-46, Jun 2005.
[4] A. Ben-Hur and W. S. Noble, "Choosing negative examples for the prediction of protein-protein interactions," BMC Bioinformatics, vol. 7 Suppl 1, p. S2, 2006.
[5] P. M. Bowers, M. Pellegrini, M. J. Thompson, J. Fierro, T. O. Yeates, and D. Eisenberg, "Prolinks: a database of protein functional linkages derived from coevolution," Genome Biol, vol. 5, p. R35, 2004.
[6] F. Browne, H. Wang, H. Zheng, and F. Azuaje, "GRIP: a web-based system for constructing Gold Standard datasets for protein-protein interaction prediction," Source Code Biol Med, vol. 4, p. 2, 2009.
[7] X. W. Chen and M. Liu, "Prediction of protein-protein interactions using random decision forest framework," Bioinformatics, vol. 21, pp. 4394-400, Dec 15 2005.
[8] R. A. Craig and L. Liao, "Phylogenetic tree information aids supervised learning for predicting protein-protein interaction based on distance matrices," BMC Bioinformatics, vol. 8, p. 6, 2007.
[9] A. J. Gonzalez and L. Liao, "Predicting domain-domain interaction based on domain profiles with feature selection and support vector machines," BMC Bioinformatics, vol. 11, p. 537, 2010.
[10] J. Guo, X. Wu, D. Y. Zhang, and K. Lin, "Genome-wide inference of protein interaction sites: lessons from the yeast high-quality negative protein-protein interaction dataset," Nucleic Acids Res, vol. 36, pp. 2002-11, Apr 2008.
[11] Y. Guo, L. Yu, Z. Wen, and M. Li, "Using support vector machine combined with auto covariance to predict protein-protein interactions from protein sequences," Nucleic Acids Res, vol. 36, pp. 3025-30, May 2008.
[12] G. T. Hart, A. K. Ramani, and E. M. Marcotte, "How complete are current yeast and human protein-interaction networks?," Genome Biol, vol. 7, p. 120, 2006.
[13] R. Jansen and M. Gerstein, "Analyzing protein function on a genomic scale: the importance of gold-standard positives and negatives for network prediction," Curr Opin Microbiol, vol. 7, pp. 535-45, Oct 2004.
[14] R. Jansen, H. Yu, D. Greenbaum, Y. Kluger, N. J. Krogan, S. Chung, A. Emili, M. Snyder, J. F. Greenblatt, and M. Gerstein, "A Bayesian networks approach for predicting protein-protein interactions from genomic data," Science, vol. 302, pp. 449-53, Oct 17 2003.
[15] T. Joachims, "Making large-scale SVM learning practical," in B. Scholkopf, C. Burges, and A. Smola (eds), Advances in Kernel Methods - Support Vector Learning, MIT Press, pp. 169-184, 1999.
[16] M. Kanehisa, M. Araki, S. Goto, M. Hattori, M. Hirakawa, M. Itoh, T. Katayama, S. Kawashima, S. Okuda, T. Tokimatsu, and Y. Yamanishi, "KEGG for linking genomes to life and the environment," Nucleic Acids Res, vol. 36, pp. D480-4, Jan 2008.
[17] S. Kerrien, Y. Alam-Faruque, B. Aranda, I. Bancarz, A. Bridge, C. Derow, E. Dimmer, M. Feuermann, A. Friedrichsen, R. Huntley, C. Kohler, J. Khadake, C. Leroy, A. Liban, C. Lieftink, L. Montecchi-Palazzi, S. Orchard, J. Risse, K. Robbe, B. Roechert, D. Thorneycroft, Y. Zhang, R. Apweiler, and H. Hermjakob, "IntAct - open source resource for molecular interaction data," Nucleic Acids Res, vol. 35, pp. D561-5, Jan 2007.
[18] I. M. Keseler, J. Collado-Vides, S. Gama-Castro, J. Ingraham, S. Paley, I. T. Paulsen, M. Peralta-Gil, and P. D. Karp, "EcoCyc: a comprehensive database resource for Escherichia coli," Nucleic Acids Res, vol. 33, pp. D334-7, Jan 1 2005.
[19] S. L. Lo, C. Z. Cai, Y. Z. Chen, and M. C. Chung, "Effect of training datasets on support vector machine prediction of protein-protein interactions," Proteomics, vol. 5, pp. 876-84, Mar 2005.
[20] M. A. Mahdavi and Y. H. Lin, "False positive reduction in protein-protein interaction predictions using gene ontology annotations," BMC Bioinformatics, vol. 8, p. 262, 2007.
[21] R. V. Misra, R. S. Horler, W. Reindl, I. I. Goryanin, and G. H. Thomas, "EchoBASE: an integrated post-genomic database for Escherichia coli," Nucleic Acids Res, vol. 33, pp. D329-33, Jan 1 2005.
[22] Y. Qi, Z. Bar-Joseph, and J. Klein-Seetharaman, "Evaluation of different biological data and computational classification methods for use in protein interaction prediction," Proteins, vol. 63, pp. 490-500, May 15 2006.
[23] R. Roslan, R. M. Othman, Z. A. Shah, S. Kasim, H. Asmuni, J. Taliba, R. Hassan, and Z. Zakaria, "Utilizing shared interacting domain patterns and Gene Ontology information to improve protein-protein interaction prediction," Comput Biol Med, vol. 40, pp. 555-64, Jun 2010.
[24] J. Shen, J. Zhang, X. Luo, W. Zhu, K. Yu, K. Chen, Y. Li, and H. Jiang, "Predicting protein-protein interactions based only on sequences information," Proc Natl Acad Sci U S A, vol. 104, pp. 4337-41, Mar 13 2007.
[25] P. Smialowski, P. Pagel, P. Wong, B. Brauner, I. Dunger, G. Fobo, G. Frishman, C. Montrone, T. Rattei, D. Frishman, and A. Ruepp, "The Negatome database: a reference set of non-interacting protein pairs," Nucleic Acids Res, vol. 38, pp. D540-4, Jan 2010.
[26] M. Strong, P. Mallick, M. Pellegrini, M. J. Thompson, and D. Eisenberg, "Inference of protein function and protein linkages in Mycobacterium tuberculosis based on prokaryotic genome organization: a combined computational approach," Genome Biol, vol. 4, p. R59, 2003.
[27] J. Sun, Y. Sun, G. Ding, Q. Liu, C. Wang, Y. He, T. Shi, Y. Li, and Z. Zhao, "InPrePPI: an integrated evaluation method based on genomic context for predicting protein-protein interactions in prokaryotic genomes," BMC Bioinformatics, vol. 8, p. 414, 2007.
[28] D. Szklarczyk, A. Franceschini, M. Kuhn, M. Simonovic, A. Roth, P. Minguez, T. Doerks, M. Stark, J. Muller, P. Bork, L. J. Jensen, and C. von Mering, "The STRING database in 2011: functional interaction networks of proteins, globally integrated and scored," Nucleic Acids Res, vol. 39, pp. D561-8, Jan 2011.
[29] C. von Mering, L. J. Jensen, B. Snel, S. D. Hooper, M. Krupp, M. Foglierini, N. Jouffre, M. A. Huynen, and P. Bork, "STRING: known and predicted protein-protein associations, integrated and transferred across organisms," Nucleic Acids Res, vol. 33, pp. D433-7, Jan 1 2005.
[30] C. von Mering, R. Krause, B. Snel, M. Cornell, S. G. Oliver, S. Fields, and P. Bork, "Comparative assessment of large-scale data sets of protein-protein interactions," Nature, vol. 417, pp. 399-403, May 23 2002.
[31] I. Xenarios, L. Salwinski, X. J. Duan, P. Higney, S. M. Kim, and D. Eisenberg, "DIP, the Database of Interacting Proteins: a research tool for studying cellular networks of protein interactions," Nucleic Acids Res, vol. 30, pp. 303-5, Jan 1 2002.
[32] N. Zaki, S. Lazarova-Molnar, W. El-Hajj, and P. Campbell, "Protein-protein interaction based on pairwise similarity," BMC Bioinformatics, vol. 10, p. 150, 2009.
[33] L. V. Zhang, S. L. Wong, O. D. King, and F. P. Roth, "Predicting co-complexed protein pairs using genomic and proteomic data integration," BMC Bioinformatics, vol. 5, p. 38, Apr 16 2004.

Jumail Taliba is a Lecturer at the Faculty of Computer Science and Information Systems. He received his B.Sc. and M.Sc. in Computer Science from the Universiti Teknologi Malaysia. He is currently pursuing his PhD at the Faculty of Computer Science and Information Systems, Universiti Teknologi Malaysia. His current research interests include protein-protein interaction prediction and artificial intelligence.

Razib M. Othman is the Director of the Laboratory of Computational Intelligence and Biotechnology at the Universiti Teknologi Malaysia. He received the B.Sc., M.Sc., and Ph.D. degrees in Computer Science from the Universiti Teknologi Malaysia in 1993, 2003, and 2008, respectively. His research interests are in the areas of computational intelligence, computational biology, and software engineering.


Umi K. Hassan is a Lecturer at the Department of Computer Science, Kolej Poly-Tech MARA. She received the B.Sc. and M.Sc. degrees in Computer Science, both from the Universiti Teknologi Malaysia, in 2006 and 2009, respectively. Her research interests focus on protein domain prediction and computational biology.

Rosfuzah Roslan is an Information Technology Officer at the Department of Information Management, Malaysian Ministry of Education. She received the B.Sc. and M.Sc. degrees in Computer Science, both from the Universiti Teknologi Malaysia, in 2007 and 2010, respectively. Her research interests focus on protein-protein interaction prediction and computational biology.
