Prediction of sub-cellular and sub-nuclear localization
Introduction
Prediction of Protein Sub-cellular and Sub-nuclear Localizations.
Assessing protein similarity with Gene Ontology and its use in subnuclear localization prediction.
Zhengdeng Lei and Yang Dai.
BMC Bioinformatics, 7:491, 2006.
Link to PubMed.
Summary
The completion of various genome sequencing projects has resulted in the accumulation of a massive amount of gene sequence information. This calls for large-scale computational methods for predicting protein localization from sequence. The localization of a protein can provide valuable information about its molecular function, as well as the biological pathway in which it participates. Predicting the localization of a protein at the subnuclear level is a challenging task. In our previous work we proposed an SVM-based system that uses protein sequence information for this prediction task. In this work, we assess protein similarity with Gene Ontology (GO) and then improve the performance of the system by adding a nearest-neighbor classifier module that uses a similarity measure derived from the GO annotation terms of protein sequences.
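As a rough illustration of a nearest-neighbor module driven by GO annotations, the sketch below scores two proteins by the overlap of their GO term sets and assigns a query protein the label of its most similar annotated neighbor. The Jaccard overlap is an assumed stand-in, not the similarity measure derived in the paper, and the function names, GO term sets, and localization labels are hypothetical.

```python
def go_similarity(terms_a, terms_b):
    """Jaccard overlap between two sets of GO term IDs (an assumed stand-in measure)."""
    a, b = set(terms_a), set(terms_b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def nearest_neighbor_label(query_terms, annotated):
    """Return (label, similarity) of the most similar annotated protein.

    annotated is a list of (go_terms, localization) pairs. If no annotated
    protein shares a GO term with the query, (None, 0.0) is returned.
    """
    best_label, best_sim = None, 0.0
    for terms, label in annotated:
        sim = go_similarity(query_terms, terms)
        if sim > best_sim:
            best_label, best_sim = label, sim
    return best_label, best_sim


# Hypothetical toy annotations, for illustration only.
training = [
    ({"GO:0005730", "GO:0006364"}, "nucleolus"),
    ({"GO:0016607", "GO:0008380"}, "nuclear speckle"),
]
label, sim = nearest_neighbor_label({"GO:0005730", "GO:0042254"}, training)
```

When no neighbor shares any GO term, the sketch returns no label, which is the point at which a sequence-based classifier such as the SVM system could plausibly take over.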
The performance of the new system proposed here was compared with that of our previous system using a set of proteins residing within 6 localizations, collected from the Nuclear Protein Database (NPD). The overall MCC (accuracy) is elevated from 0.284 (50.0%) to 0.519 (66.5%) for single-localization proteins in leave-one-out cross-validation; and from 0.420 (65.2%) to 0.541 (65.2%) for an independent set of multi-localization proteins.
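For readers unfamiliar with the Matthews correlation coefficient (MCC), the following sketch computes it for one localization treated as the positive class against all others. How the per-class values are aggregated into the "overall MCC" reported above is not stated on this page, so that step is left out; the function names and toy labels are hypothetical.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from binary confusion counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # common convention when a row or column of the table is empty
    return (tp * tn - fp * fn) / denom


def per_class_mcc(true_labels, predicted_labels, target):
    """MCC for one class, counting every other class as negative."""
    tp = tn = fp = fn = 0
    for t, p in zip(true_labels, predicted_labels):
        if t == target and p == target:
            tp += 1
        elif t != target and p != target:
            tn += 1
        elif t != target and p == target:
            fp += 1
        else:
            fn += 1
    return mcc(tp, tn, fp, fn)


# Toy labels for illustration only.
truth = ["a", "a", "b", "c"]
preds = ["a", "b", "b", "c"]
score_a = per_class_mcc(truth, preds, "a")
```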
This integrated prediction system can be accessed here
An SVM-based system for predicting protein subnuclear localizations.
Zhengdeng Lei and Yang Dai.
BMC Bioinformatics, 6:291, 2005.
Link to PubMed.
Summary
The large gap between the number of protein sequences in databases and the number of functionally characterized proteins calls for the development of a fast computational tool for the prediction of subnuclear and subcellular localizations generally applicable to protein sequences. The information on localization may reveal the molecular function of novel proteins, in addition to providing insight on the biological pathways in which they function.
The bulk of past work has been focused on protein subcellular localizations. Furthermore, no specific tool has been dedicated to prediction at the subnuclear level, despite its high importance. In order to design a suitable predictive system, the extraction of subtle sequence signals that can discriminate among proteins with different subnuclear localizations is the key. New kernel functions used in a support vector machine (SVM) learning model are introduced for the measurement of sequence similarity. The k-peptide vectors are first mapped by a matrix of high-scored pairs of k-peptides which are measured by BLOSUM62 scores. The kernels, measuring the similarity for sequences, are then defined on the mapped vectors.
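A k-peptide vector of the kind referred to above can be sketched as a count of every overlapping window of length k in the sequence. The version below assumes the 21-symbol alphabet (20 amino acids plus "X") described elsewhere on this page and normalizes counts to frequencies; the normalization, the symbol ordering, and the function name are assumptions rather than details taken from the paper.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWYX"  # 20 standard residues plus the symbol X


def kpeptide_composition(sequence, k):
    """Frequency vector over all 21**k possible k-peptides."""
    index = {"".join(p): i for i, p in enumerate(product(AMINO_ACIDS, repeat=k))}
    counts = [0.0] * len(index)
    windows = len(sequence) - k + 1
    for start in range(max(windows, 0)):
        peptide = sequence[start:start + k]
        # Map any unexpected character to X so every window has an index.
        peptide = "".join(c if c in AMINO_ACIDS else "X" for c in peptide)
        counts[index[peptide]] += 1.0
    if windows > 0:
        counts = [c / windows for c in counts]
    return counts


vector = kpeptide_composition("ACAC", 2)  # windows: AC, CA, AC
```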
By combining these new encoding methods, a multi-class classification system for the prediction of protein subnuclear localizations is established for the first time. The performance of the system is evaluated with a set of proteins collected from the Nuclear Protein Database (NPD). The overall prediction accuracy for 6 localizations is about 50% (vs. 16.7% for random prediction) for single-localization proteins in leave-one-out cross-validation, and 65% for an independent set of multi-localization proteins. The integrated system benefits from the combination of predictions from several SVMs based on selected encoding methods. Finally, the predictive power of the system is expected to improve as more proteins with known subnuclear localizations become available.
A class of new kernels based on a matrix of high-scored pairs of k-peptides and its applications in prediction of protein sub-cellular localization.
Zhengdeng Lei and Yang Dai.
LNCS Transactions on Computational Systems Biology II, Springer-Verlag, pp.48-58, 2005.
Link to PDF
Summary
Many cellular proteins that participate in a related pathway are compartmentalized in specific regions of the cell. The subcellular localization of a protein is recognized as a key element in understanding its function. Advances in proteomics and genome sequencing have generated enormous amounts of primary sequence data stored in genome databases. Thus a faster and cheaper bioinformatics tool is required to annotate the exponentially growing data. The prediction of protein sub-nuclear compartments from primary protein sequences may reveal the molecular function of novel proteins, and may also predict the biological pathways in which they function. Current best predictors include PSORTb, CELLO and Proteome Analyst.
Coding schemes for protein sequences based on conventional k-peptide compositions have proved effective in conjunction with support vector machines. In this work, we introduced a new SVM kernel. Each k-peptide coding vector is mapped onto a new vector based on a matrix formed by the high BLOSUM62 scores associated with pairs of k-peptides. This matrix, denoted D_k, has size 21^k × 21^k, where 21 is the number of amino acid symbols (the 20 standard amino acids plus the special symbol “X”). When k = 1, this matrix is the same as the BLOSUM62 matrix except that the entries with negative values are replaced by zeroes. When k = 2, each entry is the BLOSUM62 score corresponding to a pair of di-peptides, with negative values replaced by zero. For k = 3, the matrix D_3 is very large, so a threshold is used to keep only the entries with “high scores”. The resulting sparsity of the matrix leads to computational efficiency.
The set of proteins from Gram-negative bacteria used in the evaluation of PSORTb was used in our experiment to evaluate the new method. The dataset comprises proteins localized in one of five localization sites: cytoplasmic (248), inner membrane (268), periplasmic (244), outer membrane (352), and extracellular (190). We compared the performance of the new kernel with the corresponding encoding schemes based on the k-peptide compositions. The experiment was carried out with the one-versus-rest multi-class classification scheme. More specifically, each time the proteins with one specific localization were designated as the positive set, and the remaining proteins in the other four localizations were designated as the negative set. The radial basis function was chosen as the kernel function for the mapped encoding vectors D_k x_k, since a preliminary experiment showed that this kernel performed better.
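The mapping x_k → D_k x_k followed by a radial basis function kernel can be sketched as below. To stay short and self-contained, the sketch uses a hypothetical three-letter alphabet with illustrative substitution scores rather than the full 21-symbol BLOSUM62 matrix, and it assumes that the score of a pair of k-peptides is the sum of the per-position scores; neither choice is confirmed by the text above. All names and the gamma value are hypothetical.

```python
import math
from itertools import product

# Toy alphabet and scores: placeholders for illustration, not the full BLOSUM62.
ALPHABET = "AGW"
TOY_SCORES = {
    ("A", "A"): 4, ("A", "G"): 0, ("A", "W"): -3,
    ("G", "G"): 6, ("G", "W"): -2,
    ("W", "W"): 11,
}


def score(a, b):
    """Symmetric lookup of the substitution score for two residues."""
    return TOY_SCORES[(a, b)] if (a, b) in TOY_SCORES else TOY_SCORES[(b, a)]


def build_dk(k, threshold=0):
    """Sparse D_k as {(row, col): score}, keeping only scores above the threshold.

    With threshold=0, negative entries are dropped, i.e. treated as zero.
    Assumption: a k-peptide pair scores the sum of its per-position scores.
    """
    peptides = ["".join(p) for p in product(ALPHABET, repeat=k)]
    dk = {}
    for i, p in enumerate(peptides):
        for j, q in enumerate(peptides):
            s = sum(score(a, b) for a, b in zip(p, q))
            if s > threshold:
                dk[(i, j)] = s
    return peptides, dk


def map_vector(dk, x):
    """Sparse matrix-vector product D_k x."""
    y = [0.0] * len(x)
    for (i, j), s in dk.items():
        y[i] += s * x[j]
    return y


def rbf_kernel(u, v, gamma=0.1):
    """Radial basis function kernel on two mapped vectors."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))


peptides1, d1 = build_dk(1)
peptides2, d2 = build_dk(2)
mapped = map_vector(d1, [1.0, 0.0, 0.0])  # composition vector of a pure-A sequence
```

Storing only the entries above the threshold is what keeps the k = 3 case tractable: the dictionary grows with the number of retained pairs, not with the full size of the matrix.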
As the sizes of the positive and negative sets are substantially different, the F-score, which combines precision and recall, was used to evaluate the performance. The precision, recall, and F-score were computed for each fold of a 5-fold cross-validation, and the final results were reported as the averages over the 5 folds. The computational results showed that the new kernel-based methods compared favorably with the conventional k-peptide composition methods. The new method yielded its best performance, an F-score of 0.87, when k = 2. This value is about 0.10 to 0.13 higher than the 0.742 to 0.769 obtained with the k-peptide methods. It is worth noting that the new method performed comparably to the newly released PSORTb ver. 2.0, which has an overall F-score of 0.89. While PSORTb comprises several modules designed for the prediction of specific localization sites, it is surprising that our single module demonstrated competitive ability. This work has demonstrated the superior performance of the new kernel over the conventional k-peptide coding methods in the prediction of protein subcellular localization. More investigation is needed to optimize the threshold. Furthermore, this general sequence encoding method can be used to tackle other biological prediction problems.
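The evaluation described above can be sketched as follows, assuming the F-score is the balanced harmonic mean of precision and recall (F1); the exact variant is not stated here. The per-fold confusion counts are made-up numbers for illustration, and the function names are hypothetical.

```python
def precision_recall_f(tp, fp, fn):
    """Precision, recall, and their harmonic mean for one fold."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


def average_over_folds(fold_counts):
    """Average each metric over folds; fold_counts holds (tp, fp, fn) tuples."""
    results = [precision_recall_f(*counts) for counts in fold_counts]
    n = len(results)
    return tuple(sum(r[i] for r in results) / n for i in range(3))


# Made-up (tp, fp, fn) counts for five folds, for illustration only.
folds = [(8, 2, 2), (9, 1, 3), (7, 3, 1), (8, 2, 2), (9, 1, 1)]
mean_precision, mean_recall, mean_f = average_over_folds(folds)
```

Unlike accuracy, none of these three metrics counts true negatives, which is why they remain informative when the negative set is much larger than the positive set.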