Exploiting The Sequence And Evolutionary Information For The Identification Of Virulence Factors
DOI:
https://doi.org/10.48165/
Keywords:
Deep learning, gradient boosting machines, random forest, variable importance, virulence factors
Abstract
The pathogenicity of a bacterium depends on its virulence factors (VFs), and predicting VFs helps in understanding the pathogenesis of infection. Such predictions can be used to identify possible points of interference for disease treatment or vaccination. In the present work, a machine learning-based prediction model for VFs was developed using two types of feature extraction, classical sequence-based features and evolutionary information-based features, to represent VF and non-VF protein sequences. To split the data into training and testing sets, a uniform sampling approach (the Kennard-Stone algorithm) was used to create diversified and representative sets. For accurate prediction of VFs, machine learning algorithms (random forest, RF, and gradient boosting machine, GBM) and deep learning algorithms were used, and the resulting models were further analysed using model-agnostic interpretation methods. With a single feature set, the highest accuracy of 84.5% was obtained by RF with the PSSM (position-specific scoring matrix) feature set, followed by GBM with the PSSM feature set (accuracy 83.9%). With feature fusion, the best performance was obtained by GBM, with 92.0% sensitivity, 86.6% specificity, 89.3% accuracy, 0.788 MCC and 0.953 AUC on 10-fold cross-validation. Feature importance was captured by variable importance plots (VIP) and Shapley plots. The evolutionary information-based features, PSSM and Pse-PSSM, are the two most important features for discriminating VFs from non-VFs.
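The Kennard-Stone split mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not state the distance measure or the implementation used, so this version assumes Euclidean distance on numeric feature vectors; the function name `kennard_stone` and the toy data are hypothetical and are not taken from the study.

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)


def kennard_stone(points, n_train):
    """Return indices of n_train points chosen by Kennard-Stone selection.

    Assumption: Euclidean distance; the source does not specify the metric.
    """
    n = len(points)
    if not 2 <= n_train <= n:
        raise ValueError("n_train must be between 2 and len(points)")

    # Seed the selection with the two points that are farthest apart.
    first, second = max(
        ((i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda pair: dist(points[pair[0]], points[pair[1]]),
    )
    selected = [first, second]

    # For every unselected point, track the distance to its nearest selected point.
    nearest = {
        i: min(dist(points[i], points[first]), dist(points[i], points[second]))
        for i in range(n)
        if i not in (first, second)
    }

    while len(selected) < n_train:
        # Add the candidate that lies farthest from everything selected so far.
        chosen = max(nearest, key=nearest.get)
        selected.append(chosen)
        del nearest[chosen]
        for i in nearest:
            nearest[i] = min(nearest[i], dist(points[i], points[chosen]))

    return selected


# Toy 2-D feature vectors (illustrative values only, not data from the study).
features = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (0.5, 0.5), (0.9, 1.0), (0.0, 1.0)]
train_idx = kennard_stone(features, n_train=4)
test_idx = [i for i in range(len(features)) if i not in train_idx]
```

On the toy data the selection starts from the two opposite corners and then adds the points that are least well covered, so the near-duplicates of already selected points end up in the test set. The pairwise seed search is quadratic in the number of samples, which is acceptable for a sketch but would likely need a vectorised implementation for a full protein dataset.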
References
Altschul, S. 1997. Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Research, 25(17): 3389-3402.
Chen, Z., Zhao, P., Li, F., Leier, A., Marquez-Lago, T.T., Wang, Y., Webb, G.I., Smith, A.I., Daly, R.J., Chou, K.C. and Song, J. 2018. iFeature: A Python package and web server for features extraction and selection from protein and peptide sequences. Bioinformatics, 34(14): 2499-2502.
A. Shivram et al.
Chien, J. 2019. Deep Neural Network. Elsevier EBooks, 259-320. [https://doi.org/10.1016/b978-0-12-804566-4.00019-x].
Chou, K. 2009. Pseudo amino acid composition and its applications in bioinformatics, proteomics and system biology. Current Proteomics, 6(4): 262-274.
Chou, K.C. and Shen, H.B. 2007. MemType-2L: A web server for predicting membrane proteins and their types by incorporating evolution information through Pse-PSSM. Biochemical and Biophysical Research Communications, 360(2): 339-345.
Dehzangi, A., Sharma, A., Lyons, J., Paliwal, K.K. and Sattar, A. 2015. A mixture of physicochemical and evolutionary-based feature extraction approaches for protein fold recognition. International Journal of Data Mining and Bioinformatics, 11(1): 115. [https://doi.org/10.1504/ijdmb.2015.066359].
Gromiha, M.M. 2010. Protein sequence analysis. Protein Bioinformatics, Dec., 2010: 29-62. [https://doi.org/10.1016/b978-8-1312-2297-3.50002-3].
Gupta, A., Kapil, R., Dhakan, D.B. and Sharma, V.K. 2014. MP3: A software tool for the prediction of pathogenic proteins in genomic and metagenomic data. PLoS ONE, 9(4): e93907. [https://doi.org/10.1371/journal.pone.0093907].
Jones, D.T. and Swindells, M.B. 2002. Getting the most from PSI–BLAST. Trends in Biochemical Sciences, 27(3): 161-164.
Kaur, S. and Forster, J. 2013. Virulence, genetics of. Brenner’s Encyclopedia of Genetics, 2: 287-289. [https://doi.org/10.1016/b978-0-12-374984-0.00636-7].
Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q. and Liu, T. 2017. LightGBM: A highly efficient gradient boosting decision tree. Neural Information Processing Systems, 30: 3149-3157.
Kennard, R.W. and Stone, L.A. 1969. Computer aided design of experiments. Technometrics, 11(1): 137-148.
Le, N.Q.K., Nguyen, Q.H., Chen, X., Rahardja, S. and Nguyen, B.P. 2019. Classification of adaptor proteins using recurrent neural networks and PSSM profiles. BMC Genomics, 20 (Suppl. 9): 966. [https://doi.org/10.1186/s12864-019-6335-4].
Liaw, A. and Wiener, M. 2002. Classification and regression by randomForest. R News, 2(3): 18-22.
Mohammadi, A., Zahiri, J., Mohammadi, S., Khodarahmi, M. and Arab, S.S. 2022. PSSMCOOL: A comprehensive R package for generating evolutionary-based descriptors of protein sequences from PSSM profiles. Biology Methods and Protocols, 7(1): 18-22.
Nath, A. and Subbiah, K. 2015. Maximizing lipocalin prediction through balanced and diversified training set and decision fusion. Computational Biology and Chemistry, 59: 101-110.
Sachdeva, G., Kumar, K., Jain, P. and Ramachandran, S. 2004. SPAAN: A software program for prediction of adhesins and adhesin-like proteins using neural networks. Bioinformatics, 21(4): 483-491.
Sharma, A.K., Dhasmana, N., Dubey, N., Kumar, N., Gangwal, A., Gupta, M. and Singh, Y. 2016. Bacterial virulence factors: Secreted for survival. Indian Journal of Microbiology, 57(1): 1-10.
Tanwer, P., Kolora, S.R.R., Babbar, A., Saluja, D. and Chaudhry, U. 2020. Identification of potential therapeutic targets in Neisseria gonorrhoeae by an in-silico approach. Journal of Theoretical Biology, 490: 110172. [https://doi.org/10.1016/j.jtbi.2020.110172].
Xie, R., Li, J., Wang, J., Dai, W., Leier, A., Marquez-Lago, T.T., Akutsu, T., Lithgow, T., Song, J. and Zhang, Y. 2020. DeepVF: A deep learning-based hybrid framework for identifying virulence factors using the stacking strategy. Briefings in Bioinformatics, 22(3): bbaa125. [https://doi.org/10.1093/bib/bbaa125].