Exploiting The Sequence And Evolutionary Information For The Identification Of Virulence Factors

Authors

  • A Shivram Department of Biochemistry, Pt. Jawahar Lal Nehru Memorial Medical College, Raipur - 492 001, India
  • Piyusha Sharma Department of Biochemistry, Pt. Jawahar Lal Nehru Memorial Medical College, Raipur - 492 001, India
  • Abhigyan Nath Department of Biochemistry, Pt. Jawahar Lal Nehru Memorial Medical College, Raipur - 492 001, India

DOI:

https://doi.org/10.48165/

Keywords:

Deep learning, gradient boosting machines, random forest, variable importance, virulence factors

Abstract

The pathogenicity of a bacterium depends on its virulence factors (VFs), and predicting VFs helps in understanding the pathogenesis of infection. Such predictions can be used to identify possible points of interference for disease treatment or vaccination. In the present work, a machine learning-based prediction model for VFs was developed using two types of feature extraction, classical sequence-based features and evolutionary information-based features, to represent VF and non-VF protein sequences. To split the data into training and testing sets, a uniform sampling-based approach (the Kennard-Stone algorithm) was used to create diversified and representative training/testing sets. For accurate prediction of VFs, machine learning algorithms (random forest, RF, and gradient boosting machine, GBM) and deep learning algorithms were used, and the resulting models were further analysed using model-agnostic interpretation methods. The highest accuracy, 84.5%, was obtained by RF with the position-specific scoring matrix (PSSM) feature set, followed by GBM with the PSSM feature set (accuracy 83.9%). With feature fusion, the best performance was obtained by GBM, with 92.0% sensitivity, 86.6% specificity, 89.3% accuracy, an MCC of 0.788 and an AUC of 0.953 on 10-fold cross-validation. Feature importance was captured by variable importance plots (VIP) and Shapley plots. The evolutionary information-based features, i.e. PSSM and PSE-PSSM, are the two most important features for discriminating VFs from non-VFs.
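
The uniform sampling step named in the abstract can be illustrated with a minimal sketch of the classic Kennard-Stone selection (Kennard and Stone, 1969): start from the two most distant samples, then repeatedly add the sample that lies farthest from its nearest already-selected neighbour. The function name kennard_stone_split, the use of Euclidean distance, and the toy data are assumptions made for illustration; the abstract does not state the distance measure, feature space, or split ratio used in the study.

```python
import numpy as np


def kennard_stone_split(X, n_train):
    """Return (train_idx, test_idx) chosen by Kennard-Stone selection.

    Hypothetical helper for illustration only. Assumes Euclidean distance
    on the rows of X; builds a full pairwise distance matrix, so it is
    only suitable for small to moderate sample counts.
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    if not 2 <= n_train <= n:
        raise ValueError("n_train must be between 2 and the number of samples")

    # Pairwise Euclidean distances between all samples.
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))

    # Seed the training set with the two mutually farthest samples.
    i, j = np.unravel_index(np.argmax(dist), dist.shape)
    selected = [int(i), int(j)]

    # min_dist[k]: distance from sample k to its nearest selected sample.
    min_dist = np.minimum(dist[i], dist[j])
    min_dist[selected] = -1.0  # mark selected samples so they are not re-picked

    # Greedily add the sample that is farthest from the selected set.
    while len(selected) < n_train:
        k = int(np.argmax(min_dist))
        selected.append(k)
        min_dist = np.minimum(min_dist, dist[k])
        min_dist[selected] = -1.0

    chosen = set(selected)
    remaining = [k for k in range(n) if k not in chosen]
    return selected, remaining


# Toy one-dimensional example with four samples (not data from the study).
toy = np.array([[0.0], [1.0], [2.0], [10.0]])
train_idx, test_idx = kennard_stone_split(toy, n_train=3)
```

Because each new sample is chosen to maximise its distance from the samples already picked, the training set tends to cover the feature space evenly, which matches the stated aim of a diversified and representative split. The samples left over form the testing set.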

References

Altschul, S. 1997. Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Research, 25(17): 3389-3402.

Chen, Z., Zhao, P., Li, F., Leier, A., Marquez-Lago, T.T., Wang, Y., Webb, G.I., Smith, A.I., Daly, R.J., Chou, K.C. and Song, J. 2018. iFeature: A Python package and web server for features extraction and selection from protein and peptide sequences. Bioinformatics, 34(14): 2499-2502.

Chien, J. 2019. Deep Neural Network. Elsevier EBooks, 259-320. [https://doi.org/10.1016/b978-0-12-804566-4.00019-x].

Chou, K. 2009. Pseudo amino acid composition and its applications in bioinformatics, proteomics and system biology. Current Proteomics, 6(4): 262-274.

Chou, K.C. and Shen, H.B. 2007. MemType-2L: A web server for predicting membrane proteins and their types by incorporating evolution information through Pse-PSSM. Biochemical and Biophysical Research Communications, 360(2): 339-345.

Dehzangi, A., Sharma, A., Lyons, J., Paliwal, K.K. and Sattar, A. 2015. A mixture of physicochemical and evolutionary-based feature extraction approaches for protein fold recognition. International Journal of Data Mining and Bioinformatics, 11(1): 115. [https://doi.org/10.1504/ijdmb.2015.066359].

Gromiha, M.M. 2010. Protein sequence analysis. Protein Bioinformatics, Dec., 2010: 29-62. [https://doi.org/10.1016/b978-8-1312-2297-3.50002-3].

Gupta, A., Kapil, R., Dhakan, D.B. and Sharma, V.K. 2014. MP3: A software tool for the prediction of pathogenic proteins in genomic and metagenomic data. PLoS ONE, 9(4): e93907. [https://doi.org/10.1371/journal.pone.0093907].

Jones, D.T. and Swindells, M.B. 2002. Getting the most from PSI–BLAST. Trends in Biochemical Sciences, 27(3): 161-164.

Kaur, S. and Forster, J. 2013. Virulence, genetics of. Brenner’s Encyclopedia of Genetics, 2: 287-289. [https://doi.org/10.1016/b978-0-12-374984-0.00636-7].

Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q. and Liu, T. 2017. LightGBM: A highly efficient gradient boosting decision tree. Neural Information Processing Systems, 30: 3149-3157.

Kennard, R.W. and Stone, L.A. 1969. Computer aided design of experiments. Technometrics, 11(1): 137-148.

Le, N.Q.K., Nguyen, Q.H., Chen, X., Rahardja, S. and Nguyen, B.P. 2019. Classification of adaptor proteins using recurrent neural networks and PSSM profiles. BMC Genomics, 20 (Suppl. 9): 966. [https://doi.org/10.1186/s12864-019-6335-4].

Liaw, A. and Wiener, M. 2002. Classification and regression by randomForest. R news, 2(3): 18-22.

Mohammadi, A., Zahiri, J., Mohammadi, S., Khodarahmi, M. and Arab, S.S. 2022. PSSMCOOL: A comprehensive R package for generating evolutionary-based descriptors of protein sequences from PSSM profiles. Biology Methods and Protocols, 7(1): 18-22.

Nath, A. and Subbiah, K. 2015. Maximizing lipocalin prediction through balanced and diversified training set and decision fusion. Computational Biology and Chemistry, 59: 101-110.

Sachdeva, G., Kumar, K., Jain, P. and Ramachandran, S. 2004. SPAAN: A software program for prediction of adhesins and adhesin-like proteins using neural networks. Bioinformatics, 21(4): 483-491.

Sharma, A.K., Dhasmana, N., Dubey, N., Kumar, N., Gangwal, A., Gupta, M. and Singh, Y. 2016. Bacterial virulence factors: Secreted for survival. Indian Journal of Microbiology, 57(1): 1-10.

Tanwer, P., Kolora, S.R.R., Babbar, A., Saluja, D. and Chaudhry, U. 2020. Identification of potential therapeutic targets in Neisseria gonorrhoeae by an in-silico approach. Journal of Theoretical Biology, 490: 110172. [https://doi.org/10.1016/j.jtbi.2020.110172].

Xie, R., Li, J., Wang, J., Dai, W., Leier, A., Marquez-Lago, T.T., Akutsu, T., Lithgow, T., Song, J. and Zhang, Y. 2020. DeepVF: A deep learning-based hybrid framework for identifying virulence factors using the stacking strategy. Briefings in Bioinformatics, 22(3): [https://doi.org/10.1093/bib/bbaa125].

Published

2023-11-02

How to Cite

Exploiting The Sequence And Evolutionary Information For The Identification Of Virulence Factors. (2023). Applied Biological Research, 25(2), 143–150. https://doi.org/10.48165/