VirulentHunter: deep learning-based virulence factor predictor illuminates pathogenicity in diverse microbial contexts
Abstract
Virulence factors (VFs) are critical determinants of bacterial pathogenicity, but current homology-based identification methods often miss novel or divergent VFs, and many machine learning approaches neglect functional classification. Here, we present VirulentHunter, a deep learning framework that enables simultaneous VF identification and classification directly from protein sequences by fine-tuning a pretrained protein language model. We curate a comprehensive VF database by integrating diverse public resources and expanding VF category annotations. Our benchmarking results demonstrate that VirulentHunter outperforms existing methods, particularly in identifying VFs lacking detectable homologs. Additionally, strain-level analysis using VirulentHunter highlights distinct pathogenicity profiles between Mycobacterium tuberculosis and Mycobacterium avium, revealing enrichment of VFs related to adherence, effector delivery systems, and immune modulation in M. tuberculosis, compared with biofilm formation and motility in M. avium. Furthermore, metagenomic profiling of gut microbiota from inflammatory bowel disease patients reveals a depletion of VFs associated with immune homeostasis. These results underscore the versatility of VirulentHunter as a tool for VF analysis across diverse applications. To facilitate broader use, we provide a freely accessible web service for VF prediction (http://www.unimd.org/VirulentHunter) that accepts protein sequences, genomes, and metagenomic data.