Next generation genome annotation for eukaryotic pathogens and vectors, using artificial intelligence
Year of award: 2024
Grantholders
- Prof Andrew Jones - University of Liverpool, United Kingdom
- Dr Kathryn Crouch - University of Glasgow, United Kingdom
- Prof Daniel Rigden - University of Liverpool, United Kingdom
- Prof David Roos - University of Pennsylvania, United States
Project summary
High-quality gene annotation is of vital importance for meaningful interpretation of biological data. Current sequencing technologies permit production of high-quality genome assemblies for virtually any organism, but advances in automated annotation lag behind, particularly for eukaryotic genomes. While a few organisms benefit from extensive (and expensive) manual curation, most reference genomes are plagued by missing genes, incorrect protein sequence predictions and a lack of functional inference. Our team is responsible for VEuPathDB.org, supporting hundreds of parasite, fungal, host and vector genomes, and thousands of associated ‘omics datasets. Our platform and others drive discovery research, but the future of genome biology requires urgent solutions to close the “annotation gap”. We propose to develop automated solutions for: i) accurate gene finding using a novel pipeline based on AlphaFold2, protein disorder and complete domain prediction; ii) predicting new functions for genes, by mining very large, heterogeneous, structured datasets with deep learning; and iii) extracting knowledge on gene function from published literature using domain-specific Large Language Models. These developments are essential for the next phase of post-genomic science, which requires the interpretation of thousands of genomes to allow humanity to address the emerging health and food security challenges of the 21st century.