Next generation genome annotation for eukaryotic pathogens and vectors, using artificial intelligence
Year of award: 2024
Grantholders
Prof Andrew Jones
University of Liverpool, United Kingdom
Dr Kathryn Crouch
University of Glasgow, United Kingdom
Prof Daniel Rigden
University of Liverpool, United Kingdom
Prof David Roos
University of Pennsylvania, United States
Project summary
High-quality gene annotation is of vital importance for meaningful interpretation of biological data. Current sequencing technologies permit production of high-quality genome assemblies for virtually any organism, but advances in automated annotation lag behind, particularly for eukaryotic genomes. While a few organisms benefit from extensive (and expensive) manual curation, most reference genomes are plagued by missing genes, incorrect protein sequence predictions and a lack of functional inference.
Our team is responsible for VEuPathDB.org, supporting hundreds of parasite, fungal, host and vector genomes, and thousands of associated ’omics datasets. Our platform and others drive discovery research, but the future of genome biology requires urgent solutions to close the “annotation gap”.
We propose to develop automated solutions for: i) accurate gene finding using a novel pipeline based on AlphaFold2, protein disorder and complete domain prediction; ii) predicting new functions for genes by mining very large, heterogeneous, structured datasets with deep learning; and iii) extracting knowledge on gene function from published literature using domain-specific Large Language Models. These developments are essential for the next phase of post-genomic science, which requires the interpretation of thousands of genomes to allow humanity to address the emerging health and food security challenges of the 21st century.