Next generation genome annotation for eukaryotic pathogens and vectors, using artificial intelligence

Year of award: 2024

Grantholders

  • Prof Andrew Jones

    University of Liverpool, United Kingdom

  • Dr Kathryn Crouch

    University of Glasgow, United Kingdom

  • Prof Daniel Rigden

    University of Liverpool, United Kingdom

  • Prof David Roos

    University of Pennsylvania, United States

Project summary

High-quality gene annotation is of vital importance for meaningful interpretation of biological data. Current sequencing technologies permit production of high-quality genome assemblies for virtually any organism, but advances in automated annotation lag, particularly for eukaryotic genomes. While a few organisms benefit from extensive (and expensive) manual curation, most reference genomes are plagued by missing genes, incorrect protein sequence predictions and lack of functional inference. Our team is responsible for VEuPathDB.org, supporting hundreds of parasite, fungal, host and vector genomes, and thousands of associated ‘omics datasets. Our platform and others drive discovery research, but the future of genome biology requires urgent solutions to close the “annotation gap”. We propose to develop automated solutions for: i) accurate gene finding using a novel pipeline based on AlphaFold2, protein disorder and complete domain prediction; ii) predicting new functions for genes, by mining very large heterogeneous, structured data with deep learning; and iii) extracting knowledge on gene function from published literature using domain-specific Large Language Models. These developments are essential for the next phase of post-genomic science, which requires the interpretation of 1000s of genomes to allow humanity to address the emerging health and food security challenges of the 21st century.