Extending large language AI models to predict virus-host interactions
Supervisors:
Prof David Robertson, School of Infection & Immunity
Prof Craig Macdonald, School of Computing Science
Dr Ke Yuan, School of Computing Science & School of Cancer Sciences
Prof Alfredo Castello, School of Infection & Immunity
Summary:
Viruses interact with specific host molecules and functions to infect cells, generate new virus particles and sustain onward transmission. The interplay between these host-dependency factors, the antiviral host response and viral evasion strategies determines the outcome of infection. However, while virus genomes are very well covered by sequence data, experimental virus-host protein-protein interaction datasets remain critically scarce.

The aim of this project is to develop novel machine learning methods, drawing on natural language processing, to tackle this important problem. It will extend recent advances in deep learning transformer models, such as AlphaFold and the ESM family, already applied successfully to protein structure prediction, to model and predict biomolecular interactions. By analogy with language models for human language, these models can capture relationships both among words (amino acids in a protein sequence) and among sentences (protein structures). The innovation is akin to comparing blocks of text to one another, but here a protein language model is retrained to estimate the probability that two proteins interact.

The focus will be on RNA viruses of importance to human disease and/or with spillover potential, with assessment of opportunities for therapeutic intervention.
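To illustrate the idea, the sketch below shows the shape of such a predictor: each protein is mapped to a fixed-length embedding, the virus and host embeddings are combined, and a logistic head outputs an interaction probability. This is a minimal toy, not the project's method: the simple amino-acid composition vector stands in for a learned protein language model embedding (e.g. from an ESM model), and the weights here are random placeholders rather than parameters retrained on interaction data.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def embed(seq):
    # Toy stand-in for a protein language model embedding:
    # the 20-dim amino-acid composition of the sequence.
    # A real model would produce a learned, context-aware vector.
    v = np.zeros(len(AMINO_ACIDS))
    for aa in seq:
        i = AMINO_ACIDS.find(aa)
        if i >= 0:
            v[i] += 1.0
    return v / max(len(seq), 1)

def interaction_probability(virus_seq, host_seq, w, b):
    # Concatenate the two protein embeddings and apply a logistic
    # head -- the "retrained to estimate the probability of
    # interaction" step, with untrained placeholder weights here.
    x = np.concatenate([embed(virus_seq), embed(host_seq)])
    z = float(x @ w + b)
    return 1.0 / (1.0 + np.exp(-z))

# Placeholder parameters; in practice these would be learned from
# experimental protein-protein interaction datasets.
rng = np.random.default_rng(0)
w = rng.normal(size=2 * len(AMINO_ACIDS))
b = 0.0

p = interaction_probability("MKTAYIAK", "MVLSPADK", w, b)
print(f"predicted interaction probability: {p:.3f}")
```

In the full project, the embedding step would come from a pretrained transformer fine-tuned on known virus-host interaction pairs, so that the score reflects learned sequence and structural context rather than raw composition.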