Scientists have developed a new machine learning method that can accurately predict which animal viruses could go on to infect humans in the future, using only information encoded in the viral genome.

Most emerging infectious diseases of humans are caused by ‘zoonotic’ viruses that originate from other animal species. However, of the many millions of viruses that circulate in animals, only a few are likely to be able to infect humans. Scientists currently have very limited ability to rapidly assess zoonotic risk at the time viruses are discovered, making it difficult to know which newly discovered viruses should be prioritised for early investigation and, beyond that, for outbreak preparedness.

Now, in a new study led by the University of Glasgow and published in PLOS Biology, researchers have developed a new machine learning method for predicting which viruses could infect humans, based entirely on a virus’s genome sequence – often the only thing scientifically known about newly found or poorly-characterised animal viruses.

The new modelling method can accurately predict which viruses may infect humans using only a virus’s genome sequence, and ranks each virus as low, medium, high, or very high risk. Without any prior knowledge of the previous SARS outbreak in humans, the same model also accurately predicted that SARS-CoV-2, the virus that caused the COVID-19 pandemic, and its closest viral relatives found in animals had a high risk of being able to infect humans. This finding, together with more formal testing on hundreds of viruses with known zoonotic status, showed that the model makes actionable predictions on a diverse range of RNA and DNA viruses, even those that are entirely new to science.
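
As a rough illustration of what "using only the genome sequence" can mean in practice, the sketch below summarises a sequence as dinucleotide frequencies and maps a model score onto the four risk bands named above. Everything in it is an assumption made for illustration: this article does not say which genome features the study uses or how the bands are assigned, the function names are hypothetical, the cut-off values are invented, and no trained model is included.

```python
from collections import Counter
from itertools import product


def dinucleotide_frequencies(genome: str) -> dict[str, float]:
    """Relative frequency of each of the 16 dinucleotides in a sequence.

    Illustrative feature choice only; not taken from the study.
    """
    seq = genome.upper().replace("U", "T")  # treat RNA and DNA alike
    pairs = (seq[i:i + 2] for i in range(len(seq) - 1))
    # Skip pairs containing ambiguous bases such as N.
    counts = Counter(p for p in pairs if set(p) <= set("ACGT"))
    total = sum(counts.values())
    return {
        "".join(k): (counts["".join(k)] / total if total else 0.0)
        for k in product("ACGT", repeat=2)
    }


# Hypothetical cut-offs on a 0-1 score; the article gives no thresholds.
RISK_BANDS = [(0.25, "low"), (0.50, "medium"), (0.75, "high")]


def risk_band(score: float) -> str:
    """Map a model score between 0 and 1 to one of four risk labels."""
    for upper, label in RISK_BANDS:
        if score < upper:
            return label
    return "very high"
```

In a sketch like this, a classifier trained on viruses with known zoonotic status would sit between the two functions, taking the frequency table as input and producing the score that `risk_band` labels.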

The new modelling method predicts whether viruses might be able to infect humans, but cannot determine how dangerous they may be in terms of either symptoms or epidemic/pandemic potential, nor when they might jump into human populations. Being able to infect humans is the first step towards causing an outbreak, but numerous other factors, such as contact between the reservoir and humans, whether the virus can transmit between humans, and our response to such ‘spillover infections’, will shape epidemic and pandemic risk.

Researchers believe this new modelling method could help scientists better prioritise research efforts on the animal viruses most likely to successfully infect humans, an important step towards future human outbreak preparedness and planning.

Lead author Nardus Mollentze, of the MRC-University of Glasgow Centre for Virus Research, said: “Calls for investment in virus discovery programmes targeting wildlife have been controversial, since it remains unclear how to go from knowing which viruses are out there to outbreak preparedness. Finding out what newly described viruses are capable of, and how to respond to that, requires extensive characterisation in both the lab and in their natural environment, and this characterisation currently cannot keep up with the number of viruses being found. When viruses are first discovered, often all we have is their genome sequence, so developing an accurate machine learning tool that is based on information contained within that should enable us to better understand which animal viruses pose the highest risk, and should therefore be characterised and investigated first.

“Such predictions are still only a first step however. If we want investments in virus discovery to translate into pandemic preparedness, there is a need to develop both higher-throughput virus characterisation methods and further models capable of turning the information generated by these methods into updated risk predictions.”

Senior author Daniel Streicker, from the MRC-University of Glasgow Centre for Virus Research and the University of Glasgow’s Institute of Biodiversity, Animal Health and Comparative Medicine, said: “Identifying high risk viruses amid the vast diversity of animal-infecting viruses that are unlikely to infect humans has been a needle in a haystack challenge. Our new genome-based zoonotic risk assessment represents a step towards solving that challenge and, along with our earlier efforts showing that the reservoir hosts and arthropod vectors of viruses can be predicted from viral genomes, shows that a surprising amount of ecological insight is possible from genome sequences alone, hinting at the existence of poorly understood ways that viruses adapt to their hosts.

“More immediately, since these models use nothing more than genetic sequences, they can be applied at the time that viruses are discovered, creating a rapid, low-cost triage system to decide which viruses merit extra attention.”

Co-author Simon Babayan, Institute of Biodiversity, Animal Health and Comparative Medicine, said: “As most emerging infectious diseases in humans are caused by a small number of viruses that originated in other animal species, it remains an enormous challenge to know where to look for the next virus epidemic. Now we provide a rapid, low-cost approach to enable evidence-driven virus surveillance and characterisation of viruses that could specifically infect humans, and may therefore better help with future epidemic and pandemic preparedness.”

The paper, ‘Identifying and prioritizing potential human-infecting viruses from their genome sequences’, is published in PLOS Biology. The work was funded by the Medical Research Council (MRC) and Wellcome.


Enquiries: ali.howard@glasgow.ac.uk or elizabeth.mcmeekin@glasgow.ac.uk / 0141 330 6557 or 0141 330 4831

First published: 28 September 2021