Coronavirus Pathogenicity Clues Uncovered Using Machine-Learning Approach
NEW YORK – A team from the National Library of Medicine, Broad Institute, and Massachusetts Institute of Technology has started tallying the genetic features that distinguish pathogenic coronaviruses, particularly the SARS-CoV-2 virus behind the ongoing COVID-19 pandemic and MERS-CoV, which causes Middle East respiratory syndrome, from less dangerous coronaviruses.
“We were able to identify several features that are not found in less virulent coronaviruses and that could be relevant for pathogenicity in humans. The actual demonstration of the relevance of these findings will come from direct experiments that are currently getting under way,” senior author Eugene Koonin, a biotechnology information researcher at the National Library of Medicine, said in a statement.
For a paper published in the Proceedings of the National Academy of Sciences on Wednesday, the researchers relied on comparative genomics, phylogenetic analyses, and support vector machine-based learning to narrow in on suspicious features shared by the SARS-CoV-2 and MERS-CoV coronaviruses, which they classified as “high case fatality rate” (high-CFR) coronaviruses. They noted that the machine-learning strategy they selected made it possible to pick up differences between these high-CFR viruses and “low-CFR” human coronaviruses that might be missed with genome alignment-based comparisons alone.
“[W]e trained multiple support vector machines across a sliding window to detect regions that confer clean separation between high- and low-CFR virus genomes,” the authors explained. “We evaluated the performance of each [support vector machine] via cross-validation and filtered for genomic regions that significantly distinguish the high- and low-CFR genomes.”
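The sliding-window idea the authors describe can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the toy aligned sequences, the window width, the one-hot encoding, the simplified Pegasos-style linear SVM trainer (with samples visited in a fixed order and no bias term), the leave-one-out cross-validation, and the accuracy cutoff standing in for the paper's significance filter are all assumptions made for illustration.

```python
ALPHABET = "ACGT"

# Hypothetical toy alignment: positions 4-7 differ by class, the flanks do not.
FLANKS = [("ACGT", "TGCA"), ("CAGT", "GTCA"), ("ACTG", "TGAC"), ("GCAT", "CTGA")]
GENOMES = [left + "AAAA" + right for left, right in FLANKS]   # "high-CFR" toys
GENOMES += [left + "CCCC" + right for left, right in FLANKS]  # "low-CFR" toys
LABELS = [1] * 4 + [-1] * 4


def one_hot(segment):
    """Encode a nucleotide string as a flat 0/1 vector, four slots per base."""
    return [1.0 if base == letter else 0.0
            for base in segment for letter in ALPHABET]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def train_linear_svm(X, y, lam=0.1, epochs=50):
    """Pegasos-style subgradient descent on the hinge loss (no bias term)."""
    w = [0.0] * len(X[0])
    t = 0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            t += 1
            eta = 1.0 / (lam * t)          # decaying step size
            margin = yi * dot(w, xi)       # computed before the update
            w = [(1.0 - eta * lam) * wj for wj in w]
            if margin < 1.0:               # sample violates the margin
                w = [wj + eta * yi * xj for wj, xj in zip(w, xi)]
    return w


def loo_accuracy(X, y):
    """Leave-one-out cross-validation accuracy for one window."""
    correct = 0
    for i in range(len(X)):
        w = train_linear_svm(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        prediction = 1 if dot(w, X[i]) > 0 else -1
        correct += prediction == y[i]
    return correct / len(X)


def window_scores(genomes, labels, width=4, step=1):
    """Map each window start position to its cross-validated accuracy."""
    scores = {}
    for start in range(0, len(genomes[0]) - width + 1, step):
        X = [one_hot(g[start:start + width]) for g in genomes]
        scores[start] = loo_accuracy(X, labels)
    return scores


# Simple accuracy cutoff as a stand-in for a significance filter (assumption).
for start, accuracy in window_scores(GENOMES, LABELS).items():
    flag = "candidate region" if accuracy >= 0.95 else ""
    print(f"window {start}-{start + 3}: accuracy {accuracy:.2f} {flag}")
```

In this toy setup, the window covering positions 4-7 separates the two classes cleanly under cross-validation, which is the kind of signal such a scan is meant to surface. A real analysis would work on aligned genomes, use a tuned SVM implementation, and apply a proper statistical test rather than a fixed cutoff.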
Based on analyses of more than 900 available coronavirus genomes, the team uncovered 11 seemingly distinct sites in the high-CFR SARS-CoV-2 and MERS-CoV genomes, including sequences coding for the nucleocapsid protein and the spike glycoprotein that interacts with host cell receptors.
When they took a closer look at these changes, the researchers saw signs that the high-CFR viruses produce a version of the nucleocapsid protein with an enhanced nuclear localization signal (NLS), while the spike proteins of the potentially deadly SARS and MERS coronaviruses contain insertions not found in milder, low-CFR coronaviruses.
“The enhancement of the NLS in the high-CFR coronaviruses nucleocapsids implies an important role of the sub-cellular localization of the nucleocapsid protein in coronavirus pathogenicity,” the authors suggested, adding that “insertions in the spike protein appear to have been acquired independently by the SARS and MERS clades of the high-CFR coronaviruses, in both the domain involved in virus-cell fusion and the domain mediating receptor recognition.”
While functional studies are needed to dig into the potential connections identified in their new analysis, the authors suggested that the features found so far “could be crucial contributors to coronavirus pathogenicity and possible targets for diagnostics, prognostication, and interventions.”
“These features correlate with the high fatality rate of these coronaviruses as well as their ability to switch hosts from animals to humans,” Koonin and co-authors explained. “The identified features could represent crucial elements of coronavirus virulence and allow for detecting animal coronaviruses that have the potential to make the jump to humans in the future.”