Python for Bioinformatics: SARS-CoV-2 pathogenicity

A popular narrative about the 1918 H1N1 influenza pandemic is that the virus caused a milder illness in the spring of 1918 but a much more severe one in the fall, with a W-shaped age-mortality curve that reflected unusually high death rates among young adults. People worry about viruses mutating to become better killers.

It is important to note that there is no molecular evidence to support this hypothesis about flu, partly because the spring events certainly included some cases of the then-standard flu. Some people disagree. You can read about it here.

Now (April 2020), people are concerned that the novel Coronavirus (SARS-CoV-2) might mutate into a more pathogenic strain, although both from experience and on general principles it is much more likely that with time its virulence will decrease. There are certainly exceptions that have maintained virulence (smallpox).

Samples of SARS-CoV-2 from around the world have by now accumulated thousands of different mutations, and these different phylotypes can be assembled into trees. As someone who has studied phylogenetics I find it particularly cool that time can inform the construction of the tree.

Those viruses of Wuhan -> Europe origin have particular mutations contributed by founding lineages, but as far as I know, at this time there is no evidence to support any physiological differences for any branches in the virus phylogenetic tree. There is no basis for talking about different virus strains.

There is no evidence for mutation of SARS-CoV-2 to increased pathogenicity, at least as yet.

Then there is this one paper, posted on medRxiv the other day (4-14-2020): [pdf].

It got a write-up in the South China Morning Post.

"Professor Li Lanjuan and her colleagues from Zhejiang University found within a small pool of patients many mutations not previously reported. These mutations included changes so rare that scientists had never considered they might occur."

It's getting breathless play on Fox as of two days ago, although they are fundamentally confused about just what was reported.

Preprint servers like medRxiv are useful, but none of the work is peer-reviewed. In this case, with a hot topic and a not-so-hot paper, a preprint can contribute to confusion and misunderstanding.

So let's take a look. The title is:

Sars-CoV-2 has acquired mutations capable of substantially changing its pathogenicity

Samples were taken from 11 PCR+ patients near Hangzhou. Sample dates were 1-22 to 2-4.

Filtered samples (swabs, saliva, or stool) grew SARS-CoV-2 on Vero cells after 4-5 days (Vero cells are African green monkey kidney epithelial cells, which grow very well). Virus particles were collected by centrifugation and then filtered. Each pool originates from a different patient sample.

Pools (one per patient) were sequenced (NovaSeq 6000). In total, 31 mutations were identified relative to the type strain (ref seq). There is a lot of discussion of the mutations observed, but without evidence of a physiological effect there is not much point in discussing them further.

One critical point is that the virus pools were not pure samples; they were mixtures. Because of high mutation rates, any culture of bacteria, or of viruses that infect bacterial or eukaryotic cells, is in general a genetic mixture. This means that the sequences cannot be reliably tied to the effects of the virus pools. For a bacterium or a bacterial virus we would isolate single colonies or virus plaques before beginning any experiment. For a eukaryotic virus a different technique would be used, but it is not difficult. The only reason it wasn't done here is that it takes time.

The pools were then used to infect Vero cells. The scientists measured virus growth dynamics by qRT-PCR.

Let's just take a look at Fig 3A:

Ct refers to the PCR cycle during which specific product is first detected (threshold cycle). Lower Ct means higher initial concentration, and assuming nearly 100% efficiency for the amplification cycle, a difference of 1 cycle is a factor of nearly 2 in virus titer in the original inoculum.

There are 11 samples analyzed in duplicate, with a line drawn through the successive means of the two replicates at each time point.

The difference in Ct between the highest and lowest samples at early times of infection is about 5 (i.e. 2^5 or 32-fold), while at late times it is as much as 8 (2^8 or 256-fold).
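The arithmetic is easy to check in Python. This is only a minimal sketch: the function name fold_difference and its efficiency parameter are my own (not from the paper), and it assumes each PCR cycle multiplies the product by (1 + efficiency), so that perfect efficiency means doubling.

```python
def fold_difference(delta_ct, efficiency=1.0):
    """Fold difference in template implied by a Ct difference.

    Assumes each cycle multiplies the product by (1 + efficiency);
    efficiency=1.0 is the idealized case of perfect doubling.
    """
    return (1 + efficiency) ** delta_ct

# Ct spreads read off the figure by eye (about 5 early, about 8 late)
for delta_ct in (5, 8):
    print(delta_ct, fold_difference(delta_ct))

# With less-than-perfect amplification the implied fold difference shrinks
print(fold_difference(5, efficiency=0.9))
```

With perfect doubling this reproduces the 32-fold and 256-fold figures. Any real efficiency below 100% makes the implied differences somewhat smaller, which is why the text says "nearly 2" per cycle.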

My first thought is that, eyeballing the data, there appears to be a correlation between the early Ct and the 24-48 hr Ct: pools with greater titers at the end had higher input at the beginning. In other words, although they probably tried to infect with equal amounts of virus (they do not say), they did not achieve this, and samples that got less virus at the beginning had less virus at the end.
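Eyeballing is not a test, so here is a sketch of how one could check for that correlation if the underlying numbers were available. I do not have the raw Ct values, so the numbers below are made-up placeholders, not data from the paper; the helper names replicate_means and pearson_r are mine as well.

```python
from math import sqrt

def replicate_means(duplicates):
    """Average each (replicate 1, replicate 2) pair of Ct readings."""
    return [(a + b) / 2 for a, b in duplicates]

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# PLACEHOLDER Ct values for four imaginary pools, in duplicate.
# These are NOT the values from Fig 3A.
early = [(30.1, 30.3), (28.0, 28.4), (32.5, 32.1), (27.2, 27.0)]
late = [(18.0, 18.4), (15.1, 15.5), (22.9, 23.3), (14.0, 14.2)]

r = pearson_r(replicate_means(early), replicate_means(late))
print(f"Pearson r (early Ct vs late Ct) = {r:.2f}")
```

An r close to 1 for the real data would support the idea that the final differences mostly reflect unequal input rather than differences in how well the viruses grow.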

Secondly, it appears that 3 pools produce less progeny virus than the rest. A skeptic could argue that rather than identify a super-pathogenic strain, what they've done is isolate several mutants that are attenuated.

There is nothing about how the "wild type" or parental virus would perform in this assay.

One isolate might be exceptional (Z11). This is the only datum that supports the title.

The paper has a lot more speculation about the mutations observed and, for those that change proteins, discussion of where they map on those proteins. The "changes so rare" (from the Post write-up) refers to one mutation that changed 3 adjacent nucleotides, which is actually not so rare.

Conclusion: the hype is not supported. This is not a good paper and the defects are not remediable. I would reject it.

And we should absolutely, definitely, clearly not be using such a paper, even if it should survive peer review, to inform the thinking public about whether there exist distinct SARS-CoV-2 strains with different pathogenicities, and especially not strains with increased pathogenicity. That's just scaremongering.

Thursday, April 23, 2020
