Python for Bioinformatics: April 2020

Thursday, April 30, 2020

South Carolina Covid-19 cases (04-30)

Here is Covid data from Johns Hopkins over the last week plus one day, for counties in the state of South Carolina. These are confirmed cases. The two columns at the right are the values for the last two days minus the caseload one week earlier, that is, the new cases over the preceding 7 days.

Studies indicate that you're infectious for about 7 days, so that's the rationale. Of course you will still need care for a couple of weeks.
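The week-over-week subtraction can be sketched in a few lines of Python. This is only an illustration, not the code that produced the table: `cases_in_window` is a hypothetical helper name, and the counts are made-up placeholder values, not Johns Hopkins data.

```python
def cases_in_window(cumulative, window=7):
    """Return new cases over the trailing `window` days for each day.

    `cumulative` is a list of cumulative confirmed cases, one per day.
    Days without a full window behind them are skipped.
    """
    return [cumulative[i] - cumulative[i - window]
            for i in range(window, len(cumulative))]


# Placeholder cumulative counts for 9 days (not real data).
counts = [10, 12, 15, 15, 20, 26, 30, 31, 35]
print(cases_in_window(counts))  # [21, 23]
```

With nine days of cumulative counts and a 7-day window, only the last two days have a full week behind them, which is why there are two values.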

We live in Charleston County. Total cases are 441 (that's about 1.1 per 1000).

We're doing much better than Greenville. According to the U.S. Census, Greenville is about 25% larger in population. Both are trending down.

The county with the most cases is Richland. The county seat of Richland is Columbia, which is also the state capital.

Wednesday, April 29, 2020

Phylodynamics

I was chatting over Google Duo with family last weekend and the topic came up of SARS-CoV-2 possibly circulating in the U.S. in late 2019.

Many people report strangely nasty cases of flu. And of course, Rush Limbaugh claimed that all of California had been infected in November, to support the idea that "herd immunity" already exists, and the country just needs to "open up."

I said that the genetic data show that this is impossible.

One person (an M.D.) said: "maybe the genetics is wrong." I was just dumbfounded. The evidence is overwhelming. So I started today on trying to explain why that's so.

There is a guy named Trevor Bedford; his latest blog post is here. I will just summarize what he said and add a few tidbits of my own.

In phylogenetics, we can make trees of related organisms based on various characteristics, often the nucleotide sequence of the organism, or selected genes (when comparing species, protein sequences are even better). These trees show lines of descent from a common ancestor.

We can use the differences between aligned sequences to infer phylogenetic trees. Here's an example.

We don't usually have the ability to sequence ancestors, but in a rapidly mutating virus (or other microbial) population, data can be obtained in a time series. Currently, there are several thousand SARS-CoV-2 sequences along with the dates on which the samples were taken from patients.

These can be assembled into phylogenetic trees, like this one from NextStrain.

This is a screenshot of the whole tree as of today.

Here's a link to a phylogeny restricted to the dates 2019-12-19 to 2020-02-06.

Now, I've been out of the game for a while, so I have not actually sat down and studied their methods carefully. I know that holy wars have been fought over "which is the best" method for building trees. Nevertheless, this is a phylogenetic tree which arranges the branches from left to right in inferred order of descent from a common ancestor.

In this particular tree, the branch lengths are based on date of sample. The sequences are colored by geographic origin. So for example, purple is China and red is North America.

Let's just focus on one or two aspects for now.

The red dot identified in the graphic above has the unwieldy official name 2019/nCoV/USA-WA1/2020.

It is a sequence determined for a sample taken from a guy who lives in Snohomish County, Washington. He went to visit his family in Wuhan in January, returned to the US, and developed symptoms of Covid-19. He was admitted to hospital and test results came back the next day (Jan 20) as Covid-positive.

He is the first individual known to be Covid-positive in the United States. Let's take a look at the sequence. I got the genome for the wild type virus and his virus from the links below.

To compare two sequences, I just used a tiny bit of Python (I still need to search for a web tool that does this).

ref v. WA1
C 8782 T
C 18060 T
T 28144 C

[Update: I messed up the first time. Since I did it by hand, the sequences had newlines, which screwed up the numbers.]
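For the record, a comparison like this can be sketched as follows. This is a minimal version, not necessarily the script I ran: the function names are made up for illustration, the sequences are toy strings rather than the real genomes, and it assumes the two sequences line up position for position (no insertions or deletions, same starting point). Real genome records can differ in length at the ends, in which case they need to be aligned first. Note that the FASTA reader strips newlines before any counting is done.

```python
def read_fasta_sequence(text):
    """Return the sequence from single-record FASTA text as one string.

    Header lines (starting with '>') are dropped and newlines are removed,
    so that string indexes correspond to sequence positions.
    """
    lines = text.strip().splitlines()
    return "".join(line.strip() for line in lines if not line.startswith(">"))


def point_differences(ref, other):
    """List (ref_base, position, other_base) tuples, positions 1-based.

    Assumes the sequences are already aligned position for position;
    only the overlapping length is compared.
    """
    return [(a, i, b)
            for i, (a, b) in enumerate(zip(ref, other), start=1)
            if a != b]


# Toy sequences, not the real genomes.
ref = read_fasta_sequence(">ref\nACGTAC\nGTTA\n")
alt = read_fasta_sequence(">alt\nACCTAC\nGTCA\n")
for a, pos, b in point_differences(ref, alt):
    print(a, pos, b)
```

The output has the same three-column layout as the listing above: reference base, position, variant base.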

[Update: according to the paper (pdf), all three of these mutations have been found in virus sequences from patients in China.]

So, relative to the "wild type" version of the genome, the WA1 virus carries 3 mutations. These are unlikely to mean anything. (I need to check whether they change the sequences of the proteins encoded by the relevant genes, but it is still nearly certain that they make no physiological difference).

As Trevor writes on the blog, the second positive in Washington occurred in February and the sequence was available on Feb 29. It is called 2019/nCoV/USA-WA2/2020.

C8782T and T28144C (found in 74/224 sampled viruses from China)

C18060T (found in 6/224 sampled viruses from China)

So C18060T is quite rare. Yet it is also found in the second virus sample from WA.

ref v. WA2
C 5784 T
C 8782 T
C 17747 T
A 17858 G
C 18060 T
T 28144 C

WA1 v. WA2
C 5784 T
C 17747 T
A 17858 G
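The relationship between the three listings is just set arithmetic, which can be checked directly. The mutation lists below are copied from the listings above; the variable names are only for illustration.

```python
# (reference base, position, variant base), as listed above.
ref_vs_wa1 = {("C", 8782, "T"), ("C", 18060, "T"), ("T", 28144, "C")}
ref_vs_wa2 = {
    ("C", 5784, "T"), ("C", 8782, "T"), ("C", 17747, "T"),
    ("A", 17858, "G"), ("C", 18060, "T"), ("T", 28144, "C"),
}

# Mutations carried by both samples.
shared = ref_vs_wa1 & ref_vs_wa2
# Mutations present in WA2 but not in WA1.
new_in_wa2 = ref_vs_wa2 - ref_vs_wa1

# Every WA1 mutation is also present in WA2.
print(ref_vs_wa1 <= ref_vs_wa2)  # True
for ref_base, pos, alt_base in sorted(new_in_wa2, key=lambda m: m[1]):
    print(ref_base, pos, alt_base)
```

WA2 carries all three WA1 mutations plus three more, which is the pattern expected if WA2 descends from WA1 or from a very close relative.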

What you are looking at is a transmission chain for the virus. Somehow the Snohomish patient passed the virus to someone else, who passed it to someone else, and so on, until a high school student showed up to be treated and have his virus sequenced 3 weeks later, with no travel history and no known contact. The two patients live 15 miles apart.

They worked hard on Snohomish, tracked down the limo driver and his office mates and so on, and none of them got sick or ever tested positive.

Could this happen by chance? Theoretically, there could be a different traveller who introduced the same virus, but "only 2/59 sequenced viruses from China possess this variant." Very unlikely.

On the other hand, early samples from California indicate independent introductions from China. There doesn't appear to be a CA1 either on NextStrain or NCBI, but there is a CA2.

ref v. CA2
C 17000 T
G 26144 T

The CA2 sample doesn't have the same mutations as WA1. It is not a direct descendant but a cousin.

So there are several basic arguments I can make about the possibility of SARS-CoV-2 circulating earlier than mid-January.

First, the phylogenetics show that the virus emerged in Wuhan, China, late in November or early in December, 2019. All of those virus sequences are very closely related. If the virus had been in humans for more than a couple months the diversity of sequences would be much, much greater.

If the virus had been in California in November and December, the diversity of sequences would also be much greater.

Second, most later cases bear characteristic mutations that indicate a transmission chain back to an introduction in January or February. Tens of thousands of people traveled from Wuhan to other places then, and some of them came to the US. You will recall the travel ban on China announced at the end of January. It was too late.

Here is a screenshot from the NextStrain site.

It looks like Snohomish is headed for Nebraska, but that's just because they send the arrows to the labels on geographic locations.

We may not have samples from all the lineages, but the overall pattern is quite clear, and convincing.

wild type RefSeq: NC_045512.2 Wuhan-Hu-1
WA1: GenBank accession MN985325
WA2: GenBank accession MT152824
CA2: GenBank accession MN994468.1

[Update: here is an excellent Twitter thread on origins of SARS-CoV-2.]

Covid-19 deaths by county

One aspect of the dilemma about "opening up" the country is that many places don't see any problems except on TV. The Johns Hopkins database gives me 3238 “counties” in the US. (Some are not real counties but “unassigned”, etc.).

counties -> deaths
1829 -> 0
430 -> 1
191 -> 2
109 -> 3
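A tally like the one above can be made with `collections.Counter`. This is a sketch, not the code in the repo: `tally_by_deaths` is a hypothetical name and the input list is placeholder data, one entry per county.

```python
from collections import Counter


def tally_by_deaths(deaths_per_county):
    """Map each death count to the number of counties reporting it."""
    return Counter(deaths_per_county)


# Placeholder values, one per county (not real data).
deaths = [0, 0, 0, 1, 0, 2, 1, 0, 3, 17]
tally = tally_by_deaths(deaths)

# Print the four smallest death counts in the same "counties -> deaths" layout.
for n_deaths in sorted(tally)[:4]:
    print(tally[n_deaths], "->", n_deaths)

zero_fraction = tally[0] / float(len(deaths))
print(zero_fraction)  # 0.5
```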

56% have zero deaths currently. At the top end are the usual suspects:

New York City 17515
Wayne 1622
Nassau 1620
Cook 1347
Suffolk 1102
Essex 1028
Westchester 962
Bergen 960
Los Angeles 944

Here is a histogram of the number of counties with the given number of deaths, excluding the bins for values >50 or <3.

The vast majority of counties in the US have experienced almost no deaths from Covid-19.

Here's another view of the data. Each county's value was put into a list. The list was then sorted, and plotted. I cut off the top end of the distribution to look more closely at the low numbers.

As I said, about 1800 counties have 0 deaths and something like 90% have fewer than 20.
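The sort-and-trim step, and the "fraction below a threshold" figure, can be sketched like this. The function names and the input list are placeholders for illustration, and dropping values above a cutoff is just one way to trim the top end of the distribution before plotting.

```python
def fraction_below(values, threshold):
    """Fraction of entries strictly less than `threshold`."""
    return sum(1 for v in values if v < threshold) / float(len(values))


def sorted_truncated(values, cutoff):
    """Sort ascending and drop entries above `cutoff`."""
    return [v for v in sorted(values) if v <= cutoff]


# Placeholder per-county death counts (not real data).
deaths = [0, 0, 0, 1, 0, 2, 1, 0, 17, 940]
print(fraction_below(deaths, 20))    # 0.9
print(sorted_truncated(deaths, 50))  # [0, 0, 0, 0, 0, 1, 1, 2, 17]
```

Plotting the truncated list against its index gives the kind of sorted-values view described above.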

(A surprising number of counties are simply not reporting. I've assumed those are 0).

This is all part of a project to download and play with the data collated by the Johns Hopkins CSSE folks. You can find a GitHub repo with my code here.

My version of the database is constructed from their database files by update.py. This checks the csv.source directory and, if it's not up-to-date, downloads the appropriate data files from their repository.
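The "is it up-to-date?" check can be sketched as below. This is not the actual update.py; the function names are hypothetical, and it assumes the daily report files are named MM-DD-YYYY.csv and kept under those names in csv.source. The download step itself is left out.

```python
import os
from datetime import date, timedelta


def expected_filenames(first, last):
    """Daily report names from `first` to `last` inclusive, as MM-DD-YYYY.csv."""
    n_days = (last - first).days + 1
    return [(first + timedelta(days=i)).strftime("%m-%d-%Y") + ".csv"
            for i in range(n_days)]


def missing_files(directory, first, last):
    """Expected file names that are not yet present in `directory`."""
    present = set(os.listdir(directory)) if os.path.isdir(directory) else set()
    return [name for name in expected_filenames(first, last)
            if name not in present]


# Anything printed here would still need to be downloaded.
print(missing_files("csv.source", date(2020, 4, 26), date(2020, 4, 28)))
```

If the directory does not exist yet, every expected file is reported as missing.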

It's all Python2 code, mainly because I can't be bothered to type print(s) instead of print s.

My database looks like this:

2020-03-22
2020-04-28

Autauga;Alabama;01001;US
0,0,1,4 ...
0,0,0,0 ...
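A file in this layout could be read with something like the following. This is a sketch under assumptions, not the repo's reader: it takes the first two lines to be the first and last dates, and each county to be a group of three lines (a key line, then cases, then deaths). Which numeric line is cases and which is deaths is my labeling for the example. The sample text is shortened to four days so the toy record is complete.

```python
def parse_database(text):
    """Parse the layout into (first_date, last_date, records).

    Assumed layout: two date lines, then repeating groups of three lines:
    a 'county;state;fips;country' key, a comma-separated line of cases,
    and a comma-separated line of deaths. Blank lines are ignored.
    """
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    first_date, last_date = lines[0], lines[1]
    body = lines[2:]
    records = {}
    # Step through complete three-line groups only.
    for i in range(0, len(body) - 2, 3):
        county, state, fips, country = body[i].split(";")
        records[fips] = {
            "county": county,
            "state": state,
            "country": country,
            "cases": [int(n) for n in body[i + 1].split(",")],
            "deaths": [int(n) for n in body[i + 2].split(",")],
        }
    return first_date, last_date, records


# Shortened toy sample (four days only).
sample = """2020-03-22
2020-03-25

Autauga;Alabama;01001;US
0,0,1,4
0,0,0,0
"""
first, last, db = parse_database(sample)
print(first, last, db["01001"]["cases"])
```

Keying the records by FIPS code rather than county name avoids collisions, since many states share county names.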

I also got an urge to make some analyses normalizing to population. It took some effort to find the data and fix differences with the entries in the Covid-19 database.

The county where I live is #588 with 1.7 deaths per 100,000 population. You can find the complete list here.
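The normalization and ranking amount to the following. The function names are hypothetical and the county names and numbers are placeholders, not values from the census or the Covid-19 database.

```python
def deaths_per_100k(deaths, population):
    """Deaths per 100,000 population."""
    return 100000.0 * deaths / population


def rank_counties(stats):
    """Sort counties by death rate, highest first.

    `stats` maps a county name to a (deaths, population) pair.
    Returns a list of (name, rate) pairs.
    """
    rates = {name: deaths_per_100k(d, p) for name, (d, p) in stats.items()}
    return sorted(rates.items(), key=lambda item: item[1], reverse=True)


# Placeholder counties (not real data).
stats = {"A": (5, 50000), "B": (7, 400000), "C": (0, 12000)}
for rank, (name, rate) in enumerate(rank_counties(stats), start=1):
    print(rank, name, rate)
```

The fiddly part in practice is not this arithmetic but matching county entries between the population table and the Covid-19 database.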

I only just finished re-writing the code (this is the 5th iteration), so there aren't many projects stored in the repo yet (2020-04-28).

Thursday, April 23, 2020

SARS-CoV-2 pathogenicity

A popular narrative about the 1918 H1N1 influenza pandemic is that the virus caused a milder illness in the spring of 1918, but a much more severe one in the fall, with a W-shaped fatality curve, killing young adults particularly well. People worry about viruses mutating to become better killers.

It is important to note that there is no molecular evidence to support this hypothesis about flu, partly because the spring events certainly included some cases of the then-standard flu. Some people disagree. You can read about it here.

Now (April 2020), people are concerned that the novel Coronavirus (SARS-CoV-2) might mutate into a more pathogenic strain, although both from experience and on general principles it is much more likely that with time its virulence will decrease. There are certainly exceptions that have maintained virulence (smallpox).

Samples of SARS-CoV-2 from around the world have by now accumulated thousands of different mutations, and these different phylotypes can be assembled into trees. As someone who has studied phylogenetics I find it particularly cool that time can inform the construction of the tree.

Those viruses of Wuhan -> Europe origin have particular mutations contributed by founding lineages, but as far as I know, at this time there is no evidence to support any physiological differences for any branches in the virus phylogenetic tree. There is no basis for talking about different virus strains.

There is no evidence for mutation of SARS-CoV-2 to increased pathogenicity, at least as yet.

Then, there is this one paper that was posted on medRxiv the other day (4-14-2020): [pdf].

It got a write-up in the South China Morning Post.

"Professor Li Lanjuan and her colleagues from Zhejiang University found within a small pool of patients many mutations not previously reported. These mutations included changes so rare that scientists had never considered they might occur."

It's getting breathless play on Fox as of two days ago, although they are fundamentally confused about just what was reported.

Preprint servers like medRxiv are useful, but none of the work is peer-reviewed and in this case, with a hot topic and a not-so-hot paper, it can contribute to confusion and misunderstanding.

So let's take a look. The title is:

Sars-CoV-2 has acquired mutations capable of substantially changing its pathogenicity

Samples were taken from 11 PCR+ patients near Hangzhou. Sample dates were 1-22 to 2-4.

Filtered samples (swabs, saliva, or stool) grew SARS-CoV-2 on Vero cells after 4-5 days (Vero cells are African green monkey kidney epithelial cells, which grow very well). Virus particles were collected by centrifugation and then filtered. Each pool originates from a different patient sample.

Pools (one per patient) were sequenced (NovaSeq 6000). In total, 31 mutations were identified, compared to the type strain (ref seq). There is a lot of discussion of the mutations observed, but without evidence of a physiological effect there is not much point in discussing them further.

One critical point is that the viruses were not pure samples; they were mixtures. Because of high mutation rates, in general any culture of bacteria, or viruses that infect bacterial or eukaryotic cells, is a mixture genetically. However, this means that the sequences cannot be reliably tied to the effects of the virus pools. For a bacterium or a bacterial virus we would isolate single colonies or virus plaques before beginning any experiment. For a eukaryotic virus, a different technique would be used, but it is not difficult. The only reason it wasn't done here is that it takes time.

The pools were then used to infect Vero cells. The scientists measured virus growth dynamics by qRT-PCR.

Let's just take a look at Fig 3A:

Ct refers to the PCR cycle during which specific product is first detected (threshold cycle). Lower Ct means higher initial concentration, and assuming nearly 100% efficiency for the amplification cycle, a difference of 1 cycle is a factor of nearly 2 in virus titer in the original inoculum.

There are 11 samples analyzed in duplicate, with a line drawn through the successive means of the two replicates at each time point.

The difference in Ct between the highest and lowest samples at early times of infection is about 5 (i.e. 2^5 or 32-fold), while at late times it is as much as 8 (2^8 or 256-fold).
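The conversion from a Ct difference to a fold difference is simple enough to write down. This sketch uses the usual model in which each cycle multiplies the product by (1 + efficiency); the function name and the 0.9 efficiency in the last line are only for illustration, since I don't know the actual efficiency of their reaction.

```python
def fold_difference(delta_ct, efficiency=1.0):
    """Fold difference in starting template implied by a Ct difference.

    efficiency=1.0 means perfect doubling each cycle (a factor of 2);
    real reactions are usually somewhat lower.
    """
    return (1.0 + efficiency) ** delta_ct


print(fold_difference(5))  # 32.0
print(fold_difference(8))  # 256.0
# With less than perfect amplification the implied fold difference is smaller.
print(fold_difference(5, efficiency=0.9))
```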

My first thought is that, eyeballing the data, it looks like there is a correlation between the early Ct and the 24-48 hr Ct --- that is, pools with greater titers at the end had higher input at the beginning. In other words, although they probably tried to infect with equal amounts of virus (they do not say), they did not achieve this, and samples that got less virus applied in the beginning had less virus at the end.

Secondly, it appears that 3 pools produce less progeny virus than the rest. A skeptic could argue that rather than identify a super-pathogenic strain, what they've done is isolate several mutants that are attenuated.

There is nothing about how the "wild type" or parental virus would perform in this assay.

One isolate might be exceptional (Z11). This is the only datum that supports the title.

The paper has a lot more speculation about the mutations observed and, for those that change the proteins, discussion of where they map on those proteins. The "changes so rare" (Post write-up) is one mutation that changed 3 adjacent nucleotides, which is actually not so rare.

Conclusion: the hype is not supported. This is not a good paper and the defects are not remediable. I would reject it.

And we should absolutely, definitely, clearly not be using such a paper, even if it should survive peer-review, to inform the thinking public about whether there exist distinct SARS-CoV-2 strains with different pathogenicities. Especially, increased pathogenicity. That's just scaremongering.