Python for Bioinformatics: Phylodynamics

I was chatting over Google Duo with family last weekend and the topic came up of SARS-CoV-2 possibly circulating in the U.S. in late 2019.

Many people report strangely nasty cases of flu. And of course, Rush Limbaugh claimed that all of California had been infected in November, to support the idea that "herd immunity" already exists, and the country just needs to "open up."

I said that the genetic data show that this is impossible.

One person (an M.D.) said: "maybe the genetics is wrong." I was just dumbfounded. The evidence is overwhelming. So today I started trying to explain why that's so.

There is a guy named Trevor Bedford; his latest blog post is here. I will just summarize what he said and add a few tidbits of my own.

In phylogenetics, we can make trees of related organisms based on various characteristics, often the nucleotide sequence of the organism, or selected genes (when comparing species, protein sequences are even better). These trees show lines of descent from a common ancestor.

We can use the differences between aligned sequences to infer phylogenetic trees. Here's an example.
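As a rough sketch of what "differences between aligned sequences" means in practice, the snippet below counts mismatches between every pair of sequences. The four sequences and their names are made up for illustration; they are not real viral data, and this is only the first step that a distance-based tree method would start from, not a tree builder.

```python
from itertools import combinations

# Toy aligned sequences (invented for illustration, not real viral data).
aligned = {
    "ref": "ACGTACGTAC",
    "s1":  "ACGTACGTAT",
    "s2":  "ACGTATGTAT",
    "s3":  "TCGTACGTAC",
}

def hamming(a, b):
    """Count the positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))

# Pairwise difference counts for every pair of sequences.
distances = {
    (m, n): hamming(aligned[m], aligned[n])
    for m, n in combinations(aligned, 2)
}

for (m, n), d in distances.items():
    print(m, n, d)
```

In this toy table s1 and s2 are one change apart and both carry the same change relative to ref, so a tree would put them on the same branch, with s3 off on its own.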

We don't usually have the ability to sequence ancestors, but in a rapidly mutating virus (or other microbial) population, data can be obtained in a time series. Currently, there are several thousand SARS-CoV-2 sequences along with the dates on which the samples were taken from patients.

These can be assembled into phylogenetic trees, like this one from NextStrain.

This is a screenshot of the whole tree as of today.

Here's a link to a phylogeny restricted to the dates 2019-12-19 to 2020-02-06.

Now, I've been out of the game for a while, so I have not actually sat down and studied their methods carefully. I know that holy wars have been fought over "which is the best" method for building trees. Nevertheless, this is a phylogenetic tree which arranges the branches from left to right in inferred order of descent from a common ancestor.

In this particular tree, the branch lengths are based on date of sample. The sequences are colored by geographic origin. So for example, purple is China and red is North America.

Let's just focus on one or two aspects for now.

The red dot identified in the graphic above has the unwieldy official name 2019/nCoV/USA-WA1/2020.

It is a sequence determined for a sample taken from a guy who lives in Snohomish County, Washington. He went to visit his family in Wuhan in January, returned to the US, and developed symptoms of Covid-19. He was admitted to hospital and test results came back the next day (Jan 20) as Covid-positive.

He is the first individual known to be Covid-positive in the United States. Let's take a look at the sequence. I got the genome for the wild type virus and his virus from the links below.

To compare two sequences, I just used a tiny bit of Python (I still need to look for a web tool that does this).
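The script itself isn't reproduced here, but a minimal sketch of that kind of comparison might look like the following. The function names are my own, and it assumes the two genomes start at the same position and differ only by substitutions (no insertions or deletions), so a position-by-position comparison is valid. The toy strings at the bottom are invented.

```python
def parse_fasta(text):
    """Return the sequence from single-record FASTA text as one string.

    The header line is dropped and the remaining lines are joined, so
    newlines cannot shift the position numbers.
    """
    lines = text.splitlines()
    return "".join(line.strip() for line in lines if not line.startswith(">")).upper()

def diff_positions(ref, other):
    """List (ref_base, 1-based position, other_base) wherever the two differ.

    zip() stops at the shorter sequence, so extra bases at the end of the
    longer one are ignored.
    """
    return [
        (a, pos, b)
        for pos, (a, b) in enumerate(zip(ref, other), start=1)
        if a != b
    ]

# Toy example (invented sequences, not the real genomes).
ref = parse_fasta(">ref\nACGT\nACGT\n")
other = parse_fasta(">other\nACTT\nACGC\n")
for a, pos, b in diff_positions(ref, other):
    print(a, pos, b)
```

With real data, the text passed to parse_fasta would be the contents of the downloaded FASTA files for the accessions listed at the end of the post.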

ref v. WA1
C 8782 T
C 18060 T
T 28144 C

[Update: I messed up the first time. Since I did it by hand, the sequences still had newlines, which threw off the position numbers.]

[Update: according to the paper (pdf), all three of these mutations have been found in virus sequences from patients in China.]

So, relative to the "wild type" version of the genome, the WA1 virus carries 3 mutations. These are unlikely to mean anything. (I need to check whether they change the sequences of the proteins encoded by the relevant genes, but it is still nearly certain that they make no physiological difference).
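Checking whether a substitution changes the protein comes down to translating the affected codon before and after the change. Here is a minimal sketch using the standard genetic code. The coding sequence below is a made-up toy, not a SARS-CoV-2 gene; for the real mutations, the coding sequence and the offset of each position within it would have to come from the GenBank annotation, which is not shown here.

```python
# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = dict(zip(
    (a + b + c for a in BASES for b in BASES for c in BASES),
    AMINO_ACIDS,
))

def amino_acid_change(cds, offset, new_base):
    """Return (old_aa, new_aa) for a substitution at 0-based `offset` in `cds`.

    `cds` must start at the first base of a codon (i.e. be in frame).
    """
    start = offset - offset % 3          # first base of the affected codon
    old_codon = cds[start:start + 3]
    within = offset - start              # 0, 1 or 2: position inside the codon
    new_codon = old_codon[:within] + new_base + old_codon[within + 1:]
    return CODON_TABLE[old_codon], CODON_TABLE[new_codon]

toy_cds = "ATGTTAGCA"  # hypothetical coding sequence: Met-Leu-Ala

print(amino_acid_change(toy_cds, 3, "C"))  # TTA -> CTA, prints ('L', 'L')
print(amino_acid_change(toy_cds, 4, "C"))  # TTA -> TCA, prints ('L', 'S')
```

The first change is silent (both codons encode leucine); the second swaps leucine for serine.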

As Trevor writes on the blog, the second positive in Washington occurred in February and the sequence was available on Feb 29. It is called 2019/nCoV/USA-WA2/2020.

C8782T and T28144C (found in 74/224 sampled viruses from China)

C18060T (found in 6/224 sampled viruses from China)

So C18060T is quite rare. Yet it is also found in the second virus sample from WA.

ref v. WA2
C 5784 T
C 8782 T
C 17747 T
A 17858 G
C 18060 T
T 28144 C

WA1 v. WA2
C 5784 T
C 17747 T
A 17858 G
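The relationship among the three lists above is just set arithmetic, and a few lines of Python make it explicit. The mutation lists are copied from the comparisons above; the variable names are my own.

```python
# Mutations relative to the reference, as (ref_base, position, new_base).
ref_vs_wa1 = {("C", 8782, "T"), ("C", 18060, "T"), ("T", 28144, "C")}
ref_vs_wa2 = {
    ("C", 5784, "T"), ("C", 8782, "T"), ("C", 17747, "T"),
    ("A", 17858, "G"), ("C", 18060, "T"), ("T", 28144, "C"),
}

# Does WA2 carry every mutation that WA1 has?
wa2_has_all_of_wa1 = ref_vs_wa1 <= ref_vs_wa2

# Which mutations does WA2 have in addition to those of WA1?
new_in_wa2 = sorted(ref_vs_wa2 - ref_vs_wa1, key=lambda m: m[1])

print(wa2_has_all_of_wa1)
for a, pos, b in new_in_wa2:
    print(a, pos, b)
```

WA2 has all three WA1 mutations plus three more, which is exactly what the WA1 v. WA2 comparison shows.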

What you are looking at is a transmission chain for the virus. Somehow the Snohomish patient passed the virus to someone else, who passed it to someone else, and so on, until it reached the WA2 patient, a high school student who showed up to be treated and have his virus sequenced 3 weeks later, with no travel history and no known contact. The two patients live 15 miles apart.

They worked hard on Snohomish, tracked down the limo driver and his office mates and so on, and none of them got sick or ever tested positive.

Could this happen by chance? Theoretically, there could be a different traveller who introduced the same virus, but "only 2/59 sequenced viruses from China possess this variant." Very unlikely.
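To put rough numbers on "very unlikely," the counts quoted in this post can be turned into percentages. This is only arithmetic on the quoted figures; reading a sample fraction as the chance that an independent traveller carried the same variant is a simplification on my part.

```python
# (viruses carrying the variant, viruses sampled), as quoted in the post.
counts = {
    "C8782T and T28144C": (74, 224),
    "C18060T": (6, 224),
    "variant quoted as 2/59": (2, 59),
}

# Percentage of sampled viruses carrying each variant, to one decimal place.
percent = {name: round(100 * k / n, 1) for name, (k, n) in counts.items()}

for name, value in percent.items():
    print(f"{name}: {value}%")
```

Either way the rare variant is counted, it shows up in only about 3% of the sampled viruses from China.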

On the other hand, early samples from California indicate independent introductions from China. There doesn't appear to be a CA1 either on NextStrain or NCBI, but there is a CA2.

ref v. CA2
C 17000 T
G 26144 T

The CA2 sample doesn't have the same mutations as WA1. It is not a direct descendant but a cousin.

So there are several basic arguments I can make about the possibility of SARS-CoV-2 circulating earlier than mid-January.

First, the phylogenetics show that the virus emerged in Wuhan, China, late in November or early in December, 2019. All of those virus sequences are very closely related. If the virus had been in humans for more than a couple months the diversity of sequences would be much, much greater.

If the virus had been in California in November and December, the diversity of sequences would also be much greater.
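The diversity argument can be sketched with a simple molecular clock. Everything numeric here is an assumption for illustration: the rate of about two mutations per genome per month is a ballpark figure I am plugging in, not a number from this post, and modelling the count as a Poisson variable is also a simplification.

```python
import math

# ASSUMPTION: illustrative clock rate, not a figure taken from this post.
ASSUMED_MUTATIONS_PER_MONTH = 2.0

def expected_mutations(months, rate=ASSUMED_MUTATIONS_PER_MONTH):
    """Expected number of mutations accumulated along one lineage."""
    return rate * months

def prob_at_most(k, mean):
    """Poisson probability of seeing k or fewer events when `mean` are expected."""
    return sum(math.exp(-mean) * mean**i / math.factorial(i) for i in range(k + 1))

# How likely is a lineage with 3 or fewer mutations after 1, 3 or 6 months?
for months in (1, 3, 6):
    mean = expected_mutations(months)
    print(months, mean, round(prob_at_most(3, mean), 4))
```

Under these assumptions, a lineage that had been circulating for six months would be expected to carry around a dozen changes, and one with only three would be rare.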

Second, most later cases bear characteristic mutations that indicate a transmission chain back to an introduction in January or February. Tens of thousands of people traveled from Wuhan to other places then, and some of them came to the US. You will recall the travel ban at the end of February. It was too late.

Here is a screenshot from the NextStrain site.

It looks like Snohomish is headed for Nebraska, but that's just because they send the arrows to the labels on geographic locations.

We may not have samples from all the lineages, but the overall pattern is quite clear, and convincing.

wild type RefSeq: NC_045512.2 Wuhan-Hu-1
WA1: GenBank accession MN985325
WA2: GenBank accession MT152824
CA2: GenBank accession MN994468.1

[Update: here is an excellent Twitter thread on the origins of SARS-CoV-2.]

Wednesday, April 29, 2020
