Wednesday, May 26, 2021

Origins of SARS-CoV-2

Here are some links to helpful articles about the "lab-leak" hypothesis:

Nicholas Wade (link), Yuri Deigin (link), Rossana Segreto and Yuri Deigin (link)

Andersen et al. (link), Yangyang Cheng (link), Glenn Kessler (link), Matt Yglesias (link)

Daniel Engber (link)

My favored hypothesis has been that the Covid-19 pandemic began in the same way the original SARS pandemic did, in a "spillover" of SARS-CoV-2 from a natural reservoir of the virus in bats, by way of an intermediate host.

Along with much of the scientific and political world, this week I've been reading, and updating my estimate of the likelihood of the alternative hypothesis: that the pandemic began when SARS-CoV-2 escaped by accident from a lab at the Wuhan Institute of Virology (WIV).

The problem with the spillover hypothesis is that, unlike with the original SARS, there is very little supporting evidence.  Rather, what we know about SARS-CoV-2 indicates that the virus appeared in humans in a single event. There is very little diversity in sequences of early virus samples.  It was not strongly associated with a "wet market" and there is no known intermediate host.

When the virus appeared in humans, it was already well-adapted to growth in them, with an S protein that fits its human receptor (the ACE2 receptor protein) very well, as well as a furin-cleavage site that is known to facilitate viral entry.  The furin site is essential for infectivity in humans.

With respect to the escape hypothesis, it did not help that Donald Trump supported this idea, suggesting the sinister possibility that release not only explains the pandemic, but that it was deliberate.  Since Trump has lied repeatedly about nearly everything, his statements actually decreased (at least for me) the likelihood that this hypothesis is correct.  Add to that the fact that so many right-wing politicians are looking to pick a fight with China.

I do not put much weight on vague reports that WIV staff were hospitalized with flu-like illness in the fall of 2019, nor on the counter-reports that samples from them were tested for SARS-CoV-2 and came back negative.  The former come from "intelligence sources" motivated to support Trump's contention, and the latter from official Chinese government sources equally motivated to keep things under wraps.  They are possibly true, but it is impossible to evaluate their accuracy now.

I also do not think much of the idea that, as David Baltimore reportedly said, the furin site is a "smoking gun".  Acquisition of such a sequence could easily have happened naturally, by recombination.  The site's codon usage does not provide convincing evidence that it was engineered.  Indeed, similar sequences that might have been recombined to give the furin site would perhaps also have been out-of-frame originally, and recent acquisition hasn't given enough time for codon optimization to the human host.  Also, if you were engineering a site, you would not likely have chosen this sub-optimal sequence over known examples that work better.

Modern methods do not leave signs in the sequence (such as convenient restriction sites for assembling a whole genome from DNA clones of parts of it) that would be unmistakable "signs of engineering" for older techniques.  So the frequent claim that the sequence doesn't show such signs isn't dispositive either.

What I find most troubling is the reporting that WIV (as well as labs elsewhere) was doing manipulations with parts of the S gene (the RBD).  They built hybrid viruses with S genes whose product was known to bind well to human cells.  These experiments were done not just with pseudo-typed lentiviruses (which would be safe), but with bat coronavirus backbones (which are definitely unsafe).  Incredibly, some labs even added furin cleavage sites.  Multiple labs did such experiments, in China, in the US and Japan, and in other places.

Adaptation experiments were also carried out, by serial culture in human cells and in "humanized" mice bearing the human ACE2 receptor.  Richard Ebright is correct to insist that all these experiments were exceedingly irresponsible.

A second problematic aspect is lab safety.  Shi Zheng-Li, the PI on these projects at WIV, stated that their work was carried out at BSL-2 and BSL-3, and Wade seizes on this to claim repeatedly that dangerous experiments were carried out at BSL-2.  If that is true, it would also be very, very bad, and even in the absence of any other confirming evidence it would weight the scale heavily in favor of the escape hypothesis.

A third problematic feature is the publication, by Shi and collaborators just after the beginning of the pandemic, of the viral sequence most closely related to SARS-CoV-2.  The sample was clearly from around 2013, and it is reported that the sequence file naming scheme indicates it was sequenced in 2017 and 2018.  If that is true, their attempt to hide this and pretend that the sample was only sequenced recently is also troubling.

Even if the relationship between the US and China were less antagonistic, I do not imagine that the Chinese government would ever allow investigators to access the original notebooks and sequence databases of the WIV investigators.  So at the end of the day, we shall probably never know the truth.  All we are left with is likelihoods that are heavily weighted by our "priors" about the two hypotheses.  

The real bottom line here is that even if the pandemic started by spillover, it certainly could have started by lab escape.  We should act on that, to ensure it doesn't happen again.

Update:  

Just to be clear though, the post above is my analysis of the likelihoods for two different hypotheses of what might have happened.  

Florian Krammer advises patience.  It took more than one year to find the relevant evidence for SARS.  For some other viruses, we have still not found the source of the spillover.  I think his priors for lab escape are too low, but still, the question is unresolved.  It might change in the future if evidence supporting spillover becomes available.

However, if the leak theory is what actually happened, the evidence is going to be buried very deep.  Plus, the Chinese will never allow such an investigation.  All that pressing for a scientific investigation will do is lead to China-bashing, which is not helpful.  And the idea that the American intelligence community will sort this out is laughable.

What matters most is for us to accept that it could have been a leak, and change our attitude about whether research of that type should continue.

Saturday, May 8, 2021

Sums from a uniform random distribution

Recently, I came across some pages about a problem that I simulated in Python years ago, but never solved analytically, although a reader left a hint in the Comments.

Here we start by taking a look at what this author calls "random number blackjack" or the problem of the sum of uniformly distributed random numbers.

For the rest of the post, all the random numbers that we will talk about are given by a random number generator which draws from uniformly distributed real numbers in the range [0,1), i.e. greater than or equal to zero, and less than one.  

We also consider the sum of one or more independent values from such a distribution, which forms a different distribution.

the game

On each turn of this unusual game, you draw a random number and add it to a running total; when the total reaches 1.0, you go bust and the game ends.  You receive $100 for each turn you play, including the last.  Suppose you drew 0.45, 0.35, and 0.25: on the third turn you would go bust, since the sum of the numbers drawn would be 1.05 > 1.0.  The payout would be $300.

And the question is, what are your expected winnings if you play the game many times?  Is it worth playing (accepting some risk of bad luck) if the buy-in is $250 per game?

initial thoughts

First of all, since the generator always gives a number less than 1.0, you always play at least two turns.  The expected value (mean) of the first number is 0.5.  For a continuous distribution, the mean has a technical definition, but for a symmetric distribution it lies on the axis of symmetry, which here is obviously 0.5.

The expected value of the sum of several random numbers is the sum of their expectations.

So, for example, the expected value of the sum of three draws is 1.5.  Most of the time we will quit after two or three turns, but occasionally there will be an extended run of smaller numbers and a corresponding increase in the number of turns.

[spoiler alert, you may want to look at random number blackjack to try to work out the answer]

What famous number do we know that lies between two and three?  This is so trivial to simulate in Python I won't even post an example.

On one run I got 2.718531.

So it looks like the result is equal to e.  (The internet says that adding more rounds doesn't help the accuracy because of limitations in the random number generator).
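For completeness, here is a minimal sketch of the kind of simulation that gives a number like that (my own sketch, not the author's script):

```python
import random

def turns_to_bust():
    """Play one game: draw until the running sum reaches 1.0."""
    total, turns = 0.0, 0
    while total < 1.0:
        total += random.random()   # uniform on [0, 1)
        turns += 1
    return turns

trials = 1_000_000
mean_turns = sum(turns_to_bust() for _ in range(trials)) / trials
print(mean_turns)   # ~2.718, i.e. an expected payout of about $271.80
```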

I came across this problem in a slightly different form in the terrific introductory calculus book, Calculus Made Easy, which was originally written by Silvanus P. Thompson (and is available as a Project Gutenberg download).

It was added by Martin Gardner when he edited the classic work (1998 ed., p. 153) and is simply a note about how e shows up everywhere.

But the problem is at least as old as Feller's classic text Probability (which I don't have, unfortunately).

related problem

Tim Black solves a related problem (here).  

Recall that for a standard (fair) die or dice, the expected value of a throw is the sum of each value times the probability that it will occur.  For a single die, the average is (1 + 2 + ... + 6) · 1/6 = 21/6 = 3.5.

For two dice, weighting each total by the number of ways it can occur (out of 36 equally likely rolls), we have

[1(2) + 2(3) + 3(4) + 4(5) + 5(6) + 6(7) + 5(8) + 4(9) + 3(10) + 2(11) + 1(12)] / 36 = 252/36 = 7

The distribution is no longer uniform, but it is still symmetric around the value of 7, which is the mean.  In the same way, the expected values of two random draws from a uniform distribution add (giving a mean of 1.0), but the resulting distribution is no longer uniform.

Suppose we know the probability distribution for the sum of n random numbers, for some value of n, and from it calculate the probability that the sum is still less than 1, i.e. that the game survives turn n.  Call that P(n).  A standard identity for expectations says that the expected number of turns is just the sum of P(n) over all n ≥ 0.

The probability distribution for the sum of two random numbers from the generator is symmetric about its mean of 1.0, so the probability that the sum is still less than 1.0 after two draws is 0.5.  So, in the same spirit as the dice tabulation, we have that P(2) = 0.5, and that contributes 0.5 to the expected number of turns.

We also have that P(1) = 1, since a single draw is always less than 1.  That's another 1 to add to the expected value overall.

So now, what is the probability distribution for the sum of three random numbers?  That gets a little trickier.  The difficulty is that the probability distribution changes as n changes.  Eventually, it becomes normal, but how different is it for small n like 3, 4, or 5?

Here is where our analyst has a great idea.

Imagine that we change the game slightly.  We still have $100 as the payout at each stage.

From a stream of random numbers, as the numbers are drawn we write into another stream the sum at each stage.  So in the example above we would get 0, 0.45, 0.80,  and then at the third draw the sum is 1.05.  Rather than write the last value, subtract 1 first, then write that down and keep going.  

Notice that over this stretch we have a valid game, a sequence of increasing values followed by one that must be smaller than the last.

The values are

0 0.45 0.80 0.05

The values are in ascending order until the last one, which might be anything smaller than 0.80.  This must be true for a single round from the game according to the rules we have set up.
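In code, the rewriting step applied to the example draws looks something like this (a small illustrative sketch, not code from the post):

```python
draws = [0.45, 0.35, 0.25]
stream = [0.0]
for x in draws:
    # append the running sum, wrapping back into [0, 1) when it passes 1.0
    stream.append((stream[-1] + x) % 1.0)
print(stream)   # approximately [0.0, 0.45, 0.80, 0.05], up to floating-point rounding
```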

Since there are n! ways of arranging n values, and only one of those arrangements has the numbers in strictly ascending order, the probability of the event (for a random uniform distribution) is 1/n!.  In other words, starting at the beginning of a stream of random numbers

the probability of obtaining a result of 1 is 1.

the probability of obtaining a result of 2 is 1/2!.

the probability of obtaining a result of 3 is 1/3!.

A "result of n" here just means that the first n values are in ascending order, which is exactly the event that the game is still alive after n draws.  Adding up these survival probabilities, and counting from n = 0 (every game survives zero draws, contributing 1/0! = 1), the first few terms are 1 + 1 + 1/2!.  The expected number of turns in a game is the sum of all of them:

E = sum 1/n!

This is just the infinite series for e.
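As a quick numerical check of the series (my addition, not from the post):

```python
from math import e, factorial

partial_sum = sum(1 / factorial(n) for n in range(20))
print(partial_sum, e)   # both ~2.718281828459045
```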

simulations

I wrote two simulations to show results relevant to this problem.  The first one shows the distribution of sums of n = 1, 2, 3 or 4 random numbers.  As you can see from the figure



even 3 at a time, the sums look pretty close to a normal distribution.  The Central Limit Theorem says that they will tend to normal, and there is a bunch of theory that I don't understand that says if the draws are from a uniform distribution then the convergence is very rapid.
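The author's plotting code isn't shown; a sketch of that kind of simulation, assuming numpy and matplotlib, might look like this:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng()
trials = 100_000

fig, ax = plt.subplots()
for n in (1, 2, 3, 4):
    sums = rng.random((trials, n)).sum(axis=1)   # sums of n uniform draws
    ax.hist(sums, bins=100, density=True, alpha=0.5, label=f"n = {n}")
ax.set_xlabel("sum of n draws")
ax.legend()
plt.show()
```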

I got curious about this alternate game, so I wrote a simulation which shows that the stream of sums, computed as above by discarding the whole-number part of each running total, still appears to be uniformly random (gist here).  The original data and the summed series are plotted in the same histogram with transparency 0.5.  The new data is random uniform, or close to it.
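The gist itself isn't reproduced here, but a sketch of the same check might be:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng()
draws = rng.random(100_000)
wrapped = np.cumsum(draws) % 1.0   # running sums with the whole-number part discarded

fig, ax = plt.subplots()
ax.hist(draws, bins=50, density=True, alpha=0.5, label="original draws")
ax.hist(wrapped, bins=50, density=True, alpha=0.5, label="wrapped running sums")
ax.legend()
plt.show()   # both histograms look flat, i.e. close to uniform on [0, 1)
```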


I don't know what the theoretical explanation for this is.  However, if it's true, then rather than compute the sums, we can just draw from the random uniform distribution and tally up the length of each run in which the values keep increasing, counting the value that finally breaks the run.  If we do the bookkeeping correctly, we get e as the result.

That means the original problem has the same answer as the alternative one.
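Here is a minimal sketch of that bookkeeping (my own, not the linked gist), counting the draw that breaks the run as part of the run, just as the bust turn counts in the original game:

```python
import random

def run_length():
    """Length of one ascending run, counting the draw that breaks it."""
    prev = random.random()
    length = 1
    while True:
        x = random.random()
        length += 1
        if x <= prev:      # the run is broken; this draw still counts
            return length
        prev = x

trials = 1_000_000
print(sum(run_length() for _ in range(trials)) / trials)   # ~2.718
```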

serious analysis

I have reworked what is on the Mathworld page as follows:



That's where I am so far.  There's plenty more to investigate.  

The sum of two random numbers from a uniform distribution has a distribution that is given by convolution of the individual distributions.  But then each distribution for n > 2 is formed by another convolution.  Ultimately, the distributions tend to the normal.  
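Numerically, at least, the 1/n! does fall out of repeated convolution.  Here is a small sketch (mine, not from the post or the MathWorld page) that convolves a discretized uniform density with itself and measures the area below 1:

```python
import numpy as np
from math import factorial

m = 1000                      # grid points per unit interval
dx = 1.0 / m
uniform = np.ones(m)          # density of a single uniform draw on [0, 1)
density = uniform.copy()
for n in range(2, 6):
    # density of the sum of n draws, via one more convolution
    density = np.convolve(density, uniform) * dx
    prob_below_one = density[:m].sum() * dx   # P(sum < 1), up to discretization error
    print(n, round(prob_below_one, 4), round(1 / factorial(n), 4))
```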

I don't see how you get to something as simple as 1 - 1/n! from that, although Tim Black gave us a different path above, which is why I wrote this post.

[Update:  I've been dense.  The "different path" is in fact the means by which the integral is evaluated.  It is not done by writing some complex expression and then seeking the antiderivative and evaluating it.  Instead, we know that the value for the cumulative distribution function at the upper bound must be 1, and at the lower bound it must be 1/n!. ]

There is a suggestion that this sort of thing is done more easily with generating or characteristic functions.  

Probably the first, simple thing would be to run the simulation using random numbers and not bother with the sum part, as we also looked at here.  [Update:  the result is as I suspected.  See gist.  If we simply find the length of runs in a stream of random numbers from a uniform distribution, where they are in increasing order, and then find the mean of those lengths, the result is e.]