Python for Bioinformatics: 2020

Thursday, December 24, 2020

Archimedes and pi

I've written a lot about Archimedes and the approximation of pi over the years (appropriated from the web, actually).

Most of that material is in my Analytic Geometry book. I am in the process of re-doing the Github repos for the books, so they're not currently up-to-date, but that will change over the next week or so. You can find me on Github here.

In the last couple of days, I wrote a summary of the math of Archimedes with respect to pi, including how his formulas connect with others based on perimeter and area that are due to Gregory. There is also extensive discussion of methods to approximate the square root of 3, essential to Archimedes work.

That pdf is on Dropbox here. Enjoy, and as always, comments would be welcome.

Monday, December 14, 2020

Pain with mac OS

I've run Macs ever since 1984. Never had anything else except one Raspberry Pi running Linux.

But I'm alarmed by clear signs of decay in QA for Mac system software. Of course, they say you get what you pay for. :) But Apple is a hardware company and I've gladly paid a premium for a beautiful OS and well-designed hardware.

My current issue (though not the first) is the latest mac OS 11 (Big Sur).

So there I was, minding my own business, running Catalina and streaming video through the browser, Safari, when out of nowhere I got the spinning beachball. Machine is unresponsive, unable to kill the app, transitioning to a black screen of death, which was new to me. Bizarrely, it had a cursor which responded to the trackpad.

Luckily, there is this thing called googling. Turns out you can press the power button for 10 seconds and the laptop will shut down. Reboot and it seems fine. So what happened? The internet says that it has to do with system software updating, some conflict that happens with sleep.

I go to System Prefs > Software Update and it looks like this:

I never set this. My 1 yr old laptop came with this as the default. So the auto-updater crashes the machine. Apparently it is downloading/installing Big Sur.

I uncheck everything. Then, some demon tempted me to download and install Big Sur myself, which I did manually from the main screen for this preference. Repeatedly (3x), I download 12 GB, the install fails, and then the 12 GB is deleted (or at least, neither I nor Update knows where it is). You gotta start over. The third time, I am on it when the download finishes, and I hit the button, so it goes.

Not crazy about the updated UI, but that's life.

Now, one day later, there's an update.

Since auto-update can crash, I have only the first checkbox selected in the first window, above.

I manually download the 11.1 update (Update Now). Three times through, I download 3.x GB, then it stalls. Right here.

I don't have the patience to do anything more. I'm just angry. How hard could it be to set up a fake install and test whether it works properly? And then to throw it away each time. They just don't frickin' care anymore.

Saturday, July 25, 2020

Rosalind Franklin's centenary

Today is the 100 year anniversary of the birth of Rosalind Franklin, and I have something to say about Nobel Prizes and scientific reputations.

I think we can agree, from the start, that prizes for science suck. In particular, the Nobel Prize sucks. Some prizes have been given to yahoos. Other prizes have been given to mentors but not to the juniors who actually did the work, and in some of those cases the juniors had the great ideas. See Enders, John for a counter-example.

Some prizes were limited because more people deserved them than just the maximum of 3.

Bob Gallo should have been awarded the Nobel Prize, he invented every tool for studing HIV that anyone ever used. But he was an asshole, and the Swedes decided to make a point.

Rosalind Franklin arguably deserved the Nobel Prize. Her major problem was that she died of cancer several years before the prize for DNA was awarded, and there is a rule against posthumous awards.

One often sees (now more frequently) the claim that Watson and Crick "stole" Franklin's data and received recognition for discovery of the double helix when it was rightfully hers. It's feminist dogma.

It's also revisionist nonsense.

It is false that Franklin discovered the double-helix. What she did was to take a beautiful X-ray picture of DNA that Crick knew immediately implied the double helix. Franklin herself was unsure. She wrote "if a helix..." Furthermore, this would be her only contribution, seminal to be sure, but her only one.

Franklin was in some respects in competition with Watson and Crick, and she was definitely unhappy when it became known that her boss Wilkins (she disputed his authority), and Max Perutz, had made her picture available to Watson and Crick. It had already been shown in local seminars. Perutz, as straight an arrow who ever lived, was not bothered about how things went down.

Beyond that, the discovery of the double helix relied on deep crystallographic and structural knowledge of Crick, as well as the enormous luck of Watson. He had no idea what the picture meant, but he knew about Chargaff's ratios, and he knew the correct structures for the bases in solution. This was crucial. The ultimate aha moment was Watson's, much as I would wish otherwise. (Because he is also an asshole).

Franklin was, in years shortly following the events, friendly enough with Crick and his wife that they vacationed together. There was no animosity between them about DNA. She died 5 years later, in 1958.

If you dispute any of this, go re-read Judson's book and Perutz's writing about this and get back to me.

The idea that Watson and Crick stole the double helix from Franklin is complete nonsense.

[Update: re-reading Judson after 40 years, there is a long discussion of who knew what, when, mainly informed by interviews with Perutz. In particular, the crucial inference from Franklin's photos was not that the structure is a helix, she recognized that, but that the symmetry is "monoclinic". This has as the consequence that a 180 rotation is symmetric, which then has as a consequence that a two-chain model must have the chains going in opposite directions. ]

Sum of angles, revisited

I've run into a number of different derivations for the "sum of angles" formulas over the years. These are

cos s + t = cos s cos t - sin s sin t

sin s + t = sin s cos t + cos s sin t

For example, here is one using Euler's formula, and here is a derivation of the difference version of the cosine that I encountered in Gil Strang's Calculus textbook. As I've said, I find it only necessary to memorize one:

cos s - t = cos s cos t + sin s sin t

which is easily checked for the case where s = t.

Recently I ran into a couple more. Probably the simplest is this one:

Two stacked right triangles with a surrounding rectangle. The upper triangle is sized so that the hypotenuse is 1 and the sine and cosine are obvious. For the triangle with angle phi, we need an extra term in the sine and cosine so that when dividing, say, opposite/hypotenuse, the result is correct.

The angle theta + phi is known by the alternate interior angles theorem, and the small triangle with angle phi is known by a combination of the complementary and supplementary angles theorems.

Now, just read the result. One diagram gives both formulas!

There is also a simple derivation for the sine formula based on area calculations. We calculate the area of this triangle in two ways:

On the left, we have that

A = (1/2) a sin (theta + phi) b

On the right (h = a cos phi = b cos theta) and

A = (1/2) h a sin phi + 1/2 h b sin theta

= (1/2) (b cos theta a sin phi + a cos phi b sin theta)

Equating the two and factoring out (1/2)ab, we obtain the addition formula for sine. (Unfortunately, I've lost the links where I saw these).

Here is a link to a rather comprehensive chapter on the subject from my geometry book (Github).

Thursday, June 25, 2020

Shapefiles

I've been working with maps using Python, primarily maps for the United States. The standard format for much geographic data is GeoJSON.

But there is another format that is even more official, and is maintained by the US Census Bureau. That is a collection of Shapefiles.

To recap, here is a site where GeoJSON files for the Us are available in multiple sizes (small, medium, large), as well as their .kml and Shapefile .shp equivalents. The sizes are 500k, 5m, 20m, from largest to smallest (the labels are the scales, 500k being the most detailed).

The files were obtained from links on this page. It includes data for the US, for the states, and for counties. And it contains Congressional districts, which would be useful to remember.

A Shapefile is binary-encoded geographic data in a particular format. A good discussion is here.

The specification was developed at least partly by ESRI, which develops geographic information software.. The encoding was undoubtedly designed to save space, back when space (storage, transmission bandwidth) was much more expensive. Now, the opaque data is a liability.

Shapefiles

There is actually not just one file, but always a minimum of three including .shp, .shx and .dbf inside a zip container. The Shapefiles from the US Census also have .prj and .xml.

I can't tell much by looking at one with hexdump, except that most of it is aligned (in sections), on 16-byte boundaries. The format is described at this link, but I haven't worked with that.

One way to open and read a Shapefile is to use geopandas. I grab that with pip3 install geopandas

The example is

>>> d = 'gz_2010_us_040_00_20m'
>>> fn = d + '/gz_2010_us_040_00_20m.shp'
>>> df = gpd.read_file(fn)

The columns are:

GEO_ID
STATE
NAME
LSAD
CENSUSAREA
geometry

>>> df.NAME.head()
0        Arizona
1       Arkansas
2     California
3       Colorado
4    Connecticut
Name: NAME, dtype: object

For some reason the order is several joined partial lists of states, each one alphabetized.

We need to extract the coordinates for a particular state:

import geopandas as gpd
fn = 'ex.shp'
df = gpd.read_file(fn)
sel = df['NAME'] == 'Maine'
g = me = df.loc[sel].geometry
from shapely.geometry import mapping
D = mapping(g)
for f in D['features']:
    print(f['id'])
    L = f['geometry']['coordinates']
    
    for m in L:
        print(len(m[0]))
        print(m[0][0])
        print(m[0][1])
        print()

> python3 script.py
39
11
(-69.307908, 43.773767)
(-69.30675099999999, 43.775095)
11
(-69.42792, 43.928798)
(-69.423323, 43.922871)
...

Pythagorean theorem redux

Here is a cool proof I saw on Twitter, it was an RT by @StevenStrogatz from this guy:

Take a generic right triangle. Flip, rotate and scale it by multiplying each side by a factor of b. I did this by imagining that a = 1 and then b = ab. The complementary angles are marked with circles on the right.

So then construct a rotated triangle from the same input, but scaled by a factor of a, and attach it to the other one. We know that the sides marked ab are parallel, and that the angle between bc and ac is a right angle, by the properties of right, complementary and supplementary angles.

The four outside vertices form a rectangle, from the angles and also since ab = ab.

Finally, rotate and scale again, by a factor of c:

Slide them together

□

Albers projection

The standard projection used in converting latitudes and longitudes on the (roughly) spherical earth to a planar map depends on what you’re projecting.

Most people know about the Mercator projection.

The one used extensively for maps of the United States is called the Albers Equal-Area Conic projection.

The wikipedia article gives some formulas:

You must first choose two reference latitudes. Latitudes are referred to by the letter phi and longitudes by the letter lambda. So the two references are phi_1 and phi_2.

You also choose a center for the map, at phi_0, lambda_0.

From these four values you calculate n, C and then rho_0, which are each the same for every transformation in this projection.

Finally, for each coordinate phi, lambda one calculates rho and theta and then finally

x = rho sin theta

y = rho_0 - rho cos theta

This, however, is for the assumption of a spherical earth. The equations for an ellipsoid are a bit harder.

The wikipedia article gives this url for a pdf and I found the same report referenced in this answer to a question on Stack Exchange.

so that was lucky, because it gave me not only the equations but also a worked numerical example for each, which helped greatly in finding my mistakes in the code.

The math is explained in a write-up done in LaTeX, as a Dropbox link to a pdf. The math has been encoded as Python scripts sphere.py and ellipsoid.py here.

Here are screenshots from the manual:

The test is a latitude of 35°N., -96°W., which should give a result (for the ellipsoid) of

> python ellipsoid.py 
test1
p0: 23.0
l0: -96.0
p:  35.0
l:  -75.0
x:  1885472.7
y:  1535925.0

The first part of the output is the center we chose. The result for x and y matches the source.

So then, for a script in the same directory as ellipsoid.py, we can do import project and do

> python3 map_counties.py counties.geo.txt AL

No longer squashed.

Naturally, there is software out there that will do this. In particular, the Proj transformation software.

I obtained it with Homebrew

> brew install proj


> echo 55.2 12.2 | proj +proj=merc +lat_ts=56.5 +ellps=GRS80
3399483.80	752085.60

which matches their example.

Of course, we really want the Albers equal area conical projection based on the ellipsoid.

> echo -75 35 | proj +proj=aea +lat_1=29.5 +lat_2=44.5 +lat_0=23 +lon_0=-96 +ellps=clrk66
1887211.95	1533994.75

These are close but don't quite match. I will have to explore Proj more to know why.

Put the input data into a file and do:

> cat data.txt | proj +proj=aea +lat_1=29.5 +lat_2=44.5 +lat_0=23 +lon_0=-96 +ellps=clrk66
1887211.95	1533994.75
>

Plotting polygons

In the previous two posts (here and here), we looked at geographical heat maps (choropleth maps), constructed using plotly. The purpose was to visualize COVID-19 data. Now, we're going to dive deeper into maps, and for that purpose I made a new github repo.

As we said, it is easy to use plotly.express to make a map, say of the entire United States, or countries of the world, or US counties.

However, there are some limitations. plotly is developed by a company whose main product appears to be Dash, which aims to be a fancy UI layer for working with data science overlying the basic widgets and libraries.

Mapping, in particular, is an issue. While plotly.express.choropleth is easy, it is also limited.

Once you want to dive deeper (for example, to zoom in on a map of a single state before display, or add labels to a map), you run into the fact the express layer doesn't allow it, and that the underlying maps are actually from a company called Mapbox.

Mapbox wants to be a kind of Google maps that you embed in your iPhone app, only better. They provide an introductory free level, but you have to sign up and obtain a Mapbox Access Token for many things.

I decided to go without. For the moment, I'm going to continue with plotly, although ultimately I'm leaning toward the position that matplotlib will be a better choice.

So what we're going to try here is to plot the map components (states) as polygons. We'll run into some issues and solve them.

The first script is extract_counties.first.py.

There's nothing special about it except (i) it deals with the encoding issue we talked about before and (ii) it has a significant bug, which we will diagnose and fix.

> python3 extract_counties.first.py counties1.json > counties1.geo.txt
>


Autauga
Alabama
01001
-86.496774	32.344437
-86.717897	32.402814
-86.814912	32.340803
-86.890581	32.502974
-86.917595	32.664169
-86.71339	32.661732
-86.714219	32.705694
-86.413116	32.707386
-86.411172	32.409937
-86.496774	32.344437

Baldwin
Alabama
01003
...

Now we'll just use plotly to draw all the polygons for the state of Alabama using

> python3 map_counties.first.py counties1.geo.txt AL

That looks kind of promising, it is recognizably Alabama.

There are a couple of problems. The first is that the shape is squashed.

And the second, more subtle, is that there is a missing county at the southwest corner: Mobile.

This issue becomes more obvious when we try California.

That's definitely wrong. We are missing Los Angeles Ventura and Santa Barbara counties, and if you look closely, San Francisco is missing as well.

The solution came to me when I realized that the term "MultiPolygon" had been mentioned, and I hadn't really understood what it was. I had just adjusted for the case when the coordinates were wrapped by four '[[[[ .. ]]]]' versus three '[[[ .. ]]]' without thinking about it.

As an example, let's look at entries in the GeoJSON data that are MultiPolygons (most are just plain Polygon). It is Bethel (a County or Census Area) in Alaska. Here is a list of the first 18:

Bethel, AK
Hoonah-Angoon, AK
Juneau, AK
Kenai Peninsula, AK
Ketchikan Gateway, AK
Kodiak Island, AK
Lake and Peninsula, AK
Nome, AK
Petersburg, AK
Prince of Wales-Hyder, AK
Sitka, AK
Valdez-Cordova, AK
Wrangell, AK
Mobile, AL
Los Angeles, CA
San Francisco, CA
Santa Barbara, CA
Ventura, CA
..

We notice that Mobile and the four California counties are also on the list.

A map of Bethel reveals that it has not only an island but also a lake!

Bethel County (actually Census Area), Alaska is a MultiPolygon and has three arrays

[
 [[[-173.116905, 60.516005] ..  36 items .. [-173.116905, 60.516005]]],
 [[[-165.721389, 60.16962]  .. 128 items .. [-165.721389, 60.16962]]], 
 [[[-160.534142, 61.947257] .. 314 items .. [-160.534142, 61.947257]]]
]

Bethel County includes a large island called Nunivak Island.

The city of Mekoryuk on Nunivak has a lat,lon of 60.370411,-166.257202. This suggests that the second list is the polygon for Nunivak.

[-166.310655, 60.377611], [-166.200019, 60.393404] is one adjacent pair of vertices in the second list.

I thought from the restricted range of values in the first list that it was probably the lake, but it's not. It's a second, very small island. I made a fake GeoJSON file with just Bethel County and then plotted it:

So the prediction is that Bethel County won't render properly by my script. It's not the greatest test. Most of Alaska is messed up (not shown)

I go back and fix extract_counties.py and then run the plotting script:

python3 map_counties.first.py counties.geo.txt CA

We have found the missing counties.

The next problem to solve is the squashed forms of the states. The reason for that is that the latitudes and longitudes refer to the globe (an oblate spheroid), while the map is a projection of those points onto a 2-dimensional surface. There are a variety of projections and the math is pretty complicated, so we will save that for next time.

Wednesday, June 24, 2020

GeoJSON

GeoJSON is a JSON representation of geographic data. Although JSON was developed for javascript, the format should be very familiar to any Python programmer. There are collections: dicts and lists, as well as simple types like: string, number, boolean or null. Dictionaries are called objects, and may also be values contained in lists or other dictionaries.

A typical GeoJSON file looks like this:

{"type":"FeatureCollection",
 "features":
  [
    {"type":"Feature",
     "properties":{"GEO_ID":"0500000US01001",
                   "STATE":"01",
                   "COUNTY":"001",
                   "NAME":"Autauga",
                   "LSAD":"County",
                   "CENSUSAREA":594.436
                  }
     "geometry": {"type":"Polygon",
                  "coordinates": [[ [-86.496774, 32.344437],
                                      ...
                                    [-86.496774, 32.344437]
                                ]]
                 }
     "id":"01001"
    }
 
    {        
    "type":"Feature",
    ... 
                   "STATE": "01", 
                   "COUNTY": "009", 
                   "NAME": "Blount"
    ...
    }
 
  ]
}

(I've formatted whitespace to my taste, YMMV).

At top level, it is a dict with two keys. The "type" is "FeatureCollection".

The second key is "features", which is a list of Feature objects.

The "id" is a FIPS code for the Feature. This Feature is Autauga County, Alabama. The FIPS code for the state is "01" and the full FIPS for the county is "01001".

Each Feature is a dict with four keys. The "properties" and "geometry" keys yield dicts in turn. The "geometry" dict has a key "coordinates" which gives a list of longitude and latitudes for each vertex of the Feature's map.

The coordinates are nested inconsistently. It it's two deep: a list with one element that is a list with many 2-element vertices. Sometimes it's three deep.

[ Update: this is a great over-simplification. The problem is that these are Multipolygons, and they do mean multi. That will have to wait for another post. Here's a link to the Mapbox documentation. ]

The number of vertices depends on the resolution. For some counties at high resolution, it could run to 300 or more tuples.

For use with plotly.express.choropleth, there's no reason to parse this stuff or to generate a file with only the desired Features.

Just load the data

import json
with open(fn,'r') as fh:
    counties = json.load(fh)

then make a pandas data frame with one or more elements like

    fips  value
0  01001      1
1  06071      2

import pandas ad pd
pd.DataFrame({'fips':list_of_fips, 'value': list_of_values})

(or just pass {'fips':list_of_fips, 'value': list_of_values} as the df argument to choropleth).

fig = px.choropleth(
    df,
    geojson=counties,
    locations='fips',
    color=["Autauga","San Bernardino"],
    color_discrete_sequence=cL,
    scope='usa')
    
fig.show()

The locations argument says to match the "fips" from the data frame with the default value in the GeoJSON, which is its "id". The colors can be specified as a color list like

cL = ['green','magenta']

See the plotly tutorial section on colors for more detail. The scope argument limits the region plotted on the map.

States are even easier, since plotly already knows the state GeoJSON data.

import plotly.express as px
fig = px.choropleth(
    locations=['CA','TX','SC'],
    locationmode="USA-states", 
    color=[1,2,3], 
    scope="usa")
    
fig.show()

The county GeoJSON data is available from plotly, see the first example for the URL).

I came across a collection of GeoJSON files on the web. This data is originally from the US Census, and has been converted to GeoJSON by the author of that post.

One slight hiccup is that the GeoJSON file includes Puerto Rico, which has municipalities with Spanish names. These are encoded with ISO-8859-1.

This uses high-order bytes to represent Spanish ñ, í and so on. Some of these byte sequences are not valid UTF-8.

> python3
Python 3.7.7 (default, Mar 10 2020, 15:43:33) 
..
>>> fn = 'gz_2010_us_050_00_20m.json'
>>> fh = open(fn)
>>> data = fh.read()
Traceback (most recent call last):
..
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 935286: invalid continuation byte

Take a look:

>>> fh = open(fn,'rb')
>>> data = fh.read()
>>> data[935280:935290]
b'"Comer\xedo",'
>>> list(data[935280:935290])
[34, 67, 111, 109, 101, 114, 237, 111, 34, 44]

This is for Comerío Puerto Rico. The byte (237, 0xed) is mapped to the accented í But this and the following two bytes are

>>> '{:08b}'.format(237)
'11101101'
>>> '{:08b}'.format(111)
'01101111'
>>> '{:08b}'.format(34)
'00100010'

In UTF-8 the first byte starts with 1110, which indicates that this is a three-byte sequence, but the third byte of the group does not begin with 01 and so is not valid.

There is a nice module to document such issues. I do pip3 install validate-utf8.

> validate-utf8 src.json
invalid continuation byte
  COUNTY": "045", "NAME": "Comerío", "LSAD": "Muno", "CENSUSARE
                                ^
...

To fix this

>>> fn = 'gz_2010_us_050_00_5m.json'
>>> fh = open(fn,'rb')
>>> s = fh.decode('ISO-8859-1')
>>> fn = 'counties.json'
>>> fh = open(fn,'w')
>>> fh.write(s)
>>> fh.close()

For my later explorations, I found it useful to parse the GeoJSON to a flat format with elements like:

01001
Autauga
Alabama
-86.496774	32.344437
-86.717897	32.402814
-86.814912	32.340803
-86.890581	32.502974
-86.917595	32.664169
-86.71339	32.661732
-86.714219	32.705694
-86.413116	32.707386
-86.411172	32.409937
-86.496774	32.344437

Shapefiles and conversions are complicated enough that we'll leave them for another post.

COVID-19 data analysis

I've been working on a project to analyze the data on COVID-19 cases and deaths (GitHub repo). The data is aggregated from public health departments by some folks at Johns Hopkins. Their dashboard is here, and the data are in this directory on Github.

Here is the help page for the covid project (it is the same for most of the scripts):

> python3 one_state.py --help
flags
-h  --help    help
-n   <int>    display the last n values, default: 7
-N   <int>    display N rows of data: default: 50
-c  --delta   change or delta, display day over day rise
-d  --deaths  display deaths rather than cases (default)
-r  --rate    compute statistics
-s  --sort    (only if stats are asked for)
to do:
-u   <int>    data slice ends this many days before yesterday 
-p  --pop     normalize to population
example:
python one_state.py [state] -n 10 -sr

And here is the output (today) for that example:

> python3 one_state.py SC -rs
               06/17 06/18 06/19 06/20 06/21 06/22 06/23  stats
Charleston      1264  1403  1554  1728  1836  2044  2251  0.094
Oconee            95   100   105   110   136   154   142  0.083
Pickens          348   367   429   464   499   529   570  0.083
Calhoun           47    48    58    62    69    73    74  0.082
...
total          20556 21533 22608 23756 24661 25666 26572  0.043

The statistic is a linear regression on cases, normalized to the mean of the values, and then the counties in my state (SC) are sorted according to the result. Charleston is my county, and unfortunately, it is the county with the highest rate of growth of cases in the state. Currently in the US, the top 20 counties are:

> python3 us_by_counties.py -rs -n 4 -N 20
                   06/20   06/21   06/22   06/23  stats
Thomas, KS             0       0      10      12  0.836
Hot Spring, AR        53      53     138     226  0.514
Holmes, FL            47      47      58     121  0.341
Jim Wells, TX         22      27      34      46  0.245
Brewster, TX          24      24      39      45  0.236
Erath, TX             44      44      44      85  0.227
McDonald, MO         170     366     371     403  0.215
Sharkey, MS            9       9      13      16  0.213
Blanco, TX            14      14      22      24  0.205
Newton, TX             6       8      11      11  0.2
Aroostook, ME         11      17      19      21  0.188
Tehama, CA            34      34      53      54  0.181
Sioux, ND             12      12      19      19  0.181
Okfuskee, OK           7       7      11      11  0.178
Bourbon, KS            9       9      14      14  0.174
Lawrence, MO          11      13      13      19  0.171
Harvey, KS            13      13      20      20  0.17
Pontotoc, MS          93      93     128     146  0.169
Letcher, KY            8       8       8      13  0.162
Live Oak, TX          10      10      15      15  0.16

I chose n = 4 so the output would be formatted correctly for the blog post.

As with any large dataset, there are some problems to work through, which are not solved perfectly yet. Also, I've focused more on the U.S. lately, so scripts for world data haven't been updated yet either.

What I got interested in and want to show is the generation of maps of the US by states or counties, or one or a few states by counties, where the fill color is based on, for example, the growth rate of cases. Here is the US by states.

I haven't generated the color bar horizontally yet, I just cut it out and rotated it, so the writing is rotated as well.

This type of map is called a choropleth map. I stumbled across a python tool for generating maps. It's part of the plotly library. It is as simple as

fig = px.choropleth(
    df,
    locations=abbrev,
    locationmode='USA-states',
    color=st,
    color_continuous_scale='Plasma',
    scope="usa",
    labels={'color':'growth'})

fig.show()

The details are slightly complicated, but not bad.

df is a pandas data frame that maps states by two-letter abbreviation to the corresponding statistic.

df = pd.DataFrame(data={'state':abbrev, 'value':st})

You need GeoJSON data for a county map (the states are already known to plotly.express). That data file is available from them.

The colors are mapped to the statistic st as read from the data frame. The last line of the call to px.choropleth assigns the title to the color bar.

The script for the US states is here and for the counties it is here.

This is the state of South Carolina today. These colors are an attempt to make the positives pop out more.

There's much more to discuss. I have always wanted to make a map of the US with my road trips plotted on it, something like this. For that we need to talk about GeoJSON data and how to obtain it, as well as the Albers projection that is used in making maps. It turns out that the standard methods from plotly have a significant limitation and I had a really weird bug in my code that I eventually figured out.

Finally, we'll need to find how to generate the data for each individual trip to overlay on the map. That's all for later.

Monday, May 18, 2020

Virus attenuation

Vaccines and IgA

It is widely held that attenuated live-virus vaccines are the best vaccines, at least for respiratory diseases, because they are able to induce an IgA response. Plus, nearly all viruses enter the host via mucosal surfaces (oral cavity, nose, gut, lungs), where the first line of defense includes IgA.

Poliovirus is the exception that proves the rule. Polio grows extensively in the intestine, hence its transmission by the fecal-oral route, resulting in a unique epidemiological role for swimming pools in advanced countries. However, polio doesn't seem to cause much pathology there. The major trouble occurs when it moves to the nervous system.

OPV (oral polio vaccine) induces IgA while IPV (inactivated polio vaccine) does not. This restriction is not a serious problem for the inactivated virus vaccine because it seems that IgG is sufficient to prevent polio's movement to the nervous system.

There is some prospect for development of adjuvants that would drive B-cells to switch to producing IgA. One of the best understood is cholera toxin, where binding to certain cells stimulates IgA-promoting cytokines (IL-1, IL-6, IL-10). However these are too toxic for use in humans. Perhaps someday we'll understand enough to be able get an IgA response with a killed virus, but that moment has not come yet. ref

Virus attenuation

We are left with the fact that historically, live virus vaccines have mainly been produced via attenuation. We can look at some modern approaches elsewhere, like smallpox or adenovirus or retroviruses that are already attenuated and can display viral antigens.

Yellow Fever (YFV) virus ("the black vomit") is now a disease of the tropics, but it was once common in the US. 10% of Philadelphia's population was lost in an epidemic in 1793, and even more in New Orleans some years later.

In the mid-1930s, Max Theiler found that YFV could grow in mouse embryos. So it was passaged from one mouse embryo to another, and then it was found that, somehow, the virus acquired the ability to grow in chicken embryos. So one general approach is to try to grow the virus in some kind of cells (anything): human, if necessary, or monkeys or mice and then adapt them to grow in chicken cells.

The chicken cells can either be in a whole embryonated hen's egg (i.e with a growing embryo), or cells growing in a culture dish. Here is a picture showing the different sites within the egg that are suited for different viruses (I believe these are mostly viruses that have been adapted to grow in eggs already).

I haven't read enough to know, but I would suspect that cultured cells for virus were usually CEF (chick embryo fibroblasts), which are easy to prepare. You mince the embryo, first removing the head, and put the pieces in culture. A few days later you harvest the growing cells by treatment with the enzyme, trypsin. A few cells are transferred to a new flask. Now, you have fibroblasts which will grow for a number of generations, and no more embryo. These are called primary cultures.

Today a number of cell lines have been derived from chicken that will divide forever. There is a famous cell line from humans called HeLa which you may have read about.

The measles virus (MV) was first grown in human kidneys in tissue culture, then in human placentas in tissue culture, and then in chicken eggs. Later, it was adapted to primary cultures of CEF. In rare cases cells derived from an aborted human embryo have been used.

Another approach is to adapt the virus to growth at lower temperatures. Due to paywall restrictions, I haven't been able to read much of the literature on this, but I believe it was done in CEF growing at 25°C. There is a well-known influenza live virus vaccine of this type.

So, a virus that normally infects humans and causes disease is adapted to grow in chicken cells, or adapted to grow at 25°C instead of 37°C, or both. Afterward, you may find that the procedure yields a virus that no longer grows very well and does not cause disease in humans.

This may be because a virus can specialize in one or the other but not both. Or it may be that during prolonged replication mutations accumulate that affect grow under the original condition, but these are not selected against as they would be in the original host or at the original temperature.

Molecular biology

Surprisingly little is known about the molecular basis of attenuation. Probably that's because such work requires a system for reverse genetics. That would be some DNA-based clone where the mutations to be tested could be introduced, followed by a method to produce the live (typically RNA) virus. I've written about a new system of this type for SARS-CoV-2 where the clones are maintained in yeast.

In addition, the gold-standard would be to test the virus in primates like monkeys. That's really expensive and would need to fully justified. Without a strong need to know, it might be hard to get approval for such a study today.

One case where new antigens are substituted into an attenuated virus is the live influenza vaccine.

Influenza is a segmented virus. If two different strains of influenza infect the same cell, you can get reassortment. Suppose we start with a known attenuated mutant (due to changes in PB1 and PB2), and coinfect with an influenza virus whose HA and NA we want in the new vaccine. Just take the progeny viruses, clone them (propagate descendants from isolated single viruses), and then choose the one with the right genes: PB1 and PB2 from virus 1, and HA and NA from virus 2.

In summary, attenuation has been widely used but is still more magic than science. The best thing would be if an attenuated vaccine for the original SARS had been developed. Then you could just substitute the new RBD (receptor binding domain) and try it out. Unfortunately, it doesn't seem that was ever accomplished.

Tuesday, May 12, 2020

Measles virus

This is the first of a short series of posts in which I give some basic information on viruses that cause serious disease in humans. It's motivated by the current pandemic with SARS-CoV-2.

The focus is on viruses that are human-specific but may have "jumped" species, the development of live attenuated virus vaccines, and the general nature of host restriction.

In this post, we'll talk about measles. Measles is commonly known as rubeola (not to be confused with rubella), but also red measles and "English measles".

Morphology

Measles virus is a single-stranded negative-sense, enveloped RNA virus. The measles virus is a member of the Paramyxoviridae, family Morbillovirus. Below on the left is measles virus, its relative mumps virus (also a Paramyxovirus) is on the right.

Pathogenesis

Typically the first symptoms of measles include high fever of about 4 days duration, a characteristic rash, and what are called the "three C's": cough, coryza (runny nose) and conjunctivitis.

Here are two pictures showing the characteristic rash.

The rash is called "flat" because

A maculopapular rash is a type of rash characterized by a flat, red area on the skin that is covered with small confluent bumps. It may only appear red in lighter-skinned people. The term "maculopapular" is a compound: macules are small, flat discolored spots on the surface of the skin; and papules are small, raised bumps. It is also described as erythematous, or red.

There is a special pathognomonic sign called Koplik's spots, seen inside the mouth on the cheek next to the molars. Pathognomonic means it is characteristic enough to make a diagnosis by itself. However, the spots are transitory and frequently missed.

Epidemiology

Measles is a highly contagious infectious disease. Nine out of ten people who are not immune and share living space with an infected person will be infected. People are infectious to others from four days before to four days after the start of the rash.

Not only are presymptomatic individuals infectious, but the virus spreads by aerosol, meaning that it can survive in an infectious state even in the very small droplets that waft around and don't fall to the ground within a few minutes. The particles can stay airborne for hours. This ability is unusual, and indicates the virus resists inactivation due to drying out.

The CFR (case fatality rate) ranges from 1-3/1000 for a well-nourished, healthy individual, to as much as 10% or more, for other populations. Vitamin A-deficiency is very problematic, and supplementation is recommended.

According to wikipedia, measles killed 20 percent of Hawaii's population in the 1850s. In 1875, measles killed over 40,000 Fijians, approximately one-third of the population.

Typically the first tissue infected is the lining of the airways, but the virus eventually travels through lymph nodes, infects cells of the immune system, and then moves into the blood causing widespread viremia.

Bacterial pneumonia is one of the common sequelae, and that's what most people die from. Other problems include ear infections, blindness, severe diarrhea, encephalitis (1/1000) and problems in pregnancy. In very rare cases (1/1M), measles can reactivate years later to cause SSPE (subacute sclerosing pan-encephalitis).

Host restriction

Measles virus infection is presumed to be sustained through an unbroken chain of human-to-human transmission, and no animal or environmental reservoir is known to exist. However, nonhuman primates can be infected with measles virus and can develop an illness similar to measles in humans with rash, coryza, and conjunctivitis. Many primate species are susceptible to measles virus infection, including Macaca mulatta ... Much of the evidence for the susceptibility of these nonhuman primates comes from laboratory colonies and the use of nonhuman primates as animal models for the study of measles virus pathogenesis.

One of the most interesting aspects of measles epidemiology is that the virus is so infectious, it runs out of hosts if the population is too small. With a larger population, it comes back every few years as a new crop of susceptible hosts develops.

To provide a sufficient number of new susceptibles through births to maintain measles virus transmission in humans, a population size of several hundred thousand persons with ∼5000–10,000 births per year is required

Surveys of wild populations have sometimes revealed non-human primates with antibodies to measles virus. It is believed that the virus was spread from humans to one of these animals, followed by limited spread and then die-out.

ref

Vaccine

The first laboratory to grow the virus was that of John Enders and colleagues. They also were first to culture poliovirus, which lead to work on the vaccines by Salk and Sabin. Enders et al received the Nobel Prize in 1954 for this work.

The vaccine strain is named for the boy from whom that virus was cultured, Edmonston.

The virus was weakened by successive culture in
- human kidneys
- human placenta
- hen's eggs
- chick embryos

Although significantly weakened by this serial culture, it still caused rash and fever, sometimes high enough so that children had seizures.

The first thing Hilleman did was give the vaccine together with gamma globulin from people who had recovered from measles. He then passed Enders' measles vaccine strain through
chick embryo cells more than 40 times.

Vaccinated is a biography of Hilleman. It tells the story of Hilleman obtaining specially-bred chickens that were free of chicken leukemia virus.

The vaccine is highly effective.

Despite significant diversity of virus isolates, Measles virus remains a monotypic virus for which protective immunity is induced by vaccine strains first isolated in the 1950’s.

Origin

The genus Morbillivirus includes similar viruses that infect dogs, cats, whales, seals and cattle. The disease of cattle is referred to by a Boer term: Rinderpest. Of the relatives, the rinderpest virus is the closest to measles.

At NCBI I searched for measles and found 363 genome nucleotide sequences, most of which appear complete. I just chose one at random, NC_001498, and then got the sequence of its nucleocapsid gene, NP_056918. A BLAST search gave a large number of hits with other Measles virus isolates down to 97% identity.

Restricting the search to Rinderpest (taxid: 11241), I got numerous hits as well, in the range of 75-80% identity. But if you look at the alignments, there is a C-terminal region that diverges. The N-terminal 400 aa (of 524) matches very well. Restricting the search to 1-400 the matches were more like 88% identical, like this one:

Here is a phylogenetic tree of Morbilliviruses from this review:

One can often recognize present-day diseases in descriptions from ancient times, but measles is missing from those accounts. The first systematic description of measles, and its distinction from smallpox and chickenpox, is credited to the Persian physician Rhazes (860–932), who published The Book of Smallpox and Measles.

By analyzing the diversity of the sequences of viral isolates, it is believed that the last common ancestor of Measles virus and Rinderpest occurred about 1000 AD plus or minus. It is also thought that the virus "jumped" from cattle to humans, due to domestication of livestock and growth of the human population to a level that could support the virus.

Links

Enders Nobel prize and lecture
Enders biography
Hilleman biography
Hilleman obit

Monday, May 11, 2020

Introduction to animal viruses

A long time ago, in a place far far away, I used to give lectures to medical students about microbial physiology and genetics. I also gave two lectures introducing the viruses that infect humans: not much about disease yet, but morphology and replication strategies, and so on.

Here is one figure I used, it is a cartoon of what various RNA viruses look like in the EM, drawn to scale. (It's from Lange). The morphology is quite diverse.

Our attention is currently focused on the one in the middle of the top row. Now, there are a lot of properties that viruses have: is the genome RNA or DNA, single- or double-stranded and so on. Also shape of the capsid, lipid envelope or not.

I had a hard time remembering all this stuff (I actually could not), so I made up a picture that was successful. This is the general organization for RNA viruses.

Here is how I used it.

So the idea is, you remember the order of the different viruses in the table, + sense on top, - sense underneath, with one double-stranded at the end of the second row.

Then you memorize a pattern of active dots for each property that you need to know. I required them to know which viruses were enveloped, and which had segmented genomes.

There is another thing that's a bit of a detail, but some may find it useful. There is so much material in lectures, especially to medical students, that it has been described as "trying to drink from a fire hose." I am very sympathetic to them on this issue. I adopted the strategy to color-code text on slides: blue means you must know this, black means it's important and you may need to know this, and gray means it's something I want to talk about but you do not need to know this. And then another color, like salmon, for the title of a slide to tell what it's about.

Here's another summary slide showing the Arboviruses. These are "arthropod-borne" viruses (i.e. insect-borne).

This is the sum total of Coronavirus information (characteristics of the infectious process were taught later, in the context of lung infections).

Finally, here's another cartoon of DNA viruses.

I was pretty proud of myself for coming up with that aid to memory.

I have put links to the two lectures up on Dropbox. I can't guarantee they'll stay up for ever, but we'll see. Introduction to viruses Virus systematics