Monday, April 12, 2021

Ptolemy again

 My last post was about Ptolemy's theorem.  In the meantime, I've come across a cool "proof without words" for this.


I can see how, knowing that we want ac + bd could lead you to work on scaling triangles into the parallelogram, which would then give all the angles for the central triangle (although note the switch in order for gamma and delta).

It's simply beautiful.

Monday, April 5, 2021

Ptolemy's theorem

 Ptolemy's theorem says that for a cyclic quadrilateral, the product of opposing sides, summed, is equal to the product of the diagonals:


Proof. adapted from wikipedia (here).

Let the angle s (red dot) subtend arc AB and the angle t (black dot) subtend arc CD.  Then the central angle DPC = s + t and it has sin s + t.  The other central angle APD has the same sine, as it is supplementary to s + t.

Let the components of the diagonals be AC = q + s and BD = p + r.  

Twice the areas of the four small triangles will then be equal to

(pq + qr + rs + sp) sin s + t

Simple algebra will show that 

(pq + qr + rs + sp) = (p + r)(q + s) = AC BD

The product of the diagonals times the sine of either central angle is equal to twice the area of the quadrilateral.  We're on to something.

Now, the great idea.  Move D to D', so that AD' = CD and CD' = AD.  

The triangles ACD and ACD' are congruent, by SSS, so they have the same area.

Therefore the area of ABCD is equal to the area of ABCD'.

Some of the angles switch with the arcs.  In particular, angle t (black dot) now subtends arc AD'.  As a result s + t is the measure of the whole angle at vertex C.  The whole angle at vertex A is supplementary, and the sine of the whole angle at vertex A is equal to that at C.

So twice the area of triangle ABD' is AB AD' sin s + t, and twice that of BCD' is BC CD' sin s + t.  Add these two areas, equate them with the previous result, and factor out the common term sin s + t:

AC BD = AB AD' + BC CD' 

But AD' = CD and CD' = AD so


This is Ptolemy's theorem. 

Here is one result from the theorem.  Draw an equilateral triangle and its circumcircle.  Pick any other point P on the circle and connect it to the vertices as shown.

Ptolemy says that

ps = qs + rs

p = q + r

Monday, March 29, 2021

Image conversion for FB

I got an iPhone 11 Pro about 18 months ago.  I love the quality of the images.  Low-light is still more amazing.  

But the image file type is .HEIC.  I have a recollection that Facebook didn't accept .HEIC at the time, when trying again now I find that it does.  In any event, my workflow since then has included opening all the images up in Preview and manually exporting them to relatively low quality .jpg files, one by one.

I finally got tired of this.

I found a page where the guy shows how to set up the macOS Automator app to tell Preview to do the conversion.  Some day I might try that.  But it got me thinking.  

I used trusty old Homebrew to install imagemagick.  (It took a while, there are *a lot* of dependencies).  Anyhow, so then I just cd to the Downloads directory and do this in Terminal:

magick convert -quality 10 IMG_9836.HEIC IMG_9836.jpg

It works great.  I get a 10-fold compressed jpg compared to the original.  So even if FB can now handle .HEIC I'm saving all that bandwidth on the upload.

The next problem was how to specify the new filenames for batch mode.  One could use sed, but it's so awkward to read.  Or I could always go to Python to bundle up the processes, but eventually I found a one-liner that works in Terminal:

for f in *.HEIC;  do magick convert -quality 10 $f ${f:0:9.}jpg;  done

The first filename is the source, like IMG_9836.HEIC.  The second filename is ${f:0:9.}jpg, it extracts characters at specified positions from the string variable.  I didn't know you could do this in the shell.

It's a hack, but it works in this case because the variable names are all exactly the same length.  Anyway, it's great.  As long as you're not afraid of Terminal.

Thursday, January 14, 2021

Senator Hawley, provocateur

Sen. Josh Hawley of Missouri voted to object to the certification of the ballots of PA electors.  A statement from him that was printed in a Missouri newspaper is also available on Twitter.  Hawley claims that when the PA legislature authorized mail-in ballots what they did was unconstitutional. 

The "Pennsylvania Supreme Court .. changed the rules for when mail-in ballots could be returned ... the Pennsylvania Supreme Court threw out the case on on procedural grounds, in violation of its own precedent.  To this day, no court has found the mail-in voting scheme to be constitutional, or even heard the merits of the case."

It is true that they tried to change the rules for when mail-in ballots could be returned, but that is a different case.  Hawley is deliberately conflating two different issues.  The question about timing of return was addressed by the U.S. Supreme Court. This case is about whether absentee voting is constitutional.  (Hawley is being dishonest).

So then, here's a bit of history about absentee voting (voting by mail, or mail-in ballots) in PA.  You can read what the PA Constitution says about absentee voting here.

"The Legislature shall provide a manner in which .. qualified electors .. [who] are unable to attend at their proper polling places .. may vote, and for the return and canvass of their votes".

And it gives examples of reasons:  "because of illness or physical disability or who will not attend a polling place because of the observance of a religious holiday or who cannot vote because of election day duties, in the case of a county employee".

Sen. Hawley relies upon a restrictive reading of this text where he claims that what the PA legislature did was unconstitutional (that only the listed reasons are valid).  There has been a lot of discussion about why that is not the case (speech by Sen. Casey, good newspaper review).  It is true that the PA Supreme Court has not actually ruled on this issue.

 In October, 2019 (not "last year"), the PA legislature passed, and the governor signed into law Act 77 of 2019 in Pennsylvania (P.L. 552, No. 77).  You can read the text here, but like most legislation it is frankly, unreadable.  It provides for universal mail-in ballots.

It is noteworthy that the bill was introduced by Republicans and passed with majority support by Republicans.  Several previous elections have been run using these procedures.

On Nov. 21 (18 days after the 2020 presidential election) a suit was filed (by U.S. Representative Mike Kelly and others) making the claim of unconstitutionality.  The PA Supreme Court ruled against them on Nov. 28, on the grounds that their claim was untimely.  You can read the Order here.

"As a remedy, Petitioners sought to invalidate the ballots of the millions of Pennsylvania voters who utilized the mail-in voting procedures established by Act 77 and count only those ballots that Petitioners deem to be “legal votes."

In a concurrence, Justice Wecht wrote

"Unsatisfied with the results of that wager, they would now flip over the table, scattering to the shadows the votes of millions of Pennsylvanians,"

In summary then, Republicans voted to allow mail-in ballots in PA, ran several elections this way without challenging (their own) law, and then filed suit to challenge the process, more than two weeks after their preferred candidate for President lost the election in PA.

Not only did the PA Supreme Court dismiss this case, but it is a fundamental tenet of election law (as promulgated by the U.S. Supreme Court), that authorities may not change the rules after the election.  Voters are entitled to rely on the rules established beforehand.

You can read a good post about election law here, and an NPR article about the recent case here.

Sen. Hawley is dishonest, but you knew that.  Was this done deliberately by the GOP with the plan to try to overturn the election?  Who knows?  

What seems clear is that the modern GOP is the party of voter suppression, the party that (arguably) wants to return to Jim Crow, by disenfranchising Black voters.

Thursday, December 24, 2020

Archimedes and pi

 I've written a lot about Archimedes and the approximation of pi over the years (appropriated from the web, actually).

Most of that material is in my Analytic Geometry book.  I am in the process of re-doing the Github repos for the books, so they're not currently up-to-date, but that will change over the next week or so.  You can find me on Github here.

In the last couple of days, I wrote a summary of the math of Archimedes with respect to pi, including how his formulas connect with others based on perimeter and area that are due to Gregory.  There is also extensive discussion of methods to approximate the square root of 3, essential to Archimedes work.

That pdf is on Dropbox here.  Enjoy, and as always, comments would be welcome.

Monday, December 14, 2020

Pain with mac OS

I've run Macs ever since 1984.  Never had anything else except one Raspberry Pi running Linux.  

But I'm alarmed by clear signs of decay in QA for Mac system software.  Of course, they say you get what you pay for.  :)  But Apple is a hardware company and I've gladly paid a premium for a beautiful OS and well-designed hardware.

My current issue (though not the first) is the latest mac OS 11 (Big Sur).

So there I was, minding my own business, running Catalina and streaming video through the browser, Safari, when out of nowhere I got the spinning beachball.  Machine is unresponsive, unable to kill the app, transitioning to a black screen of death, which was new to me.  Bizarrely, it had a cursor which responded to the trackpad.

Luckily, there is this thing called googling.  Turns out you can press the power button for 10 seconds and the laptop will shut down.  Reboot and it seems fine.  So what happened?  The internet says that it has to do with system software updating, some conflict that happens with sleep.

I go to System Prefs > Software Update and it looks like this:

I never set this.  My 1 yr old laptop came with this as the default.  So the auto-updater crashes the machine.  Apparently it is downloading/installing Big Sur.

I uncheck everything.  Then, some demon tempted me to download and install Big Sur myself, which I did manually from the main screen for this preference.  Repeatedly (3x), I download 12 GB, the install fails, and then the 12 GB is deleted (or at least, neither I nor Update knows where it is).  You gotta start over.  The third time, I am on it when the download finishes, and I hit the button, so it goes.  

Not crazy about the updated UI, but that's life.

Now, one day later, there's an update.

Since auto-update can crash, I have only the first checkbox selected in the first window, above.  

I manually download the 11.1 update (Update Now).  Three times through, I download 3.x GB, then it stalls.  Right here.


I don't have the patience to do anything more.  I'm just angry.  How hard could it be to set up a fake install and test whether it works properly?  And then to throw it away each time.  They just don't frickin' care anymore.

Saturday, July 25, 2020

Rosalind Franklin's centenary

Today is the 100 year anniversary of the birth of Rosalind Franklin, and I have something to say about Nobel Prizes and scientific reputations.

I think we can agree, from the start, that prizes for science suck.  In particular, the Nobel Prize sucks.  Some prizes have been given to yahoos.  Other prizes have been given to mentors but not to the juniors who actually did the work, and in some of those cases the juniors had the great ideas.  See Enders, John for a counter-example.

Some prizes were limited because more people deserved them than just the maximum of 3.

Bob Gallo should have been awarded the Nobel Prize, he invented every tool for studing HIV that anyone ever used.  But he was an asshole, and the Swedes decided to make a point.

Rosalind Franklin arguably deserved the Nobel Prize.  Her major problem was that she died of cancer several years before the prize for DNA was awarded, and there is a rule against posthumous awards.

One often sees (now more frequently) the claim that Watson and Crick "stole" Franklin's data and received recognition for discovery of the double helix when it was rightfully hers.  It's feminist dogma.

It's also revisionist nonsense.

It is false that Franklin discovered the double-helix.  What she did was to take a beautiful X-ray picture of DNA that Crick knew immediately implied the double helix.  Franklin herself was unsure.  She wrote "if a helix..."  Furthermore, this would be her only contribution, seminal to be sure, but her only one.

Franklin was in some respects in competition with Watson and Crick, and she was definitely unhappy when it became known that her boss Wilkins (she disputed his authority), and Max Perutz, had made her picture available to Watson and Crick.  It had already been shown in local seminars.  Perutz, as straight an arrow who ever lived, was not bothered about how things went down.

Beyond that, the discovery of the double helix relied on deep crystallographic and structural knowledge of Crick, as well as the enormous luck of Watson.  He had no idea what the picture meant, but he knew about Chargaff's ratios, and he knew the correct structures for the bases in solution.  This was crucial.  The ultimate aha moment was Watson's, much as I would wish otherwise.  (Because he is also an asshole).

Franklin was, in years shortly following the events, friendly enough with Crick and his wife that they vacationed together.  There was no animosity between them about DNA.  She died 5 years later, in 1958.

If you dispute any of this, go re-read Judson's book and Perutz's writing about this and get back to me.

The idea that Watson and Crick stole the double helix from Franklin is complete nonsense.

Sum of angles, revisited

I've run into a number of different derivations for the "sum of angles" formulas over the years.  These are 

cos s + t = cos s cos t - sin s sin t
sin s + t = sin s cos t + cos s sin t

For example, here is one using Euler's formula, and here is a derivation of the difference version of the cosine that I encountered in Gil Strang's Calculus textbook.  As I've said, I find it only necessary to memorize one:

cos s - t = cos s cos t + sin s sin t

which is easily checked for the case where s = t.

Recently I ran into a couple more.  Probably the simplest is this one:

Two stacked right triangles with a surrounding rectangle.  The upper triangle is sized so that the hypotenuse is 1 and the sine and cosine are obvious.  For the triangle with angle phi, we need an extra term in the sine and cosine so that when dividing, say, opposite/hypotenuse, the result is correct.

The angle theta + phi is known by the alternate interior angles theorem, and the small triangle with angle phi is known by a combination of the complementary and supplementary angles theorems.

Now, just read the result.  One diagram gives both formulas!  

There is also a simple derivation for the sine formula based on area calculations.  We calculate the area of this triangle in two ways:

On the left, we have that

A = (1/2) a sin (theta + phi) b

On the right (h = a cos phi = b cos theta) and

A = (1/2) h a sin phi + 1/2 h b sin theta 
  = (1/2) (b cos theta a sin phi + a cos phi b sin theta)

Equating the two and factoring out (1/2)ab, we obtain the addition formula for sine.  (Unfortunately, I've lost the links where I saw these).

Here is a link to a rather comprehensive chapter on the subject from my geometry book (Github).

Thursday, June 25, 2020


I've been working with maps using Python, primarily maps for the United States.  The standard format for much geographic data is GeoJSON.  

But there is another format that is even more official, and is maintained by the US Census Bureau.  That is a collection of Shapefiles.

To recap, here is a site where GeoJSON files for the Us are available in multiple sizes (small, medium, large), as well as their .kml and Shapefile .shp equivalents.  The sizes are 500k, 5m, 20m, from largest to smallest (the labels are the scales, 500k being the most detailed).
The files were obtained from links on this page.  It includes data for the US, for the states, and for counties.  And it contains Congressional districts, which would be useful to remember.

A Shapefile is binary-encoded geographic data in a particular format.  A good discussion is here.

The specification was developed at least partly by ESRI, which develops geographic information software..  The encoding was undoubtedly designed to save space, back when space (storage, transmission bandwidth) was much more expensive.  Now, the opaque data is a liability.


There is actually not just one file, but always a minimum of three including .shp, .shx and .dbf inside a zip container.  The Shapefiles from the US Census also have .prj and .xml.

I can't tell much by looking at one with hexdump, except that most of it is aligned (in sections), on 16-byte boundaries.  The format is described at this link, but I haven't worked with that.

One way to open and read a Shapefile is to use geopandas.  I grab that with pip3 install geopandas

The example is

>>> d = 'gz_2010_us_040_00_20m'
>>> fn = d + '/gz_2010_us_040_00_20m.shp'
>>> df = gpd.read_file(fn) 

The columns are:
  • GEO_ID
  • NAME
  • LSAD
  • geometry
>>> df.NAME.head()
0        Arizona
1       Arkansas
2     California
3       Colorado
4    Connecticut
Name: NAME, dtype: object

For some reason the order is several joined partial lists of states, each one alphabetized.

We need to extract the coordinates for a particular state:

import geopandas as gpd
fn = 'ex.shp'
df = gpd.read_file(fn)
sel = df['NAME'] == 'Maine'
g = me = df.loc[sel].geometry
from shapely.geometry import mapping
D = mapping(g)
for f in D['features']:
L = f['geometry']['coordinates']

for m in L:

> python3
(-69.307908, 43.773767)
(-69.30675099999999, 43.775095)
(-69.42792, 43.928798)
(-69.423323, 43.922871)

Pythagorean theorem redux

Here is a cool proof I saw on Twitter, it was an RT by @StevenStrogatz from this guy:

Take a generic right triangle.  Flip, rotate and scale it by multiplying each side by a factor of b.  I did this by imagining that a = 1 and then b = ab.  The complementary angles are marked with circles on the right.

So then construct a rotated triangle from the same input, but scaled by a factor of a, and attach it to the other one.  We know that the sides marked ab are parallel, and that the angle between bc and ac is a right angle, by the properties of right, complementary and supplementary angles.

The four outside vertices form a rectangle, from the angles and also since ab = ab.

Finally, rotate and scale again, by a factor of c:

Slide them together

Albers projection

The standard projection used in converting latitudes and longitudes on the (roughly) spherical earth to a planar map depends on what you’re projecting.

Most people know about the Mercator projection.

The one used extensively for maps of the United States is called the Albers Equal-Area Conic projection.

The wikipedia article gives some formulas:

You must first choose two reference latitudes.  Latitudes are referred to by the letter phi and longitudes by the letter lambda.  So the two references are phi_1 and phi_2. 

You also choose a center for the map, at phi_0, lambda_0.

From these four values you calculate n, C and then rho_0, which are each the same for every transformation in this projection.  

Finally, for each coordinate phi, lambda one calculates rho and theta and then finally

x = rho sin theta
y = rho_0 - rho cos theta

This, however, is for the assumption of a spherical earth.  The equations for an ellipsoid are a bit harder.

The wikipedia article gives this url for a pdf and I found the same report referenced in this answer to a question on Stack Exchange.

so that was lucky, because it gave me not only the equations but also a worked numerical example for each, which helped greatly in finding my mistakes in the code.

The math is explained in a write-up done in LaTeX, as a Dropbox link to a pdf.  The math has been encoded as Python scripts and here.

Here are screenshots from the manual:

The test is a latitude of 35°N., -96°W., which should give a result (for the ellipsoid) of 

> python 
p0: 23.0
l0: -96.0
p:  35.0
l:  -75.0
x:  1885472.7
y:  1535925.0

The first part of the output is the center we chose.  The result for x and y matches the source.

So then, for a script in the same directory as, we can do import project and do 

> python3 counties.geo.txt AL

No longer squashed.

Naturally, there is software out there that will do this.  In particular, the Proj transformation software.

I obtained it with Homebrew

> brew install proj

> echo 55.2 12.2 | proj +proj=merc +lat_ts=56.5 +ellps=GRS80
3399483.80 752085.60

which matches their example.

Of course, we really want the Albers equal area conical projection based on the ellipsoid.
> echo -75 35 | proj +proj=aea +lat_1=29.5 +lat_2=44.5 +lat_0=23 +lon_0=-96 +ellps=clrk66
1887211.95 1533994.75
These are close but don't quite match.  I will have to explore Proj more to know why.

Put the input data into a file and do:
> cat data.txt | proj +proj=aea +lat_1=29.5 +lat_2=44.5 +lat_0=23 +lon_0=-96 +ellps=clrk66
1887211.95	1533994.75

Plotting polygons

In the previous two posts (here and here), we looked at geographical heat maps (choropleth maps), constructed using plotly.  The purpose was to visualize COVID-19 data.  Now, we're going to dive deeper into maps, and for that purpose I made a new github repo.

As we said, it is easy to use to make a map, say of the entire United States, or countries of the world, or US counties.

However, there are some limitations.  plotly is developed by a company whose main product appears to be Dash, which aims to be a fancy UI layer for working with data science overlying the basic widgets and libraries.

Mapping, in particular, is an issue.  While is easy, it is also limited.  

Once you want to dive deeper (for example, to zoom in on a map of a single state before display, or add labels to a map), you run into the fact the express layer doesn't allow it, and that the underlying maps are actually from a company called Mapbox.  

Mapbox wants to be a kind of Google maps that you embed in your iPhone app, only better.  They provide an introductory free level, but you have to sign up and obtain a Mapbox Access Token for many things.  

I decided to go without.  For the moment, I'm going to continue with plotly, although ultimately I'm leaning toward the position that matplotlib will be a better choice.

So what we're going to try here is to plot the map components (states) as polygons.  We'll run into some issues and solve them.

The first script is

There's nothing special about it except (i) it deals with the encoding issue we talked about before and (ii) it has a significant bug, which we will diagnose and fix.

> python3 counties1.json > counties1.geo.txt

-86.496774 32.344437
-86.717897 32.402814
-86.814912 32.340803
-86.890581 32.502974
-86.917595 32.664169
-86.71339 32.661732
-86.714219 32.705694
-86.413116 32.707386
-86.411172 32.409937
-86.496774 32.344437

Now we'll just use plotly to draw all the polygons for the state of Alabama using 

> python3 counties1.geo.txt AL

That looks kind of promising, it is recognizably Alabama.

There are a couple of problems.  The first is that the shape is squashed.  

And the second, more subtle, is that there is a missing county at the southwest corner:  Mobile.

This issue becomes more obvious when we try California.

That's definitely wrong.  We are missing Los Angeles Ventura and Santa Barbara counties, and if you look closely, San Francisco is missing as well.

The solution came to me when I realized that the term "MultiPolygon" had been mentioned, and I hadn't really understood what it was.  I had just adjusted for the case when the coordinates were wrapped by four '[[[[ .. ]]]]' versus three '[[[ .. ]]]' without thinking about it.

As an example, let's look at entries in the GeoJSON data that are MultiPolygons (most are just plain Polygon).  It is Bethel (a County or Census Area) in Alaska.  Here is a list of the first 18:

Bethel, AK
Hoonah-Angoon, AK
Juneau, AK
Kenai Peninsula, AK
Ketchikan Gateway, AK
Kodiak Island, AK
Lake and Peninsula, AK
Nome, AK
Petersburg, AK
Prince of Wales-Hyder, AK
Sitka, AK
Valdez-Cordova, AK
Wrangell, AK
Mobile, AL
Los Angeles, CA
San Francisco, CA
Santa Barbara, CA
Ventura, CA

We notice that Mobile and the four California counties are also on the list.

A map of Bethel reveals that it has not only an island but also a lake!

Bethel County (actually Census Area), Alaska is a MultiPolygon and has three arrays

 [[[-173.116905, 60.516005] ..  36 items .. [-173.116905, 60.516005]]],
 [[[-165.721389, 60.16962]  .. 128 items .. [-165.721389, 60.16962]]], 
 [[[-160.534142, 61.947257] .. 314 items .. [-160.534142, 61.947257]]]

Bethel County includes a large island called Nunivak Island.

The city of Mekoryuk on Nunivak has a lat,lon of 60.370411,-166.257202.  This suggests that the second list is the polygon for Nunivak.

[-166.310655, 60.377611], [-166.200019, 60.393404] is one adjacent pair of vertices in the second list.

I thought from the restricted range of values in the first list that it was probably the lake, but it's not.  It's a second, very small island.  I made a fake GeoJSON file with just Bethel County and then plotted it:

So the prediction is that Bethel County won't render properly by my script.  It's not the greatest test.  Most of Alaska is messed up (not shown)

I go back and fix and then run the plotting script:

python3 counties.geo.txt CA

We have found the missing counties.

The next problem to solve is the squashed forms of the states.  The reason for that is that the latitudes and longitudes refer to the globe (an oblate spheroid), while the map is a projection of those points onto a 2-dimensional surface.  There are a variety of projections and the math is pretty complicated, so we will save that for next time.

Wednesday, June 24, 2020


GeoJSON is a JSON representation of geographic data.  Although JSON was developed for javascript, the format should be very familiar to any Python programmer.  There are collections:  dicts and lists, as well as simple types like:  string, number, boolean or null.  Dictionaries are called objects, and may also be values contained in lists or other dictionaries.

A typical GeoJSON file looks like this:
     "geometry": {"type":"Polygon",
                  "coordinates": [[ [-86.496774, 32.344437],
                                    [-86.496774, 32.344437]
                   "STATE": "01", 
                   "COUNTY": "009", 
                   "NAME": "Blount"
(I've formatted whitespace to my taste, YMMV).

At top level, it is a dict with two keys.  The "type" is "FeatureCollection".

The second key is "features", which is a list of Feature objects.

The "id" is a FIPS code for the Feature.  This Feature is Autauga County, Alabama.  The FIPS code for the state is "01" and the full FIPS for the county is "01001".

Each Feature is a dict with four keys.  The "properties" and "geometry" keys yield dicts in turn.  The "geometry" dict has a key "coordinates" which gives a list of longitude and latitudes for each vertex of the Feature's map.

The coordinates are nested inconsistently.  It it's two deep:  a list with one element that is a list with many 2-element vertices.  Sometimes it's three deep.

[ Update:  this is a great over-simplification.  The problem is that these are Multipolygons, and they do mean multi.  That will have to wait for another post.  Here's a link to the Mapbox documentation. ]

The number of vertices depends on the resolution.  For some counties at high resolution, it could run to 300 or more tuples.

For use with, there's no reason to parse this stuff or to generate a file with only the desired Features.  

Just load the data

import json
with open(fn,'r') as fh:
    counties = json.load(fh)

then make a pandas data frame with one or more elements like

    fips  value
0  01001      1
1  06071      2


import pandas ad pd
pd.DataFrame({'fips':list_of_fips, 'value': list_of_values})

(or just pass {'fips':list_of_fips, 'value': list_of_values} as the df argument to choropleth).

fig = px.choropleth(
    color=["Autauga","San Bernardino"],
The locations argument says to match the "fips" from the data frame with the default value in the GeoJSON, which is its "id".  The colors can be specified as a color list like

cL = ['green','magenta']

See the plotly tutorial section on colors for more detail.  The scope argument limits the region plotted on the map.

States are even easier, since plotly already knows the state GeoJSON data.

import as px
fig = px.choropleth(

The county GeoJSON data is available from plotly, see the first example for the URL).

I came across a collection of GeoJSON files on the web.  This data is originally from the US Census, and has been converted to GeoJSON by the author of that post.

One slight hiccup is that the GeoJSON file includes Puerto Rico, which has municipalities with Spanish names.  These are encoded with ISO-8859-1.  

This uses high-order bytes to represent Spanish ñ, í and so on.  Some of these byte sequences are not valid UTF-8.

> python3
Python 3.7.7 (default, Mar 10 2020, 15:43:33) 
>>> fn = 'gz_2010_us_050_00_20m.json'
>>> fh = open(fn)
>>> data =
Traceback (most recent call last):
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 935286: invalid continuation byte

Take a look:

>>> fh = open(fn,'rb')
>>> data =
>>> data[935280:935290]
>>> list(data[935280:935290])
[34, 67, 111, 109, 101, 114, 237, 111, 34, 44]

This is for Comerío Puerto Rico.  The byte (237, 0xed) is mapped to the accented í   But this and the following two bytes are

>>> '{:08b}'.format(237)
>>> '{:08b}'.format(111)
>>> '{:08b}'.format(34)
In UTF-8  the first byte starts with 1110, which indicates that this is a three-byte sequence, but the third byte of the group does not begin with 01 and so is not valid.

There is a nice module to document such issues.  I do pip3 install validate-utf8.
> validate-utf8 src.json
invalid continuation byte
  COUNTY": "045", "NAME": "Comerío", "LSAD": "Muno", "CENSUSARE

To fix this

>>> fn = 'gz_2010_us_050_00_5m.json'
>>> fh = open(fn,'rb')
>>> s = fh.decode('ISO-8859-1')
>>> fn = 'counties.json'
>>> fh = open(fn,'w')
>>> fh.write(s)
>>> fh.close()

For my later explorations, I found it useful to parse the GeoJSON to a flat format with elements like:
-86.496774 32.344437
-86.717897 32.402814
-86.814912 32.340803
-86.890581 32.502974
-86.917595 32.664169
-86.71339 32.661732
-86.714219 32.705694
-86.413116 32.707386
-86.411172 32.409937
-86.496774 32.344437
Shapefiles and conversions are complicated enough that we'll leave them for another post.

COVID-19 data analysis

I've been working on a project to analyze the data on COVID-19 cases and deaths (GitHub repo). The data is aggregated from public health departments by some folks at Johns Hopkins.  Their dashboard is here, and the data are in this directory on Github.

Here is the help page for the covid project (it is the same for most of the scripts):
> python3 --help
-h  --help    help
-n   <int>    display the last n values, default: 7
-N   <int>    display N rows of data: default: 50
-c  --delta   change or delta, display day over day rise
-d  --deaths  display deaths rather than cases (default)
-r  --rate    compute statistics
-s  --sort    (only if stats are asked for)
to do:
-u   <int>    data slice ends this many days before yesterday 
-p  --pop     normalize to population
python [state] -n 10 -sr
And here is the output (today) for that example:
> python3 SC -rs
               06/17 06/18 06/19 06/20 06/21 06/22 06/23  stats
Charleston      1264  1403  1554  1728  1836  2044  2251  0.094
Oconee            95   100   105   110   136   154   142  0.083
Pickens          348   367   429   464   499   529   570  0.083
Calhoun           47    48    58    62    69    73    74  0.082
total          20556 21533 22608 23756 24661 25666 26572  0.043
The statistic is a linear regression on cases, normalized to the mean of the values, and then the counties in my state (SC) are sorted according to the result.  Charleston is my county, and unfortunately, it is the county with the highest rate of growth of cases in the state.  Currently in the US, the top 20 counties are:

> python3 -rs -n 4 -N 20
                   06/20   06/21   06/22   06/23  stats
Thomas, KS             0       0      10      12  0.836
Hot Spring, AR        53      53     138     226  0.514
Holmes, FL            47      47      58     121  0.341
Jim Wells, TX         22      27      34      46  0.245
Brewster, TX          24      24      39      45  0.236
Erath, TX             44      44      44      85  0.227
McDonald, MO         170     366     371     403  0.215
Sharkey, MS            9       9      13      16  0.213
Blanco, TX            14      14      22      24  0.205
Newton, TX             6       8      11      11  0.2
Aroostook, ME         11      17      19      21  0.188
Tehama, CA            34      34      53      54  0.181
Sioux, ND             12      12      19      19  0.181
Okfuskee, OK           7       7      11      11  0.178
Bourbon, KS            9       9      14      14  0.174
Lawrence, MO          11      13      13      19  0.171
Harvey, KS            13      13      20      20  0.17
Pontotoc, MS          93      93     128     146  0.169
Letcher, KY            8       8       8      13  0.162
Live Oak, TX          10      10      15      15  0.16

I chose n = 4 so the output would be formatted correctly for the blog post.

As with any large dataset, there are some problems to work through, which are not solved perfectly yet.  Also, I've focused more on the U.S. lately, so scripts for world data haven't been updated yet either.

What I got interested in and want to show is the generation of maps of the US by states or counties, or one or a few states by counties, where the fill color is based on, for example, the growth rate of cases.  Here is the US by states.

I haven't generated the color bar horizontally yet, I just cut it out and rotated it, so the writing is rotated as well.

This type of map is called a choropleth map.  I stumbled across a python tool for generating maps.  It's part of the plotly library.  It is as simple as 

fig = px.choropleth(

The details are slightly complicated, but not bad. 

df is a pandas data frame that maps states by two-letter abbreviation to the corresponding statistic.
df = pd.DataFrame(data={'state':abbrev, 'value':st})
You need GeoJSON data for a county map (the states are already known to  That data file is available from them.

The colors are mapped to the statistic st as read from the data frame.  The last line of the call to px.choropleth assigns the title to the color bar.

The script for the US states is here and for the counties it is here.

This is the state of South Carolina today.  These colors are an attempt to make the positives pop out more.

There's much more to discuss.  I have always wanted to make a map of the US with my road trips plotted on it, something like this.  For that we need to talk about GeoJSON data and how to obtain it, as well as the Albers projection that is used in making maps.  It turns out that the standard methods from plotly have a significant limitation and I had a really weird bug in my code that I eventually figured out.

Finally, we'll need to find how to generate the data for each individual trip to overlay on the map.  That's all for later.

Monday, May 18, 2020

Virus attenuation

Vaccines and IgA

It is widely held that attenuated live-virus vaccines are the best vaccines, at least for respiratory diseases, because they are able to induce an IgA response.  Plus, nearly all viruses enter the host via mucosal surfaces (oral cavity, nose, gut, lungs), where the first line of defense includes IgA.

Poliovirus is the exception that proves the rule.  Polio grows extensively in the intestine, hence its transmission by the fecal-oral route, resulting in a unique epidemiological role for swimming pools in advanced countries.  However, polio doesn't seem to cause much pathology there.  The major trouble occurs when it moves to the nervous system.

OPV (oral polio vaccine) induces IgA while IPV (inactivated polio vaccine) does not.  This restriction is not a serious problem for the inactivated virus vaccine because it seems that IgG is sufficient to prevent polio's movement to the nervous system.

There is some prospect for development of adjuvants that would drive B-cells to switch to producing IgA.  One of the best understood is cholera toxin, where binding to certain cells stimulates IgA-promoting cytokines (IL-1, IL-6, IL-10).  However these are too toxic for use in humans.  Perhaps someday we'll understand enough to be able get an IgA response with a killed virus, but that moment has not come yet.  ref

Virus attenuation

We are left with the fact that historically, live virus vaccines have mainly been produced via attenuation.  We can look at some modern approaches elsewhere, like smallpox or adenovirus or retroviruses that are already attenuated and can display viral antigens.

Yellow Fever (YFV) virus ("the black vomit") is now a disease of the tropics, but it was once common in the US.  10% of Philadelphia's population was lost in an epidemic in 1793, and even more in New Orleans some years later.

In the mid-1930s, Max Theiler found that YFV could grow in mouse embryos.  So it was passaged from one mouse embryo to another, and then it was found that, somehow, the virus acquired the ability to grow in chicken embryos.  So one general approach is to try to grow the virus in some kind of cells (anything):  human, if necessary, or monkeys or mice and then adapt them to grow in chicken cells.

The chicken cells can either be in a whole embryonated hen's egg (i.e with a growing embryo), or cells growing in a culture dish.  Here is a picture showing the different sites within the egg that are suited for different viruses (I believe these are mostly viruses that have been adapted to grow in eggs already).

I haven't read enough to know, but I would suspect that cultured cells for virus were usually CEF (chick embryo fibroblasts), which are easy to prepare.  You mince the embryo, first removing the head, and put the pieces in culture.  A few days later you harvest the growing cells by treatment with the enzyme, trypsin.  A few cells are transferred to a new flask.  Now, you have fibroblasts which will grow for a number of generations, and no more embryo.  These are called primary cultures.

Today a number of cell lines have been derived from chicken that will divide forever.  There is a famous cell line from humans called HeLa which you may have read about.

The measles virus (MV) was first grown in human kidneys in tissue culture, then in human placentas in tissue culture, and then in chicken eggs.  Later, it was adapted to primary cultures of CEF.  In rare cases cells derived from an aborted human embryo have been used.

Another approach is to adapt the virus to growth at lower temperatures.  Due to paywall restrictions, I haven't been able to read much of the literature on this, but I believe it was done in CEF growing at 25°C.  There is a well-known influenza live virus vaccine of this type.

So, a virus that normally infects humans and causes disease is adapted to grow in chicken cells, or adapted to grow at 25°C instead of 37°C, or both.  Afterward, you may find that the procedure yields a virus that no longer grows very well and does not cause disease in humans.

This may be because a virus can specialize in one or the other but not both.  Or it may be that during prolonged replication mutations accumulate that affect grow under the original condition, but these are not selected against as they would be in the original host or at the original temperature.

Molecular biology

Surprisingly little is known about the molecular basis of attenuation.  Probably that's because such work requires a system for reverse genetics.  That would be some DNA-based clone where the mutations to be tested could be introduced, followed by a method to produce the live (typically RNA) virus.  I've written about a new system of this type for SARS-CoV-2 where the clones are maintained in yeast.

In addition, the gold-standard would be to test the virus in primates like monkeys.  That's really expensive and would need to fully justified.  Without a strong need to know, it might be hard to get approval for such a study today.

One case where new antigens are substituted into an attenuated virus is the live influenza vaccine.

Influenza is a segmented virus.  If two different strains of influenza infect the same cell, you can get reassortment.  Suppose we start with a known attenuated mutant (due to changes in PB1 and PB2), and coinfect with an influenza virus whose HA and NA we want in the new vaccine.  Just take the progeny viruses, clone them (propagate descendants from isolated single viruses), and then choose the one with the right genes:  PB1 and PB2 from virus 1, and HA and NA from virus 2.

In summary, attenuation has been widely used but is still more magic than science.  The best thing would be if an attenuated vaccine for the original SARS had been developed.  Then you could just substitute the new RBD (receptor binding domain) and try it out.  Unfortunately, it doesn't seem that was ever accomplished.

Tuesday, May 12, 2020

Measles virus

This is the first of a short series of posts in which I give some basic information on viruses that cause serious disease in humans.  It's motivated by the current pandemic with SARS-CoV-2.

The focus is on viruses that are human-specific but may have "jumped" species, the development of live attenuated virus vaccines, and the general nature of host restriction.

In this post, we'll talk about measles.  Measles is commonly known as rubeola (not to be confused with rubella), but also red measles and "English measles".


Measles virus is a single-stranded negative-sense, enveloped RNA virus.  The measles virus is a member of the Paramyxoviridae, family Morbillovirus.  Below on the left is measles virus, its relative mumps virus (also a Paramyxovirus) is on the right.


Typically the first symptoms of measles include high fever of about 4 days duration, a characteristic rash, and what are called the "three C's":  cough, coryza (runny nose) and conjunctivitis.

Here are two pictures showing the characteristic rash.

The rash is called "flat" because
A maculopapular rash is a type of rash characterized by a flat, red area on the skin that is covered with small confluent bumps. It may only appear red in lighter-skinned people. The term "maculopapular" is a compound: macules are small, flat discolored spots on the surface of the skin; and papules are small, raised bumps. It is also described as erythematous, or red.
There is a special pathognomonic sign called Koplik's spots, seen inside the mouth on the cheek next to the molars.  Pathognomonic means it is characteristic enough to make a diagnosis by itself.  However, the spots are transitory and frequently missed.


Measles is a highly contagious infectious disease.  Nine out of ten people who are not immune and share living space with an infected person will be infected. People are infectious to others from four days before to four days after the start of the rash.

Not only are presymptomatic individuals infectious, but the virus spreads by aerosol, meaning that it can survive in an infectious state even in the very small droplets that waft around and don't fall to the ground within a few minutes.  The particles can stay airborne for hours.  This ability is unusual, and indicates the virus resists inactivation due to drying out.

The CFR (case fatality rate) ranges from 1-3/1000 for a well-nourished, healthy individual, to as much as 10% or more, for other populations.  Vitamin A-deficiency is very problematic, and supplementation is recommended.

According to wikipedia, measles killed 20 percent of Hawaii's population in the 1850s. In 1875, measles killed over 40,000 Fijians, approximately one-third of the population.

Typically the first tissue infected is the lining of the airways, but the virus eventually travels through lymph nodes, infects cells of the immune system, and then moves into the blood causing widespread viremia.

Bacterial pneumonia is one of the common sequelae, and that's what most people die from.  Other problems include ear infections, blindness, severe diarrhea, encephalitis (1/1000) and problems in pregnancy.  In very rare cases (1/1M), measles can reactivate years later to cause SSPE (subacute sclerosing pan-encephalitis).

Host restriction
Measles virus infection is presumed to be sustained through an unbroken chain of human-to-human transmission, and no animal or environmental reservoir is known to exist. However, nonhuman primates can be infected with measles virus and can develop an illness similar to measles in humans with rash, coryza, and conjunctivitis. Many primate species are susceptible to measles virus infection, including Macaca mulatta ... Much of the evidence for the susceptibility of these nonhuman primates comes from laboratory colonies and the use of nonhuman primates as animal models for the study of measles virus pathogenesis.
One of the most interesting aspects of measles epidemiology is that the virus is so infectious, it runs out of hosts if the population is too small.  With a larger population, it comes back every few years as a new crop of susceptible hosts develops.
To provide a sufficient number of new susceptibles through births to maintain measles virus transmission in humans, a population size of several hundred thousand persons with ∼5000–10,000 births per year is required
Surveys of wild populations have sometimes revealed non-human primates with antibodies to measles virus.  It is believed that the virus was spread from humans to one of these animals, followed by limited spread and then die-out.



The first laboratory to grow the virus was that of John Enders and colleagues.  They also were first to culture poliovirus, which lead to work on the vaccines by Salk and Sabin.  Enders et al received the Nobel Prize in 1954 for this work.

The vaccine strain is named for the boy from whom that virus was cultured, Edmonston.

The virus was weakened by successive culture in
- human kidneys
- human placenta
- hen's eggs
- chick embryos

Although significantly weakened by this serial culture,  it still caused rash and fever, sometimes high enough so that children had seizures.

The first thing Hilleman did was give the vaccine together with gamma globulin from people who had recovered from measles.  He then passed Enders' measles vaccine strain through
chick embryo cells more than 40 times.

Vaccinated is a biography of Hilleman.  It tells the story of Hilleman obtaining specially-bred chickens that were free of chicken leukemia virus.

The vaccine is highly effective.

Despite significant diversity of virus isolates, Measles virus remains a monotypic virus for which protective immunity is induced by vaccine strains first isolated in the 1950’s.


The genus Morbillivirus includes similar viruses that infect dogs, cats, whales, seals and cattle.  The disease of cattle is referred to by a Boer term:  Rinderpest.  Of the relatives, the rinderpest virus is the closest to measles.

At NCBI I searched for measles and found 363 genome nucleotide sequences, most of which appear complete.  I just chose one at random, NC_001498, and then got the sequence of its nucleocapsid gene, NP_056918.  A BLAST search gave a large number of hits with other Measles virus isolates down to 97% identity.

Restricting the search to Rinderpest (taxid: 11241), I got numerous hits as well, in the range of 75-80% identity.  But if you look at the alignments, there is a C-terminal region that diverges.  The N-terminal 400 aa (of 524) matches very well.  Restricting the search to 1-400 the matches were more like 88% identical, like this one:

Here is a phylogenetic tree of Morbilliviruses from this review:

One can often recognize present-day diseases in descriptions from ancient times, but measles is missing from those accounts.  The first systematic description of measles, and its distinction from smallpox and chickenpox, is credited to the Persian physician Rhazes (860–932), who published The Book of Smallpox and Measles.

By analyzing the diversity of the sequences of viral isolates, it is believed that the last common ancestor of Measles virus and Rinderpest occurred about 1000 AD plus or minus.  It is also thought that the virus "jumped" from cattle to humans, due to domestication of livestock and growth of the human population to a level that could support the virus.


Enders Nobel prize and lecture
Enders biography
Hilleman biography
Hilleman obit