Python for Bioinformatics: June 2020

Thursday, June 25, 2020

Shapefiles

I've been working with maps using Python, primarily maps for the United States. The standard format for much geographic data is GeoJSON.

But there is another format that is even more official, and is maintained by the US Census Bureau. That is a collection of Shapefiles.

To recap, here is a site where GeoJSON files for the Us are available in multiple sizes (small, medium, large), as well as their .kml and Shapefile .shp equivalents. The sizes are 500k, 5m, 20m, from largest to smallest (the labels are the scales, 500k being the most detailed).

The files were obtained from links on this page. It includes data for the US, for the states, and for counties. And it contains Congressional districts, which would be useful to remember.

A Shapefile is binary-encoded geographic data in a particular format. A good discussion is here.

The specification was developed at least partly by ESRI, which develops geographic information software.. The encoding was undoubtedly designed to save space, back when space (storage, transmission bandwidth) was much more expensive. Now, the opaque data is a liability.

Shapefiles

There is actually not just one file, but always a minimum of three including .shp, .shx and .dbf inside a zip container. The Shapefiles from the US Census also have .prj and .xml.

I can't tell much by looking at one with hexdump, except that most of it is aligned (in sections), on 16-byte boundaries. The format is described at this link, but I haven't worked with that.

One way to open and read a Shapefile is to use geopandas. I grab that with pip3 install geopandas

The example is

>>> d = 'gz_2010_us_040_00_20m'
>>> fn = d + '/gz_2010_us_040_00_20m.shp'
>>> df = gpd.read_file(fn)

The columns are:

GEO_ID
STATE
NAME
LSAD
CENSUSAREA
geometry

>>> df.NAME.head()
0        Arizona
1       Arkansas
2     California
3       Colorado
4    Connecticut
Name: NAME, dtype: object

For some reason the order is several joined partial lists of states, each one alphabetized.

We need to extract the coordinates for a particular state:

import geopandas as gpd
fn = 'ex.shp'
df = gpd.read_file(fn)
sel = df['NAME'] == 'Maine'
g = me = df.loc[sel].geometry
from shapely.geometry import mapping
D = mapping(g)
for f in D['features']:
    print(f['id'])
    L = f['geometry']['coordinates']
    
    for m in L:
        print(len(m[0]))
        print(m[0][0])
        print(m[0][1])
        print()

> python3 script.py
39
11
(-69.307908, 43.773767)
(-69.30675099999999, 43.775095)
11
(-69.42792, 43.928798)
(-69.423323, 43.922871)
...

Pythagorean theorem redux

Here is a cool proof I saw on Twitter, it was an RT by @StevenStrogatz from this guy:

Take a generic right triangle. Flip, rotate and scale it by multiplying each side by a factor of b. I did this by imagining that a = 1 and then b = ab. The complementary angles are marked with circles on the right.

So then construct a rotated triangle from the same input, but scaled by a factor of a, and attach it to the other one. We know that the sides marked ab are parallel, and that the angle between bc and ac is a right angle, by the properties of right, complementary and supplementary angles.

The four outside vertices form a rectangle, from the angles and also since ab = ab.

Finally, rotate and scale again, by a factor of c:

Slide them together

□

Albers projection

The standard projection used in converting latitudes and longitudes on the (roughly) spherical earth to a planar map depends on what you’re projecting.

Most people know about the Mercator projection.

The one used extensively for maps of the United States is called the Albers Equal-Area Conic projection.

The wikipedia article gives some formulas:

You must first choose two reference latitudes. Latitudes are referred to by the letter phi and longitudes by the letter lambda. So the two references are phi_1 and phi_2.

You also choose a center for the map, at phi_0, lambda_0.

From these four values you calculate n, C and then rho_0, which are each the same for every transformation in this projection.

Finally, for each coordinate phi, lambda one calculates rho and theta and then finally

x = rho sin theta

y = rho_0 - rho cos theta

This, however, is for the assumption of a spherical earth. The equations for an ellipsoid are a bit harder.

The wikipedia article gives this url for a pdf and I found the same report referenced in this answer to a question on Stack Exchange.

so that was lucky, because it gave me not only the equations but also a worked numerical example for each, which helped greatly in finding my mistakes in the code.

The math is explained in a write-up done in LaTeX, as a Dropbox link to a pdf. The math has been encoded as Python scripts sphere.py and ellipsoid.py here.

Here are screenshots from the manual:

The test is a latitude of 35°N., -96°W., which should give a result (for the ellipsoid) of

> python ellipsoid.py 
test1
p0: 23.0
l0: -96.0
p:  35.0
l:  -75.0
x:  1885472.7
y:  1535925.0

The first part of the output is the center we chose. The result for x and y matches the source.

So then, for a script in the same directory as ellipsoid.py, we can do import project and do

> python3 map_counties.py counties.geo.txt AL

No longer squashed.

Naturally, there is software out there that will do this. In particular, the Proj transformation software.

I obtained it with Homebrew

> brew install proj


> echo 55.2 12.2 | proj +proj=merc +lat_ts=56.5 +ellps=GRS80
3399483.80	752085.60

which matches their example.

Of course, we really want the Albers equal area conical projection based on the ellipsoid.

> echo -75 35 | proj +proj=aea +lat_1=29.5 +lat_2=44.5 +lat_0=23 +lon_0=-96 +ellps=clrk66
1887211.95	1533994.75

These are close but don't quite match. I will have to explore Proj more to know why.

Put the input data into a file and do:

> cat data.txt | proj +proj=aea +lat_1=29.5 +lat_2=44.5 +lat_0=23 +lon_0=-96 +ellps=clrk66
1887211.95	1533994.75
>

Plotting polygons

In the previous two posts (here and here), we looked at geographical heat maps (choropleth maps), constructed using plotly. The purpose was to visualize COVID-19 data. Now, we're going to dive deeper into maps, and for that purpose I made a new github repo.

As we said, it is easy to use plotly.express to make a map, say of the entire United States, or countries of the world, or US counties.

However, there are some limitations. plotly is developed by a company whose main product appears to be Dash, which aims to be a fancy UI layer for working with data science overlying the basic widgets and libraries.

Mapping, in particular, is an issue. While plotly.express.choropleth is easy, it is also limited.

Once you want to dive deeper (for example, to zoom in on a map of a single state before display, or add labels to a map), you run into the fact the express layer doesn't allow it, and that the underlying maps are actually from a company called Mapbox.

Mapbox wants to be a kind of Google maps that you embed in your iPhone app, only better. They provide an introductory free level, but you have to sign up and obtain a Mapbox Access Token for many things.

I decided to go without. For the moment, I'm going to continue with plotly, although ultimately I'm leaning toward the position that matplotlib will be a better choice.

So what we're going to try here is to plot the map components (states) as polygons. We'll run into some issues and solve them.

The first script is extract_counties.first.py.

There's nothing special about it except (i) it deals with the encoding issue we talked about before and (ii) it has a significant bug, which we will diagnose and fix.

> python3 extract_counties.first.py counties1.json > counties1.geo.txt
>


Autauga
Alabama
01001
-86.496774	32.344437
-86.717897	32.402814
-86.814912	32.340803
-86.890581	32.502974
-86.917595	32.664169
-86.71339	32.661732
-86.714219	32.705694
-86.413116	32.707386
-86.411172	32.409937
-86.496774	32.344437

Baldwin
Alabama
01003
...

Now we'll just use plotly to draw all the polygons for the state of Alabama using

> python3 map_counties.first.py counties1.geo.txt AL

That looks kind of promising, it is recognizably Alabama.

There are a couple of problems. The first is that the shape is squashed.

And the second, more subtle, is that there is a missing county at the southwest corner: Mobile.

This issue becomes more obvious when we try California.

That's definitely wrong. We are missing Los Angeles Ventura and Santa Barbara counties, and if you look closely, San Francisco is missing as well.

The solution came to me when I realized that the term "MultiPolygon" had been mentioned, and I hadn't really understood what it was. I had just adjusted for the case when the coordinates were wrapped by four '[[[[ .. ]]]]' versus three '[[[ .. ]]]' without thinking about it.

As an example, let's look at entries in the GeoJSON data that are MultiPolygons (most are just plain Polygon). It is Bethel (a County or Census Area) in Alaska. Here is a list of the first 18:

Bethel, AK
Hoonah-Angoon, AK
Juneau, AK
Kenai Peninsula, AK
Ketchikan Gateway, AK
Kodiak Island, AK
Lake and Peninsula, AK
Nome, AK
Petersburg, AK
Prince of Wales-Hyder, AK
Sitka, AK
Valdez-Cordova, AK
Wrangell, AK
Mobile, AL
Los Angeles, CA
San Francisco, CA
Santa Barbara, CA
Ventura, CA
..

We notice that Mobile and the four California counties are also on the list.

A map of Bethel reveals that it has not only an island but also a lake!

Bethel County (actually Census Area), Alaska is a MultiPolygon and has three arrays

[
 [[[-173.116905, 60.516005] ..  36 items .. [-173.116905, 60.516005]]],
 [[[-165.721389, 60.16962]  .. 128 items .. [-165.721389, 60.16962]]], 
 [[[-160.534142, 61.947257] .. 314 items .. [-160.534142, 61.947257]]]
]

Bethel County includes a large island called Nunivak Island.

The city of Mekoryuk on Nunivak has a lat,lon of 60.370411,-166.257202. This suggests that the second list is the polygon for Nunivak.

[-166.310655, 60.377611], [-166.200019, 60.393404] is one adjacent pair of vertices in the second list.

I thought from the restricted range of values in the first list that it was probably the lake, but it's not. It's a second, very small island. I made a fake GeoJSON file with just Bethel County and then plotted it:

So the prediction is that Bethel County won't render properly by my script. It's not the greatest test. Most of Alaska is messed up (not shown)

I go back and fix extract_counties.py and then run the plotting script:

python3 map_counties.first.py counties.geo.txt CA

We have found the missing counties.

The next problem to solve is the squashed forms of the states. The reason for that is that the latitudes and longitudes refer to the globe (an oblate spheroid), while the map is a projection of those points onto a 2-dimensional surface. There are a variety of projections and the math is pretty complicated, so we will save that for next time.

Wednesday, June 24, 2020

GeoJSON

GeoJSON is a JSON representation of geographic data. Although JSON was developed for javascript, the format should be very familiar to any Python programmer. There are collections: dicts and lists, as well as simple types like: string, number, boolean or null. Dictionaries are called objects, and may also be values contained in lists or other dictionaries.

A typical GeoJSON file looks like this:

{"type":"FeatureCollection",
 "features":
  [
    {"type":"Feature",
     "properties":{"GEO_ID":"0500000US01001",
                   "STATE":"01",
                   "COUNTY":"001",
                   "NAME":"Autauga",
                   "LSAD":"County",
                   "CENSUSAREA":594.436
                  }
     "geometry": {"type":"Polygon",
                  "coordinates": [[ [-86.496774, 32.344437],
                                      ...
                                    [-86.496774, 32.344437]
                                ]]
                 }
     "id":"01001"
    }
 
    {        
    "type":"Feature",
    ... 
                   "STATE": "01", 
                   "COUNTY": "009", 
                   "NAME": "Blount"
    ...
    }
 
  ]
}

(I've formatted whitespace to my taste, YMMV).

At top level, it is a dict with two keys. The "type" is "FeatureCollection".

The second key is "features", which is a list of Feature objects.

The "id" is a FIPS code for the Feature. This Feature is Autauga County, Alabama. The FIPS code for the state is "01" and the full FIPS for the county is "01001".

Each Feature is a dict with four keys. The "properties" and "geometry" keys yield dicts in turn. The "geometry" dict has a key "coordinates" which gives a list of longitude and latitudes for each vertex of the Feature's map.

The coordinates are nested inconsistently. It it's two deep: a list with one element that is a list with many 2-element vertices. Sometimes it's three deep.

[ Update: this is a great over-simplification. The problem is that these are Multipolygons, and they do mean multi. That will have to wait for another post. Here's a link to the Mapbox documentation. ]

The number of vertices depends on the resolution. For some counties at high resolution, it could run to 300 or more tuples.

For use with plotly.express.choropleth, there's no reason to parse this stuff or to generate a file with only the desired Features.

Just load the data

import json
with open(fn,'r') as fh:
    counties = json.load(fh)

then make a pandas data frame with one or more elements like

    fips  value
0  01001      1
1  06071      2

import pandas ad pd
pd.DataFrame({'fips':list_of_fips, 'value': list_of_values})

(or just pass {'fips':list_of_fips, 'value': list_of_values} as the df argument to choropleth).

fig = px.choropleth(
    df,
    geojson=counties,
    locations='fips',
    color=["Autauga","San Bernardino"],
    color_discrete_sequence=cL,
    scope='usa')
    
fig.show()

The locations argument says to match the "fips" from the data frame with the default value in the GeoJSON, which is its "id". The colors can be specified as a color list like

cL = ['green','magenta']

See the plotly tutorial section on colors for more detail. The scope argument limits the region plotted on the map.

States are even easier, since plotly already knows the state GeoJSON data.

import plotly.express as px
fig = px.choropleth(
    locations=['CA','TX','SC'],
    locationmode="USA-states", 
    color=[1,2,3], 
    scope="usa")
    
fig.show()

The county GeoJSON data is available from plotly, see the first example for the URL).

I came across a collection of GeoJSON files on the web. This data is originally from the US Census, and has been converted to GeoJSON by the author of that post.

One slight hiccup is that the GeoJSON file includes Puerto Rico, which has municipalities with Spanish names. These are encoded with ISO-8859-1.

This uses high-order bytes to represent Spanish ñ, í and so on. Some of these byte sequences are not valid UTF-8.

> python3
Python 3.7.7 (default, Mar 10 2020, 15:43:33) 
..
>>> fn = 'gz_2010_us_050_00_20m.json'
>>> fh = open(fn)
>>> data = fh.read()
Traceback (most recent call last):
..
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 935286: invalid continuation byte

Take a look:

>>> fh = open(fn,'rb')
>>> data = fh.read()
>>> data[935280:935290]
b'"Comer\xedo",'
>>> list(data[935280:935290])
[34, 67, 111, 109, 101, 114, 237, 111, 34, 44]

This is for Comerío Puerto Rico. The byte (237, 0xed) is mapped to the accented í But this and the following two bytes are

>>> '{:08b}'.format(237)
'11101101'
>>> '{:08b}'.format(111)
'01101111'
>>> '{:08b}'.format(34)
'00100010'

In UTF-8 the first byte starts with 1110, which indicates that this is a three-byte sequence, but the third byte of the group does not begin with 01 and so is not valid.

There is a nice module to document such issues. I do pip3 install validate-utf8.

> validate-utf8 src.json
invalid continuation byte
  COUNTY": "045", "NAME": "Comerío", "LSAD": "Muno", "CENSUSARE
                                ^
...

To fix this

>>> fn = 'gz_2010_us_050_00_5m.json'
>>> fh = open(fn,'rb')
>>> s = fh.decode('ISO-8859-1')
>>> fn = 'counties.json'
>>> fh = open(fn,'w')
>>> fh.write(s)
>>> fh.close()

For my later explorations, I found it useful to parse the GeoJSON to a flat format with elements like:

01001
Autauga
Alabama
-86.496774	32.344437
-86.717897	32.402814
-86.814912	32.340803
-86.890581	32.502974
-86.917595	32.664169
-86.71339	32.661732
-86.714219	32.705694
-86.413116	32.707386
-86.411172	32.409937
-86.496774	32.344437

Shapefiles and conversions are complicated enough that we'll leave them for another post.

COVID-19 data analysis

I've been working on a project to analyze the data on COVID-19 cases and deaths (GitHub repo). The data is aggregated from public health departments by some folks at Johns Hopkins. Their dashboard is here, and the data are in this directory on Github.

Here is the help page for the covid project (it is the same for most of the scripts):

> python3 one_state.py --help
flags
-h  --help    help
-n   <int>    display the last n values, default: 7
-N   <int>    display N rows of data: default: 50
-c  --delta   change or delta, display day over day rise
-d  --deaths  display deaths rather than cases (default)
-r  --rate    compute statistics
-s  --sort    (only if stats are asked for)
to do:
-u   <int>    data slice ends this many days before yesterday 
-p  --pop     normalize to population
example:
python one_state.py [state] -n 10 -sr

And here is the output (today) for that example:

> python3 one_state.py SC -rs
               06/17 06/18 06/19 06/20 06/21 06/22 06/23  stats
Charleston      1264  1403  1554  1728  1836  2044  2251  0.094
Oconee            95   100   105   110   136   154   142  0.083
Pickens          348   367   429   464   499   529   570  0.083
Calhoun           47    48    58    62    69    73    74  0.082
...
total          20556 21533 22608 23756 24661 25666 26572  0.043

The statistic is a linear regression on cases, normalized to the mean of the values, and then the counties in my state (SC) are sorted according to the result. Charleston is my county, and unfortunately, it is the county with the highest rate of growth of cases in the state. Currently in the US, the top 20 counties are:

> python3 us_by_counties.py -rs -n 4 -N 20
                   06/20   06/21   06/22   06/23  stats
Thomas, KS             0       0      10      12  0.836
Hot Spring, AR        53      53     138     226  0.514
Holmes, FL            47      47      58     121  0.341
Jim Wells, TX         22      27      34      46  0.245
Brewster, TX          24      24      39      45  0.236
Erath, TX             44      44      44      85  0.227
McDonald, MO         170     366     371     403  0.215
Sharkey, MS            9       9      13      16  0.213
Blanco, TX            14      14      22      24  0.205
Newton, TX             6       8      11      11  0.2
Aroostook, ME         11      17      19      21  0.188
Tehama, CA            34      34      53      54  0.181
Sioux, ND             12      12      19      19  0.181
Okfuskee, OK           7       7      11      11  0.178
Bourbon, KS            9       9      14      14  0.174
Lawrence, MO          11      13      13      19  0.171
Harvey, KS            13      13      20      20  0.17
Pontotoc, MS          93      93     128     146  0.169
Letcher, KY            8       8       8      13  0.162
Live Oak, TX          10      10      15      15  0.16

I chose n = 4 so the output would be formatted correctly for the blog post.

As with any large dataset, there are some problems to work through, which are not solved perfectly yet. Also, I've focused more on the U.S. lately, so scripts for world data haven't been updated yet either.

What I got interested in and want to show is the generation of maps of the US by states or counties, or one or a few states by counties, where the fill color is based on, for example, the growth rate of cases. Here is the US by states.

I haven't generated the color bar horizontally yet, I just cut it out and rotated it, so the writing is rotated as well.

This type of map is called a choropleth map. I stumbled across a python tool for generating maps. It's part of the plotly library. It is as simple as

fig = px.choropleth(
    df,
    locations=abbrev,
    locationmode='USA-states',
    color=st,
    color_continuous_scale='Plasma',
    scope="usa",
    labels={'color':'growth'})

fig.show()

The details are slightly complicated, but not bad.

df is a pandas data frame that maps states by two-letter abbreviation to the corresponding statistic.

df = pd.DataFrame(data={'state':abbrev, 'value':st})

You need GeoJSON data for a county map (the states are already known to plotly.express). That data file is available from them.

The colors are mapped to the statistic st as read from the data frame. The last line of the call to px.choropleth assigns the title to the color bar.

The script for the US states is here and for the counties it is here.

This is the state of South Carolina today. These colors are an attempt to make the positives pop out more.

There's much more to discuss. I have always wanted to make a map of the US with my road trips plotted on it, something like this. For that we need to talk about GeoJSON data and how to obtain it, as well as the Albers projection that is used in making maps. It turns out that the standard methods from plotly have a significant limitation and I had a really weird bug in my code that I eventually figured out.

Finally, we'll need to find how to generate the data for each individual trip to overlay on the map. That's all for later.