Thursday, June 25, 2020

Plotting polygons

In the previous two posts (here and here), we looked at geographical heat maps (choropleth maps), constructed using plotly.  The purpose was to visualize COVID-19 data.  Now, we're going to dive deeper into maps, and for that purpose I made a new github repo.

As we said, it is easy to use plotly.express to make a map, say of the entire United States, or countries of the world, or US counties.

However, there are some limitations.  plotly is developed by a company whose main product appears to be Dash, which aims to be a fancy UI layer for working with data science overlying the basic widgets and libraries.

Mapping, in particular, is an issue.  While plotly.express.choropleth is easy, it is also limited.  

Once you want to dive deeper (for example, to zoom in on a map of a single state before display, or add labels to a map), you run into the fact the express layer doesn't allow it, and that the underlying maps are actually from a company called Mapbox.  

Mapbox wants to be a kind of Google maps that you embed in your iPhone app, only better.  They provide an introductory free level, but you have to sign up and obtain a Mapbox Access Token for many things.  

I decided to go without.  For the moment, I'm going to continue with plotly, although ultimately I'm leaning toward the position that matplotlib will be a better choice.

So what we're going to try here is to plot the map components (states) as polygons.  We'll run into some issues and solve them.

The first script is extract_counties.first.py.

There's nothing special about it except (i) it deals with the encoding issue we talked about before and (ii) it has a significant bug, which we will diagnose and fix.

> python3 extract_counties.first.py counties1.json > counties1.geo.txt
>

Autauga
Alabama
01001
-86.496774 32.344437
-86.717897 32.402814
-86.814912 32.340803
-86.890581 32.502974
-86.917595 32.664169
-86.71339 32.661732
-86.714219 32.705694
-86.413116 32.707386
-86.411172 32.409937
-86.496774 32.344437
Baldwin
Alabama
01003
...

Now we'll just use plotly to draw all the polygons for the state of Alabama using 

> python3 map_counties.first.py counties1.geo.txt AL

That looks kind of promising, it is recognizably Alabama.

There are a couple of problems.  The first is that the shape is squashed.  



And the second, more subtle, is that there is a missing county at the southwest corner:  Mobile.

This issue becomes more obvious when we try California.





That's definitely wrong.  We are missing Los Angeles Ventura and Santa Barbara counties, and if you look closely, San Francisco is missing as well.

The solution came to me when I realized that the term "MultiPolygon" had been mentioned, and I hadn't really understood what it was.  I had just adjusted for the case when the coordinates were wrapped by four '[[[[ .. ]]]]' versus three '[[[ .. ]]]' without thinking about it.

As an example, let's look at entries in the GeoJSON data that are MultiPolygons (most are just plain Polygon).  It is Bethel (a County or Census Area) in Alaska.  Here is a list of the first 18:

Bethel, AK
Hoonah-Angoon, AK
Juneau, AK
Kenai Peninsula, AK
Ketchikan Gateway, AK
Kodiak Island, AK
Lake and Peninsula, AK
Nome, AK
Petersburg, AK
Prince of Wales-Hyder, AK
Sitka, AK
Valdez-Cordova, AK
Wrangell, AK
Mobile, AL
Los Angeles, CA
San Francisco, CA
Santa Barbara, CA
Ventura, CA
..

We notice that Mobile and the four California counties are also on the list.

A map of Bethel reveals that it has not only an island but also a lake!


Bethel County (actually Census Area), Alaska is a MultiPolygon and has three arrays

[
 [[[-173.116905, 60.516005] ..  36 items .. [-173.116905, 60.516005]]],
 [[[-165.721389, 60.16962]  .. 128 items .. [-165.721389, 60.16962]]], 
 [[[-160.534142, 61.947257] .. 314 items .. [-160.534142, 61.947257]]]
]

Bethel County includes a large island called Nunivak Island.

The city of Mekoryuk on Nunivak has a lat,lon of 60.370411,-166.257202.  This suggests that the second list is the polygon for Nunivak.

[-166.310655, 60.377611], [-166.200019, 60.393404] is one adjacent pair of vertices in the second list.

I thought from the restricted range of values in the first list that it was probably the lake, but it's not.  It's a second, very small island.  I made a fake GeoJSON file with just Bethel County and then plotted it:



So the prediction is that Bethel County won't render properly by my script.  It's not the greatest test.  Most of Alaska is messed up (not shown)

I go back and fix extract_counties.py and then run the plotting script:

python3 map_counties.first.py counties.geo.txt CA



We have found the missing counties.



The next problem to solve is the squashed forms of the states.  The reason for that is that the latitudes and longitudes refer to the globe (an oblate spheroid), while the map is a projection of those points onto a 2-dimensional surface.  There are a variety of projections and the math is pretty complicated, so we will save that for next time.