Sunday, June 18, 2023

Geopandas and maps

Recently I've been exploring maps again, using GeoPandas in Python. I found it confusing at first, but that was mainly because I didn't understand the underlying technology very well, especially Pandas and the shapely geometry library. I've had to brush up on matplotlib, as well.

The figure above plots three thrips that we took to the US southwest, with a focus on the canyon country and Colorado. They were great trips, on highways and off the interstate. I wanted to memorialize them to help me remember. Of course Google Maps is pretty good for this stuff. Here is a route from LA-SLC-SF. I drove these legs many times in my youth.

But I wanted more control.

I use Homebrew to obtain my own Python3, rather than relying on Apple's build that is provided with macOS. Some people dislike Homebrew, but I'm not one of them. I never have any trouble with my (simple) "stack", and if I did, I would just use a virtual environment.

One thing that has changed over time is the need to use sudo to install Homebrew, which is moderately annoying, but I believe that happened because macOS now insists that /usr/local be owned by root. Perhaps I should put Homebrew in a different location, but I haven't.

In any case, I get the necessary packages by

python3 -m pip install --upgrade pip
python3 -m pip install -U numpy
python3 -m pip install -U pandas
python3 -m pip install -U matplotlib
python3 -m pip install -U geopandas

Then, all you need is some data. The main download page for the US Census is here. But the file I'm actually using for the boundaries of US states is at the bottom of this directory: gz_2010_us_040_00_5m.zip. Normally, geopandas should be able to read a ZIP file directly, but this one, you must unzip (or at least, that's what I did). Then

>>> import geopandas as gpd
>>> fn = 'gz_2010_us_040_00_5m'
>>> gdf = gpd.read_file(fn)
>>> gdf
         GEO_ID  ...                                           geometry
0   0400000US01  ...  MULTIPOLYGON (((-88.12466 30.28364, -88.08681 ...
..
[52 rows x 6 columns]
>>> gdf.columns
Index(['GEO_ID', 'STATE', 'NAME', 'LSAD', 'CENSUSAREA', 'geometry'], dtype='object')
>>>

It's as simple as that! To follow what I did, take a look at the github repo for this project.

A few quick tips. First, GeoPandas does things the Pandas way. So to get all the data for the state of Hawaii, say, you do:

sub = gdf[gdf['NAME'] == 'Hawaii']

or even

sub = gdf[gdf['STATE'] == '15']

where NAME and STATE are columns of the dataframe (and the FIPS code for Hawaii is 15). I think of an expression like that as a selector.

sel = gdf['NAME'] == 'Hawaii'

but it is actually a pandas "Series" of boolean values which the GeoDataFrame accepts as input for the indexing operator. The rules for what pandas will accept as a selector are much stricter than I would like. However, I discovered that a simple Python list of booleans will also work.

>>> def f(e):
...     if e.endswith('ii'):
...         return True
...     return False
... 
>>>
>>> sel = [f(e) for e in gdf['NAME']]
>>> gdf[sel]
         GEO_ID  ...                                           geometry
11  0400000US15  ...  MULTIPOLYGON (((-155.77823 20.24574, -155.7727...

[1 rows x 6 columns]
>>> 

Note that the original index is retained. It can also be useful to get a sense of what is in a column by doing
>>> L = list(gdf['LSAD'])
>>> L
[nan, nan ..]
>>> list(set(L))
[nan, nan ..]
>>>

but not in this case. Also, I notice that the old trick set(L) doesn't work with nan (not a number). Which is weird, because I would have thought it was from numpy
>>> import numpy as np
>>> L = [np.nan,np.nan]
>>> set(L)
{nan}
>>> list(set(L))
[nan]
>>>

It's important to realize how shapely.geometry objects are structured, at least, if what you want to do is get your hands on the underlying data. You can see that we have a MULTIPOLYGON but it is frustrating to get at it.

>>> sub = gdf[sel]
>>> sub
         GEO_ID  ...                                           geometry
11  0400000US15  ...  MULTIPOLYGON (((-155.77823 20.24574, -155.7727...

[1 rows x 6 columns]
>>>

The first thing is that the result of sub['geometry'] (or sub.geometry) is a geopandas.geoseries.GeoSeries object and it can be subscripted. Not in the usual way but by
>>> mp = sub['geometry'].iloc[0]

I used mp for MULTIPOLYGON. That guy has component geoms, in fact, 9 of them. We can get the first one by
>>> poly = mp.geoms[0]

This will turn out to be the Big Island, Hawai'i. Any polygon has an exterior (and an optional interior, i.e. a hole). The exterior has coords which have an attribute xy so we do
>>> X,Y = poly.exterior.coords.xy
>>> X = X.tolist()
>>> print(len(X))  # 230

I don't have a "backend" for matplotlib on my setup, so I can't just do plt.showfig(). I do this instead:
>>> import matplotlib.pyplot as plt
>>>
>>> gdf.plot()

>>> plt.savefig('hawaii.png')
>>> import subprocess
>>> subprocess.run(['open','-a','Preview','x.png'])
..

The long traces that result from an error are often not informative. But you can just do some caveman debugging starting from right before the call that failed, which is first in the list. Something like

print(f'debug:  {var1=} {var2=}')

Update: I added a short demo of explode. (The index_parts is to silence a warning I don't fully understand).

We capture the result of islands.plot(), a matplotlib Axes and use that to annotate the plot later. A full listing is below, without the annoying >>>. A smart approach uses apply, that's for another day.

You may notice that the individual islands are not named. Look at github (hawaii.py) to see how I handled that. AFAIK there is no identification in the shapefile.

import subprocess
import geopandas as gpd
import matplotlib.pyplot as plt

fn = '~/data/gz_2010_us_040_00_5m'
gdf = gpd.read_file(fn)
sel = gdf['NAME'] == 'Hawaii'
hawaii = gdf[sel]

islands = hawaii['geometry'].explode(
    index_parts=True)
ax = islands.plot(cmap='Set3')

plt.rcParams.update({'font.size': 22})
ax.annotate(text='Hawaii',
    xy = [-158,20],
    ha = 'center')

ofn = 'hawaii.png'
plt.savefig(ofn)
subprocess.run(['open','-a','Preview',ofn])