Recently I've been exploring maps again, using
GeoPandas in Python. I found it confusing at first, but that was mainly because I didn't understand the underlying technology very well, especially
Pandas and the
shapely geometry library. I've had to brush up on
matplotlib, as well.
The figure above plots three thrips that we took to the US southwest, with a focus on the canyon country and Colorado. They were great trips, on highways and off the interstate. I wanted to memorialize them to help me remember.
Of course Google Maps is pretty good for this stuff. Here is a route from LA-SLC-SF. I drove these legs many times in my youth.
But I wanted more control.
I use Homebrew to obtain my own Python3, rather than relying on Apple's build that is provided with macOS. Some people dislike Homebrew, but I'm not one of them. I never have any trouble with my (simple) "stack", and if I did, I would just use a virtual environment.
One thing that has changed over time is the need to use sudo
to install Homebrew, which is moderately annoying, but I believe that happened because macOS now insists that /usr/local
be owned by root
. Perhaps I should put Homebrew in a different location, but I haven't.
In any case, I get the necessary packages by
python3 -m pip install --upgrade pip
python3 -m pip install -U numpy
python3 -m pip install -U pandas
python3 -m pip install -U matplotlib
python3 -m pip install -U geopandas
|
Then, all you need is some data. The main download page for the US Census is here. But the file I'm actually using for the boundaries of US states is at the bottom of this directory: gz_2010_us_040_00_5m.zip. Normally, geopandas should be able to read a ZIP file directly, but this one, you must unzip (or at least, that's what I did). Then
>>> import geopandas as gpd
>>> fn = 'gz_2010_us_040_00_5m'
>>> gdf = gpd.read_file(fn)
>>> gdf
GEO_ID ... geometry
0 0400000US01 ... MULTIPOLYGON (((-88.12466 30.28364, -88.08681 ...
..
[52 rows x 6 columns]
>>> gdf.columns
Index(['GEO_ID', 'STATE', 'NAME', 'LSAD', 'CENSUSAREA', 'geometry'], dtype='object')
>>>
|
It's as simple as that! To follow what I did, take a look at the github repo for this project.
A few quick tips. First, GeoPandas does things the Pandas way. So to get all the data for the state of Hawaii, say, you do:
sub = gdf[gdf['NAME'] == 'Hawaii']
|
or even
sub = gdf[gdf['STATE'] == '15']
|
where NAME and STATE are columns of the dataframe (and the FIPS code for Hawaii is 15). I think of an expression like that as a selector.
sel = gdf['NAME'] == 'Hawaii'
|
but it is actually a pandas "Series" of boolean values which the GeoDataFrame
accepts as input for the indexing operator. The rules for what pandas will accept as a selector are much stricter than I would like. However, I discovered that a simple Python list of booleans will also work.
>>> def f(e):
... if e.endswith('ii'):
... return True
... return False
...
>>>
>>> sel = [f(e) for e in gdf['NAME']]
>>> gdf[sel]
GEO_ID ... geometry
11 0400000US15 ... MULTIPOLYGON (((-155.77823 20.24574, -155.7727...
[1 rows x 6 columns]
>>>
|
Note that the original index is retained.
It can also be useful to get a sense of what is in a column by doing
>>> L = list(gdf['LSAD'])
>>> L
[nan, nan ..]
>>> list(set(L))
[nan, nan ..]
>>>
|
but not in this case. Also, I notice that the old trick
set(L)
doesn't work with
nan
(not a number). Which is weird, because I would have thought it was from
numpy
>>> import numpy as np
>>> L = [np.nan,np.nan]
>>> set(L)
{nan}
>>> list(set(L))
[nan]
>>>
|
It's important to realize how shapely.geometry
objects are structured, at least, if what you want to do is get your hands on the underlying data. You can see that we have a MULTIPOLYGON
but it is frustrating to get at it.
>>> sub = gdf[sel]
>>> sub
GEO_ID ... geometry
11 0400000US15 ... MULTIPOLYGON (((-155.77823 20.24574, -155.7727...
[1 rows x 6 columns]
>>>
|
The first thing is that the result of
sub['geometry']
(or
sub.geometry
) is a
geopandas.geoseries.GeoSeries
object and it can be subscripted. Not in the usual way but by
>>> mp = sub['geometry'].iloc[0]
|
I used
mp
for MULTIPOLYGON. That guy has component geoms, in fact, 9 of them. We can get the first one by
This will turn out to be the Big Island, Hawai'i. Any polygon has an exterior (and an optional interior, i.e. a hole). The exterior has
coords
which have an attribute
xy
so we do
>>> X,Y = poly.exterior.coords.xy
>>> X = X.tolist()
>>> print(len(X)) # 230
|
I don't have a "backend" for matplotlib on my setup, so I can't just do
plt.showfig()
. I do this instead:
>>> import matplotlib.pyplot as plt
>>>
>>> gdf.plot()
>>> plt.savefig('hawaii.png')
>>> import subprocess
>>> subprocess.run(['open','-a','Preview','x.png'])
..
|
The long traces that result from an error are often not informative. But you can just do some caveman debugging starting from right before the call that failed, which is first in the list. Something like
print(f'debug: {var1=} {var2=}')
|
Update: I added a short demo of
explode
. (The
index_parts
is to silence a warning I don't fully understand).
We capture the result of islands.plot()
, a matplotlib Axes
and use that to annotate the plot later. A full listing is below, without the annoying >>>
. A smart approach uses apply
, that's for another day.
You may notice that the individual islands are not named. Look at github (hawaii.py) to see how I handled that. AFAIK there is no identification in the shapefile.
import subprocess
import geopandas as gpd
import matplotlib.pyplot as plt
fn = '~/data/gz_2010_us_040_00_5m'
gdf = gpd.read_file(fn)
sel = gdf['NAME'] == 'Hawaii'
hawaii = gdf[sel]
islands = hawaii['geometry'].explode(
index_parts=True)
ax = islands.plot(cmap='Set3')
plt.rcParams.update({'font.size': 22})
ax.annotate(text='Hawaii',
xy = [-158,20],
ha = 'center')
ofn = 'hawaii.png'
plt.savefig(ofn)
subprocess.run(['open','-a','Preview',ofn])
|