Thursday, June 25, 2020

Shapefiles

I've been working with maps using Python, primarily maps for the United States.  The standard format for much geographic data is GeoJSON.  

But there is another format that is even more official, and is maintained by the US Census Bureau.  That is a collection of Shapefiles.

To recap, here is a site where GeoJSON files for the Us are available in multiple sizes (small, medium, large), as well as their .kml and Shapefile .shp equivalents.  The sizes are 500k, 5m, 20m, from largest to smallest (the labels are the scales, 500k being the most detailed).
 
The files were obtained from links on this page.  It includes data for the US, for the states, and for counties.  And it contains Congressional districts, which would be useful to remember.

A Shapefile is binary-encoded geographic data in a particular format.  A good discussion is here.

The specification was developed at least partly by ESRI, which develops geographic information software..  The encoding was undoubtedly designed to save space, back when space (storage, transmission bandwidth) was much more expensive.  Now, the opaque data is a liability.

Shapefiles

There is actually not just one file, but always a minimum of three including .shp, .shx and .dbf inside a zip container.  The Shapefiles from the US Census also have .prj and .xml.

I can't tell much by looking at one with hexdump, except that most of it is aligned (in sections), on 16-byte boundaries.  The format is described at this link, but I haven't worked with that.

One way to open and read a Shapefile is to use geopandas.  I grab that with pip3 install geopandas

The example is

>>> d = 'gz_2010_us_040_00_20m'
>>> fn = d + '/gz_2010_us_040_00_20m.shp'
>>> df = gpd.read_file(fn) 

The columns are:
  • GEO_ID
  • STATE
  • NAME
  • LSAD
  • CENSUSAREA
  • geometry
>>> df.NAME.head()
0        Arizona
1       Arkansas
2     California
3       Colorado
4    Connecticut
Name: NAME, dtype: object

For some reason the order is several joined partial lists of states, each one alphabetized.

We need to extract the coordinates for a particular state:

import geopandas as gpd
fn = 'ex.shp'
df = gpd.read_file(fn)
sel = df['NAME'] == 'Maine'
g = me = df.loc[sel].geometry
from shapely.geometry import mapping
D = mapping(g)
for f in D['features']:
print(f['id'])
L = f['geometry']['coordinates']

for m in L:
print(len(m[0]))
print(m[0][0])
print(m[0][1])
print()

> python3 script.py
39
11
(-69.307908, 43.773767)
(-69.30675099999999, 43.775095)
11
(-69.42792, 43.928798)
(-69.423323, 43.922871)
...