Python for Bioinformatics: COVID-19 data analysis

I've been working on a project to analyze the data on COVID-19 cases and deaths (GitHub repo). The data are aggregated from public health departments by a team at Johns Hopkins. Their dashboard is here, and the data are in this directory on GitHub.

Here is the help page for the covid project (it is the same for most of the scripts):

> python3 one_state.py --help
flags
-h  --help    help
-n   <int>    display the last n values, default: 7
-N   <int>    display N rows of data: default: 50
-c  --delta   change or delta, display day over day rise
-d  --deaths  display deaths rather than cases (default)
-r  --rate    compute statistics
-s  --sort    (only if stats are asked for)
to do:
-u   <int>    data slice ends this many days before yesterday 
-p  --pop     normalize to population
example:
python one_state.py [state] -n 10 -sr

And here is today's output for a similar command, for my state:

> python3 one_state.py SC -rs
               06/17 06/18 06/19 06/20 06/21 06/22 06/23  stats
Charleston      1264  1403  1554  1728  1836  2044  2251  0.094
Oconee            95   100   105   110   136   154   142  0.083
Pickens          348   367   429   464   499   529   570  0.083
Calhoun           47    48    58    62    69    73    74  0.082
...
total          20556 21533 22608 23756 24661 25666 26572  0.043

The statistic is the slope of a linear regression on the case counts, normalized to the mean of the values, so it is roughly the fractional growth per day over the window shown. The counties in my state (SC) are then sorted by the result. Charleston is my county and, unfortunately, it has the highest rate of growth of cases in the state. Currently in the US, the top 20 counties are:

> python3 us_by_counties.py -rs -n 4 -N 20
                   06/20   06/21   06/22   06/23  stats
Thomas, KS             0       0      10      12  0.836
Hot Spring, AR        53      53     138     226  0.514
Holmes, FL            47      47      58     121  0.341
Jim Wells, TX         22      27      34      46  0.245
Brewster, TX          24      24      39      45  0.236
Erath, TX             44      44      44      85  0.227
McDonald, MO         170     366     371     403  0.215
Sharkey, MS            9       9      13      16  0.213
Blanco, TX            14      14      22      24  0.205
Newton, TX             6       8      11      11  0.2
Aroostook, ME         11      17      19      21  0.188
Tehama, CA            34      34      53      54  0.181
Sioux, ND             12      12      19      19  0.181
Okfuskee, OK           7       7      11      11  0.178
Bourbon, KS            9       9      14      14  0.174
Lawrence, MO          11      13      13      19  0.171
Harvey, KS            13      13      20      20  0.17
Pontotoc, MS          93      93     128     146  0.169
Letcher, KY            8       8       8      13  0.162
Live Oak, TX          10      10      15      15  0.16

I chose n = 4 so the output would be formatted correctly for the blog post.
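
The statistic in the last column can be reproduced from the rows shown. Here is a minimal sketch, assuming the calculation is an ordinary least-squares slope against the day index, divided by the mean of the values. The function name growth_stat is hypothetical and is not the name used in the scripts. Under that assumption it agrees with the printed values for Charleston and for Thomas, KS.

```python
def growth_stat(values):
    """Least-squares slope of values vs. day index, divided by the mean."""
    n = len(values)
    if n < 2:
        return 0.0
    mean_y = sum(values) / n
    if mean_y == 0:
        return 0.0  # no cases in the window, so no growth to report
    mean_x = (n - 1) / 2
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    return (sxy / sxx) / mean_y


charleston = [1264, 1403, 1554, 1728, 1836, 2044, 2251]
thomas_ks = [0, 0, 10, 12]
print(round(growth_stat(charleston), 3))  # 0.094
print(round(growth_stat(thomas_ks), 3))   # 0.836
```

Dividing by the mean is what lets a small county like Thomas, KS outrank a large one: a jump from 0 to 12 cases is a tiny absolute slope but a large one relative to an average of 5.5.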

As with any large dataset, there are some problems to work through, and not all of them are solved yet. I've also focused more on the US lately, so the scripts for world data haven't been updated.

What I got interested in, and want to show here, is generating maps of the US by state or by county (or of one or a few states by county), where the fill color is based on, for example, the growth rate of cases. Here is the US by states.

I haven't generated the color bar horizontally yet. I just cut it out and rotated it, so the writing is rotated as well.

This type of map is called a choropleth map. I stumbled across a Python tool for generating them, which is part of the plotly library. It is as simple as:

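import plotly.express as px

# abbrev: two-letter state codes; st: the matching statistics
fig = px.choropleth(
    df,
    locations=abbrev,
    locationmode='USA-states',
    color=st,
    color_continuous_scale='Plasma',
    scope="usa",
    labels={'color':'growth'})  # title for the color bar

fig.show()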

The details are slightly complicated, but not bad.

df is a pandas data frame that maps states by two-letter abbreviation to the corresponding statistic.

import pandas as pd
df = pd.DataFrame(data={'state':abbrev, 'value':st})

You need GeoJSON data for a county map (the state outlines are already known to plotly.express). That data file is available from plotly.
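
For the county map, a sketch along the following lines should work. It assumes the county GeoJSON published in the plotly datasets repository (the URL below is the one used in plotly's own examples), whose features are identified by five-digit FIPS strings. It also assumes the county FIPS codes may arrive as numbers such as 45019.0 and so need zero-padding. The names normalize_fips and county_map are hypothetical, not the ones used in the actual scripts.

```python
import json
from urllib.request import urlopen

# Assumed source: county outlines from the plotly datasets repository.
COUNTIES_URL = ("https://raw.githubusercontent.com/plotly/datasets/"
                "master/geojson-counties-fips.json")


def normalize_fips(code):
    """Return a county FIPS code as a zero-padded five-character string."""
    return str(int(float(code))).zfill(5)


def county_map(fips_codes, values):
    """Build a county choropleth; needs plotly and network access."""
    import plotly.express as px  # imported here so normalize_fips stands alone

    with urlopen(COUNTIES_URL) as response:
        counties = json.load(response)

    return px.choropleth(
        locations=[normalize_fips(c) for c in fips_codes],
        color=values,
        geojson=counties,  # feature ids are matched against `locations`
        color_continuous_scale='Plasma',
        scope="usa",
        labels={'color': 'growth'})
```

With that in place, county_map([45019.0, 45073.0], [0.094, 0.083]).show() would color Charleston and Oconee counties in SC by the statistics shown earlier.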

The colors are mapped to the statistic st as read from the data frame. The last line of the call to px.choropleth assigns the title to the color bar.

The script for the US states is here and for the counties it is here.

This is the state of South Carolina today. These colors are an attempt to make the positives pop out more.

There's much more to discuss. I have always wanted to make a map of the US with my road trips plotted on it, something like this. For that we need to talk about GeoJSON data and how to obtain it, as well as the Albers projection that is used in making maps. It turns out that the standard methods from plotly have a significant limitation, and I also had a really weird bug in my code that I eventually figured out.

Finally, we'll need to find how to generate the data for each individual trip to overlay on the map. That's all for later.

Wednesday, June 24, 2020
