Thursday, June 25, 2020

Shapefiles

I've been working with maps using Python, primarily maps for the United States.  The standard format for much geographic data is GeoJSON.  

But there is another format that is even more official, and is maintained by the US Census Bureau.  That is a collection of Shapefiles.

To recap, here is a site where GeoJSON files for the US are available in multiple sizes (small, medium, large), as well as their .kml and Shapefile .shp equivalents.  The sizes are 500k, 5m, 20m, from largest to smallest (the labels are the scales, 500k being the most detailed).
 
The files were obtained from links on this page.  It includes data for the US, for the states, and for counties.  And it contains Congressional districts, which would be useful to remember.

A Shapefile is binary-encoded geographic data in a particular format.  A good discussion is here.

The specification was developed at least partly by ESRI, which develops geographic information software.  The encoding was undoubtedly designed to save space, back when space (storage, transmission bandwidth) was much more expensive.  Now, the opaque data is a liability.

Shapefiles

There is actually not just one file, but always a minimum of three including .shp, .shx and .dbf inside a zip container.  The Shapefiles from the US Census also have .prj and .xml.

I can't tell much by looking at one with hexdump, except that most of it is aligned (in sections), on 16-byte boundaries.  The format is described at this link, but I haven't worked with that.
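For what it's worth, the fixed 100-byte header at the start of a .shp file can be decoded with the struct module.  This is only a minimal sketch based on the published ESRI layout (big-endian file code and length, little-endian version, shape type and bounding box).  The function name read_shp_header is made up for illustration, and nothing else in this post depends on it.

```python
import struct

def read_shp_header(data):
    """Parse the fixed 100-byte header at the start of a .shp file."""
    if len(data) < 100:
        raise ValueError("a .shp header is 100 bytes long")
    # Bytes 0-27: seven big-endian 32-bit ints
    # (file code, five unused values, file length in 16-bit words).
    file_code, *_unused, length_words = struct.unpack(">7i", data[:28])
    # Bytes 28-35: two little-endian 32-bit ints (version, shape type).
    version, shape_type = struct.unpack("<2i", data[28:36])
    # Bytes 36-67: bounding box as four little-endian doubles.
    xmin, ymin, xmax, ymax = struct.unpack("<4d", data[36:68])
    return {
        "file_code": file_code,            # 9994 for a shapefile
        "length_bytes": length_words * 2,  # stored as a count of 16-bit words
        "version": version,                # 1000
        "shape_type": shape_type,          # for example, 5 means Polygon
        "bbox": (xmin, ymin, xmax, ymax),
    }

# Possible usage (the path is the one used later in this post):
# with open('gz_2010_us_040_00_20m/gz_2010_us_040_00_20m.shp', 'rb') as fh:
#     print(read_shp_header(fh.read(100)))
```

The individual shape records that follow the header are more involved, which is a good reason to let a library do the work.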

One way to open and read a Shapefile is to use geopandas.  I grab that with pip3 install geopandas

The example is

>>> import geopandas as gpd
>>> d = 'gz_2010_us_040_00_20m'
>>> fn = d + '/gz_2010_us_040_00_20m.shp'
>>> df = gpd.read_file(fn)

The columns are:
  • GEO_ID
  • STATE
  • NAME
  • LSAD
  • CENSUSAREA
  • geometry
>>> df.NAME.head()
0        Arizona
1       Arkansas
2     California
3       Colorado
4    Connecticut
Name: NAME, dtype: object

For some reason the order is several joined partial lists of states, each one alphabetized.

We need to extract the coordinates for a particular state:

import geopandas as gpd
from shapely.geometry import mapping

fn = 'ex.shp'
df = gpd.read_file(fn)
sel = df['NAME'] == 'Maine'
g = me = df.loc[sel].geometry

# mapping() turns the GeoSeries into a GeoJSON-like dict
D = mapping(g)
for f in D['features']:
    print(f['id'])
    L = f['geometry']['coordinates']

# each m is one polygon of the MultiPolygon; m[0] is its outer ring
for m in L:
    print(len(m[0]))
    print(m[0][0])
    print(m[0][1])
    print()

> python3 script.py
39
11
(-69.307908, 43.773767)
(-69.30675099999999, 43.775095)
11
(-69.42792, 43.928798)
(-69.423323, 43.922871)
...

Pythagorean theorem redux

Here is a cool proof I saw on Twitter.  It was an RT by @StevenStrogatz from this guy:



Take a generic right triangle.  Flip, rotate and scale it by multiplying each side by a factor of b.  I did this by imagining that a = 1 and then b = ab.  The complementary angles are marked with circles on the right.


So then construct a rotated triangle from the same input, but scaled by a factor of a, and attach it to the other one.  We know that the sides marked ab are parallel, and that the angle between bc and ac is a right angle, by the properties of right, complementary and supplementary angles.


The four outside vertices form a rectangle, from the angles and also since ab = ab.

Finally, rotate and scale again, by a factor of c:


Slide them together
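
Since the argument leans heavily on the pictures, here it is written out.  This is a sketch that assumes the original right triangle has legs a and b and hypotenuse c, and that the three scaled copies are arranged as described above.

```latex
\begin{align*}
\text{scaled by } b &: \quad (a,\ b,\ c) \;\to\; (ab,\ b^2,\ bc) \\
\text{scaled by } a &: \quad (a,\ b,\ c) \;\to\; (a^2,\ ab,\ ac) \\
\text{scaled by } c &: \quad (a,\ b,\ c) \;\to\; (ac,\ bc,\ c^2)
\end{align*}
% The first two copies supply the parallel sides of length ab and,
% laid end to end, one long side of length a^2 + b^2.
% The third copy has legs ac and bc, which match the hypotenuses of the
% first two, so it fills the gap and its hypotenuse c^2 is the opposite side.
% Opposite sides of a rectangle are equal:
\[
a^2 + b^2 = c^2
\]
```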


Albers projection

The standard projection used in converting latitudes and longitudes on the (roughly) spherical earth to a planar map depends on what you’re projecting.

Most people know about the Mercator projection.

The one used extensively for maps of the United States is called the Albers Equal-Area Conic projection.



The Wikipedia article gives some formulas:


You must first choose two reference latitudes.  Latitudes are referred to by the letter phi and longitudes by the letter lambda.  So the two references are phi_1 and phi_2. 

You also choose a center for the map, at phi_0, lambda_0.

From these four values you calculate n, C and then rho_0, which are each the same for every transformation in this projection.  

Finally, for each coordinate phi, lambda one calculates rho and theta and then finally

x = rho sin theta
y = rho_0 - rho cos theta

This, however, is for the assumption of a spherical earth.  The equations for an ellipsoid are a bit harder.
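
The spherical recipe above can be sketched in a few lines of Python.  This is not the sphere.py mentioned below, just a minimal version using the formulas from the Wikipedia article, with names of my own choosing.  Angles are in degrees, and R = 1 (a unit sphere) is an assumption.

```python
from math import radians, sin, cos, sqrt

def albers_sphere(lat, lon, lat0, lon0, lat1, lat2, R=1.0):
    """Albers equal-area conic projection for a sphere of radius R.

    All angles are in degrees.  (lat0, lon0) is the map center and
    lat1, lat2 are the two reference latitudes.
    """
    phi, lam = radians(lat), radians(lon)
    phi0, lam0 = radians(lat0), radians(lon0)
    phi1, phi2 = radians(lat1), radians(lat2)

    # Constants that are the same for every point in this projection.
    n = (sin(phi1) + sin(phi2)) / 2
    C = cos(phi1) ** 2 + 2 * n * sin(phi1)
    rho0 = R * sqrt(C - 2 * n * sin(phi0)) / n

    # Values computed for each point.
    rho = R * sqrt(C - 2 * n * sin(phi)) / n
    theta = n * (lam - lam0)
    return rho * sin(theta), rho0 - rho * cos(theta)

# The center of the map should land on the origin.
print(albers_sphere(23, -96, 23, -96, 29.5, 45.5))   # (0.0, 0.0)
```

The reference latitudes 29.5 and 45.5 in the usage line are example values I picked, not something fixed by the formulas.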

The Wikipedia article gives this url for a pdf, and I found the same report referenced in this answer to a question on Stack Exchange.

That was lucky, because it gave me not only the equations but also a worked numerical example for each, which helped greatly in finding my mistakes in the code.

The math is explained in a write-up done in LaTeX, as a Dropbox link to a pdf.  The math has been encoded as Python scripts sphere.py and ellipsoid.py here.

Here are screenshots from the manual:



The test point is latitude 35°N, longitude 75°W (that is, -75°), with the map centered at 23°N, 96°W.  It should give a result (for the ellipsoid) of



> python ellipsoid.py 
test1
p0: 23.0
l0: -96.0
p:  35.0
l:  -75.0
x:  1885472.7
y:  1535925.0

The first part of the output is the center we chose.  The result for x and y matches the source.
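
For readers who don't want to chase the links, here is a minimal sketch of the ellipsoid version.  It is not the ellipsoid.py referred to above.  It uses the standard ellipsoidal Albers formulas with the Clarke 1866 constants, and it assumes reference latitudes of 29.5 and 45.5 degrees, which is my reading of the worked example rather than anything stated in this post.  The function and helper names are hypothetical.

```python
from math import radians, sin, cos, sqrt, log

A_CLARKE_1866 = 6378206.4     # semi-major axis in meters
E2_CLARKE_1866 = 0.00676866   # eccentricity squared

def _m(phi, e2):
    return cos(phi) / sqrt(1 - e2 * sin(phi) ** 2)

def _q(phi, e2):
    e = sqrt(e2)
    s = sin(phi)
    return (1 - e2) * (s / (1 - e2 * s * s)
                       - log((1 - e * s) / (1 + e * s)) / (2 * e))

def albers_ellipsoid(lat, lon, lat0, lon0, lat1, lat2,
                     a=A_CLARKE_1866, e2=E2_CLARKE_1866):
    """Albers equal-area conic projection on an ellipsoid (degrees in, meters out)."""
    phi, lam = radians(lat), radians(lon)
    phi0, lam0 = radians(lat0), radians(lon0)
    phi1, phi2 = radians(lat1), radians(lat2)

    # Constants that are the same for every point in this projection.
    m1, m2 = _m(phi1, e2), _m(phi2, e2)
    q0, q1, q2 = _q(phi0, e2), _q(phi1, e2), _q(phi2, e2)
    n = (m1 ** 2 - m2 ** 2) / (q2 - q1)
    C = m1 ** 2 + n * q1
    rho0 = a * sqrt(C - n * q0) / n

    # Values computed for each point.
    rho = a * sqrt(C - n * _q(phi, e2)) / n
    theta = n * (lam - lam0)
    return rho * sin(theta), rho0 - rho * cos(theta)

# Test point 35 N, 75 W with the center at 23 N, 96 W (reference latitudes assumed).
x, y = albers_ellipsoid(35, -75, 23, -96, 29.5, 45.5)
print(round(x, 1), round(y, 1))
```

With those assumed inputs, the result should agree with the x and y values printed above to within rounding.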

So then, for a script in the same directory as ellipsoid.py, we can do import project and do 

> python3 map_counties.py counties.geo.txt AL



No longer squashed.

Naturally, there is software out there that will do this.  In particular, the Proj transformation software.

I obtained it with Homebrew

> brew install proj

> echo 55.2 12.2 | proj +proj=merc +lat_ts=56.5 +ellps=GRS80
3399483.80 752085.60

which matches their example.

Of course, we really want the Albers equal area conical projection based on the ellipsoid.
> echo -75 35 | proj +proj=aea +lat_1=29.5 +lat_2=44.5 +lat_0=23 +lon_0=-96 +ellps=clrk66
1887211.95 1533994.75
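These are close but don't quite match.  I will have to explore Proj more to know why.  One thing to check is the second reference latitude:  the command above uses lat_2=44.5, and if the worked example in the manual uses 45.5 instead, that alone could account for the difference.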

Put the input data into a file and do:
> cat data.txt | proj +proj=aea +lat_1=29.5 +lat_2=44.5 +lat_0=23 +lon_0=-96 +ellps=clrk66
1887211.95	1533994.75
>



Plotting polygons

In the previous two posts (here and here), we looked at geographical heat maps (choropleth maps), constructed using plotly.  The purpose was to visualize COVID-19 data.  Now, we're going to dive deeper into maps, and for that purpose I made a new GitHub repo.

As we said, it is easy to use plotly.express to make a map, say of the entire United States, or countries of the world, or US counties.

However, there are some limitations.  plotly is developed by a company whose main product appears to be Dash, which aims to be a fancy UI layer for working with data science overlying the basic widgets and libraries.

Mapping, in particular, is an issue.  While plotly.express.choropleth is easy, it is also limited.  

Once you want to dive deeper (for example, to zoom in on a map of a single state before display, or add labels to a map), you run into the fact that the express layer doesn't allow it, and that the underlying maps are actually from a company called Mapbox.  

Mapbox wants to be a kind of Google maps that you embed in your iPhone app, only better.  They provide an introductory free level, but you have to sign up and obtain a Mapbox Access Token for many things.  

I decided to go without.  For the moment, I'm going to continue with plotly, although ultimately I'm leaning toward the position that matplotlib will be a better choice.

So what we're going to try here is to plot the map components (states) as polygons.  We'll run into some issues and solve them.

The first script is extract_counties.first.py.

There's nothing special about it except (i) it deals with the encoding issue we talked about before and (ii) it has a significant bug, which we will diagnose and fix.

> python3 extract_counties.first.py counties1.json > counties1.geo.txt
>

Autauga
Alabama
01001
-86.496774 32.344437
-86.717897 32.402814
-86.814912 32.340803
-86.890581 32.502974
-86.917595 32.664169
-86.71339 32.661732
-86.714219 32.705694
-86.413116 32.707386
-86.411172 32.409937
-86.496774 32.344437
Baldwin
Alabama
01003
...

Now we'll just use plotly to draw all the polygons for the state of Alabama using 

> python3 map_counties.first.py counties1.geo.txt AL

That looks kind of promising; it is recognizably Alabama.

There are a couple of problems.  The first is that the shape is squashed.  



And the second, more subtle, is that there is a missing county at the southwest corner:  Mobile.

This issue becomes more obvious when we try California.





That's definitely wrong.  We are missing Los Angeles, Ventura and Santa Barbara counties, and if you look closely, San Francisco is missing as well.

The solution came to me when I realized that the term "MultiPolygon" had been mentioned, and I hadn't really understood what it was.  I had just adjusted for the case when the coordinates were wrapped by four '[[[[ .. ]]]]' versus three '[[[ .. ]]]' without thinking about it.

Let's look at the entries in the GeoJSON data that are MultiPolygons (most are just plain Polygon).  The first one, which we'll use as an example, is Bethel (a County or Census Area) in Alaska.  Here is a list of the first 18:

Bethel, AK
Hoonah-Angoon, AK
Juneau, AK
Kenai Peninsula, AK
Ketchikan Gateway, AK
Kodiak Island, AK
Lake and Peninsula, AK
Nome, AK
Petersburg, AK
Prince of Wales-Hyder, AK
Sitka, AK
Valdez-Cordova, AK
Wrangell, AK
Mobile, AL
Los Angeles, CA
San Francisco, CA
Santa Barbara, CA
Ventura, CA
..

We notice that Mobile and the four California counties are also on the list.

A map of Bethel reveals that it has not only an island but also a lake!


Bethel County (actually Census Area), Alaska is a MultiPolygon and has three arrays

[
 [[[-173.116905, 60.516005] ..  36 items .. [-173.116905, 60.516005]]],
 [[[-165.721389, 60.16962]  .. 128 items .. [-165.721389, 60.16962]]], 
 [[[-160.534142, 61.947257] .. 314 items .. [-160.534142, 61.947257]]]
]
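
The extra level of nesting shown above can be handled uniformly.  Here is a minimal sketch (not the actual fix made to extract_counties.py) that assumes only the GeoJSON convention that a Polygon is a list of rings and a MultiPolygon is a list of Polygons.  The function names are made up for illustration.

```python
def iter_polygons(geometry):
    """Yield each polygon of a GeoJSON geometry as a list of rings.

    A Polygon's coordinates are one list of rings; a MultiPolygon's
    coordinates are a list of such lists, so it has one more level.
    """
    kind = geometry["type"]
    coords = geometry["coordinates"]
    if kind == "Polygon":
        yield coords
    elif kind == "MultiPolygon":
        yield from coords
    else:
        raise ValueError("unsupported geometry type: " + kind)

def outer_rings(geometry):
    """Return the outer boundary (the first ring) of every polygon."""
    return [rings[0] for rings in iter_polygons(geometry)]

# A toy MultiPolygon with two small triangles.
toy = {"type": "MultiPolygon",
       "coordinates": [[[[0, 0], [1, 0], [1, 1], [0, 0]]],
                       [[[5, 5], [6, 5], [6, 6], [5, 5]]]]}
print(len(outer_rings(toy)))   # 2
```

Checking the "type" key is more robust than counting brackets, because a Polygon with a hole also has more than one ring.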

Bethel County includes a large island called Nunivak Island.

The city of Mekoryuk on Nunivak has a lat,lon of 60.370411,-166.257202.  This suggests that the second list is the polygon for Nunivak.

[-166.310655, 60.377611], [-166.200019, 60.393404] is one adjacent pair of vertices in the second list.

I thought from the restricted range of values in the first list that it was probably the lake, but it's not.  It's a second, very small island.  I made a fake GeoJSON file with just Bethel County and then plotted it:



So the prediction is that Bethel County won't be rendered properly by my script.  It's not the greatest test, since most of Alaska is messed up (not shown).

I go back and fix extract_counties.py and then run the plotting script:

> python3 map_counties.first.py counties.geo.txt CA



We have found the missing counties.



The next problem to solve is the squashed forms of the states.  The reason for that is that the latitudes and longitudes refer to the globe (an oblate spheroid), while the map is a projection of those points onto a 2-dimensional surface.  There are a variety of projections and the math is pretty complicated, so we will save that for next time.

Wednesday, June 24, 2020

GeoJSON

GeoJSON is a JSON representation of geographic data.  Although JSON was developed for javascript, the format should be very familiar to any Python programmer.  There are collections:  dicts and lists, as well as simple types like:  string, number, boolean or null.  Dictionaries are called objects, and may also be values contained in lists or other dictionaries.

A typical GeoJSON file looks like this:
{"type":"FeatureCollection",
 "features":
  [
    {"type":"Feature",
     "properties":{"GEO_ID":"0500000US01001",
                   "STATE":"01",
                   "COUNTY":"001",
                   "NAME":"Autauga",
                   "LSAD":"County",
                   "CENSUSAREA":594.436
                  },
     "geometry": {"type":"Polygon",
                  "coordinates": [[ [-86.496774, 32.344437],
                                      ...
                                    [-86.496774, 32.344437]
                                ]]
                 },
     "id":"01001"
    },
 
    {        
    "type":"Feature",
    ... 
                   "STATE": "01", 
                   "COUNTY": "009", 
                   "NAME": "Blount"
    ...
    }
 
  ]
}
(I've formatted whitespace to my taste, YMMV).

At top level, it is a dict with two keys.  The "type" is "FeatureCollection".

The second key is "features", which is a list of Feature objects.

The "id" is a FIPS code for the Feature.  This Feature is Autauga County, Alabama.  The FIPS code for the state is "01" and the full FIPS for the county is "01001".

Each Feature is a dict with four keys.  The "properties" and "geometry" keys yield dicts in turn.  The "geometry" dict has a key "coordinates" which gives a list of [longitude, latitude] pairs, one for each vertex of the Feature's boundary.

The coordinates are nested inconsistently.  Sometimes it's two deep:  a list with one element that is a list with many 2-element vertices.  Sometimes it's three deep.

[ Update:  this is a great over-simplification.  The problem is that these are Multipolygons, and they do mean multi.  That will have to wait for another post.  Here's a link to the Mapbox documentation. ]

The number of vertices depends on the resolution.  For some counties at high resolution, it could run to 300 or more tuples.

For use with plotly.express.choropleth, there's no reason to parse this stuff or to generate a file with only the desired Features.  

Just load the data

import json
with open(fn,'r') as fh:
    counties = json.load(fh)

then make a pandas data frame with one or more elements like

    fips  value
0  01001      1
1  06071      2

by

import pandas as pd
pd.DataFrame({'fips':list_of_fips, 'value': list_of_values})

(or just pass {'fips':list_of_fips, 'value': list_of_values} as the df argument to choropleth).

import plotly.express as px
fig = px.choropleth(
    df,
    geojson=counties,
    locations='fips',
    color=["Autauga","San Bernardino"],
    color_discrete_sequence=cL,
    scope='usa')
    
fig.show()
The locations argument says to match the "fips" from the data frame with the default value in the GeoJSON, which is its "id".  The colors can be specified as a color list like

cL = ['green','magenta']

See the plotly tutorial section on colors for more detail.  The scope argument limits the region plotted on the map.

States are even easier, since plotly already knows the state GeoJSON data.

import plotly.express as px
fig = px.choropleth(
    locations=['CA','TX','SC'],
    locationmode="USA-states", 
    color=[1,2,3], 
    scope="usa")
    
fig.show()

(The county GeoJSON data is available from plotly; see the first example for the URL.)

I came across a collection of GeoJSON files on the web.  This data is originally from the US Census, and has been converted to GeoJSON by the author of that post.

One slight hiccup is that the GeoJSON file includes Puerto Rico, which has municipalities with Spanish names.  These are encoded with ISO-8859-1.  

This uses high-order bytes to represent Spanish ñ, í and so on.  Some of these byte sequences are not valid UTF-8.

> python3
Python 3.7.7 (default, Mar 10 2020, 15:43:33) 
..
>>> fn = 'gz_2010_us_050_00_20m.json'
>>> fh = open(fn)
>>> data = fh.read()
Traceback (most recent call last):
..
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 935286: invalid continuation byte

Take a look:

>>> fh = open(fn,'rb')
>>> data = fh.read()
>>> data[935280:935290]
b'"Comer\xedo",'
>>> list(data[935280:935290])
[34, 67, 111, 109, 101, 114, 237, 111, 34, 44]

This is for Comerío, Puerto Rico.  The byte (237, 0xed) is mapped to the accented í in ISO-8859-1.  But this and the following two bytes are

>>> '{:08b}'.format(237)
'11101101'
>>> '{:08b}'.format(111)
'01101111'
>>> '{:08b}'.format(34)
'00100010'
In UTF-8, the first byte starts with 1110, which indicates that this is the start of a three-byte sequence.  The two bytes that follow should each begin with 10, but here they begin with 01 and 00, so the sequence is not valid.
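
The same check can be done in code.  This is a small sketch using only the ten bytes shown above; the helper names are made up for illustration.

```python
raw = b'"Comer\xedo",'   # the ten bytes shown above

def is_continuation(byte):
    """True if the byte has the bit pattern 10xxxxxx."""
    return (byte & 0b11000000) == 0b10000000

def decodes_as_utf8(data):
    """True if the bytes are valid UTF-8."""
    try:
        data.decode('utf-8')
    except UnicodeDecodeError:
        return False
    return True

lead = raw[6]                                    # 0xed = 11101101
print(lead >> 4 == 0b1110)                       # True: lead byte of a 3-byte sequence
print([is_continuation(b) for b in raw[7:9]])    # [False, False]
print(decodes_as_utf8(raw))                      # False
print(raw.decode('iso-8859-1'))                  # "Comerío",
```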

There is a nice module to document such issues.  I do pip3 install validate-utf8.
> validate-utf8 src.json
invalid continuation byte
  COUNTY": "045", "NAME": "Comerío", "LSAD": "Muno", "CENSUSARE
                                ^
...

To fix this

>>> fn = 'gz_2010_us_050_00_5m.json'
>>> fh = open(fn,'rb')
>>> s = fh.read().decode('ISO-8859-1')
>>> fh.close()
>>> fn = 'counties.json'
>>> fh = open(fn,'w',encoding='utf-8')
>>> fh.write(s)
>>> fh.close()

For my later explorations, I found it useful to parse the GeoJSON to a flat format with elements like:
01001
Autauga
Alabama
-86.496774 32.344437
-86.717897 32.402814
-86.814912 32.340803
-86.890581 32.502974
-86.917595 32.664169
-86.71339 32.661732
-86.714219 32.705694
-86.413116 32.707386
-86.411172 32.409937
-86.496774 32.344437
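
Producing that flat format from a Feature might look like the following.  This is a minimal sketch, not the extraction script from the repo.  It assumes the property names shown in the sample Feature above ("STATE", "COUNTY", "NAME"), handles only plain Polygons, and uses a hypothetical STATE_NAMES lookup because the GeoJSON stores the state as a FIPS code rather than a name.

```python
STATE_NAMES = {"01": "Alabama"}   # hypothetical lookup table; extend as needed

def flatten_feature(feature):
    """Return the lines of the flat format for one Polygon Feature."""
    props = feature["properties"]
    geom = feature["geometry"]
    if geom["type"] != "Polygon":
        raise ValueError("this sketch only handles plain Polygons")
    fips = props["STATE"] + props["COUNTY"]
    state = STATE_NAMES.get(props["STATE"], props["STATE"])
    lines = [fips, props["NAME"], state]
    # The first ring is the outer boundary; each vertex is [lon, lat].
    for lon, lat in geom["coordinates"][0]:
        lines.append("{} {}".format(lon, lat))
    return lines

# Possible usage, with counties loaded by json.load as shown earlier:
# for feature in counties["features"]:
#     print("\n".join(flatten_feature(feature)))
```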
Shapefiles and conversions are complicated enough that we'll leave them for another post.

COVID-19 data analysis

I've been working on a project to analyze the data on COVID-19 cases and deaths (GitHub repo). The data is aggregated from public health departments by some folks at Johns Hopkins.  Their dashboard is here, and the data are in this directory on Github.

Here is the help page for the covid project (it is the same for most of the scripts):
> python3 one_state.py --help
flags
-h  --help    help
-n   <int>    display the last n values, default: 7
-N   <int>    display N rows of data: default: 50
-c  --delta   change or delta, display day over day rise
-d  --deaths  display deaths rather than cases (default)
-r  --rate    compute statistics
-s  --sort    (only if stats are asked for)
to do:
-u   <int>    data slice ends this many days before yesterday 
-p  --pop     normalize to population
example:
python one_state.py [state] -n 10 -sr
And here is the output (today) for that example:
> python3 one_state.py SC -rs
               06/17 06/18 06/19 06/20 06/21 06/22 06/23  stats
Charleston      1264  1403  1554  1728  1836  2044  2251  0.094
Oconee            95   100   105   110   136   154   142  0.083
Pickens          348   367   429   464   499   529   570  0.083
Calhoun           47    48    58    62    69    73    74  0.082
...
total          20556 21533 22608 23756 24661 25666 26572  0.043
The statistic is a linear regression on cases, normalized to the mean of the values, and then the counties in my state (SC) are sorted according to the result.  Charleston is my county, and unfortunately, it is the county with the highest rate of growth of cases in the state.  Currently in the US, the top 20 counties are:

> python3 us_by_counties.py -rs -n 4 -N 20
                   06/20   06/21   06/22   06/23  stats
Thomas, KS             0       0      10      12  0.836
Hot Spring, AR        53      53     138     226  0.514
Holmes, FL            47      47      58     121  0.341
Jim Wells, TX         22      27      34      46  0.245
Brewster, TX          24      24      39      45  0.236
Erath, TX             44      44      44      85  0.227
McDonald, MO         170     366     371     403  0.215
Sharkey, MS            9       9      13      16  0.213
Blanco, TX            14      14      22      24  0.205
Newton, TX             6       8      11      11  0.2
Aroostook, ME         11      17      19      21  0.188
Tehama, CA            34      34      53      54  0.181
Sioux, ND             12      12      19      19  0.181
Okfuskee, OK           7       7      11      11  0.178
Bourbon, KS            9       9      14      14  0.174
Lawrence, MO          11      13      13      19  0.171
Harvey, KS            13      13      20      20  0.17
Pontotoc, MS          93      93     128     146  0.169
Letcher, KY            8       8       8      13  0.162
Live Oak, TX          10      10      15      15  0.16

I chose n = 4 so the output would be formatted correctly for the blog post.
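
The statistic in the last column can be sketched as the slope of a least-squares line through the values (with the days numbered 0, 1, 2, ...), divided by the mean of the values.  This is a minimal version written for this post, not the code from the repo, and the function name growth_stat is made up.  It does reproduce the numbers shown in the tables above.

```python
def growth_stat(values):
    """Least-squares slope of values against day number, divided by their mean.

    Assumes at least two values and a nonzero mean.
    """
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    return sxy / sxx / mean_y

charleston = [1264, 1403, 1554, 1728, 1836, 2044, 2251]
thomas_ks = [0, 0, 10, 12]
print(round(growth_stat(charleston), 3))   # 0.094
print(round(growth_stat(thomas_ks), 3))    # 0.836
```

Dividing by the mean is what lets a small county like Thomas, KS outrank a large one:  the statistic measures relative, not absolute, growth.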

As with any large dataset, there are some problems to work through, which are not solved perfectly yet.  Also, I've focused more on the U.S. lately, so scripts for world data haven't been updated yet either.

What I got interested in and want to show is the generation of maps of the US by states or counties, or one or a few states by counties, where the fill color is based on, for example, the growth rate of cases.  Here is the US by states.



I haven't generated the color bar horizontally yet, I just cut it out and rotated it, so the writing is rotated as well.

This type of map is called a choropleth map.  I stumbled across a Python tool for generating maps.  It's part of the plotly library.  It is as simple as 

fig = px.choropleth(
    df,
    locations=abbrev,
    locationmode='USA-states',
    color=st,
    color_continuous_scale='Plasma',
    scope="usa",
    labels={'color':'growth'})
fig.show()

The details are slightly complicated, but not bad. 

df is a pandas data frame that maps states by two-letter abbreviation to the corresponding statistic.
df = pd.DataFrame(data={'state':abbrev, 'value':st})
You need GeoJSON data for a county map (the states are already known to plotly.express).  That data file is available from them.

The colors are mapped to the statistic st as read from the data frame.  The last line of the call to px.choropleth assigns the title to the color bar.

The script for the US states is here and for the counties it is here.

This is the state of South Carolina today.  These colors are an attempt to make the positives pop out more.



There's much more to discuss.  I have always wanted to make a map of the US with my road trips plotted on it, something like this.  For that we need to talk about GeoJSON data and how to obtain it, as well as the Albers projection that is used in making maps.  It turns out that the standard methods from plotly have a significant limitation and I had a really weird bug in my code that I eventually figured out.

Finally, we'll need to find how to generate the data for each individual trip to overlay on the map.  That's all for later.



Monday, May 18, 2020

Virus attenuation

Vaccines and IgA

It is widely held that attenuated live-virus vaccines are the best vaccines, at least for respiratory diseases, because they are able to induce an IgA response.  Plus, nearly all viruses enter the host via mucosal surfaces (oral cavity, nose, gut, lungs), where the first line of defense includes IgA.

Poliovirus is the exception that proves the rule.  Polio grows extensively in the intestine, hence its transmission by the fecal-oral route, resulting in a unique epidemiological role for swimming pools in advanced countries.  However, polio doesn't seem to cause much pathology there.  The major trouble occurs when it moves to the nervous system.

OPV (oral polio vaccine) induces IgA while IPV (inactivated polio vaccine) does not.  This restriction is not a serious problem for the inactivated virus vaccine because it seems that IgG is sufficient to prevent polio's movement to the nervous system.



There is some prospect for development of adjuvants that would drive B-cells to switch to producing IgA.  One of the best understood is cholera toxin, where binding to certain cells stimulates IgA-promoting cytokines (IL-1, IL-6, IL-10).  However, such toxins are too toxic for use in humans.  Perhaps someday we'll understand enough to be able to get an IgA response with a killed virus, but that moment has not come yet.  ref

Virus attenuation

We are left with the fact that historically, live virus vaccines have mainly been produced via attenuation.  We can look at some modern approaches elsewhere, like smallpox or adenovirus or retroviruses that are already attenuated and can display viral antigens.

Yellow fever virus (YFV), the cause of "the black vomit", is now a disease of the tropics, but it was once common in the US.  10% of Philadelphia's population was lost in an epidemic in 1793, and even more in New Orleans some years later.

In the mid-1930s, Max Theiler found that YFV could grow in mouse embryos.  So it was passaged from one mouse embryo to another, and then it was found that, somehow, the virus acquired the ability to grow in chicken embryos.  So one general approach is to try to grow the virus in some kind of cells (anything):  human, if necessary, or monkeys or mice and then adapt them to grow in chicken cells.

The chicken cells can either be in a whole embryonated hen's egg (i.e., with a growing embryo), or cells growing in a culture dish.  Here is a picture showing the different sites within the egg that are suited for different viruses (I believe these are mostly viruses that have been adapted to grow in eggs already).


I haven't read enough to know, but I would suspect that cultured cells for virus were usually CEF (chick embryo fibroblasts), which are easy to prepare.  You mince the embryo, first removing the head, and put the pieces in culture.  A few days later you harvest the growing cells by treatment with the enzyme, trypsin.  A few cells are transferred to a new flask.  Now, you have fibroblasts which will grow for a number of generations, and no more embryo.  These are called primary cultures.

Today a number of cell lines have been derived from chicken that will divide forever.  There is a famous cell line from humans called HeLa which you may have read about.

The measles virus (MV) was first grown in human kidneys in tissue culture, then in human placentas in tissue culture, and then in chicken eggs.  Later, it was adapted to primary cultures of CEF.  In rare cases cells derived from an aborted human embryo have been used.

Another approach is to adapt the virus to growth at lower temperatures.  Due to paywall restrictions, I haven't been able to read much of the literature on this, but I believe it was done in CEF growing at 25°C.  There is a well-known influenza live virus vaccine of this type.

So, a virus that normally infects humans and causes disease is adapted to grow in chicken cells, or adapted to grow at 25°C instead of 37°C, or both.  Afterward, you may find that the procedure yields a virus that no longer grows very well and does not cause disease in humans.

This may be because a virus can specialize in one or the other but not both.  Or it may be that during prolonged replication mutations accumulate that affect growth under the original condition, but these are not selected against as they would be in the original host or at the original temperature.

Molecular biology

Surprisingly little is known about the molecular basis of attenuation.  Probably that's because such work requires a system for reverse genetics.  That would be some DNA-based clone where the mutations to be tested could be introduced, followed by a method to produce the live (typically RNA) virus.  I've written about a new system of this type for SARS-CoV-2 where the clones are maintained in yeast.

In addition, the gold standard would be to test the virus in primates like monkeys.  That's really expensive and would need to be fully justified.  Without a strong need to know, it might be hard to get approval for such a study today.

One case where new antigens are substituted into an attenuated virus is the live influenza vaccine.

Influenza is a segmented virus.  If two different strains of influenza infect the same cell, you can get reassortment.  Suppose we start with a known attenuated mutant (due to changes in PB1 and PB2), and coinfect with an influenza virus whose HA and NA we want in the new vaccine.  Just take the progeny viruses, clone them (propagate descendants from isolated single viruses), and then choose the one with the right genes:  PB1 and PB2 from virus 1, and HA and NA from virus 2.


In summary, attenuation has been widely used but is still more magic than science.  The best thing would be if an attenuated vaccine for the original SARS had been developed.  Then you could just substitute the new RBD (receptor binding domain) and try it out.  Unfortunately, it doesn't seem that was ever accomplished.



Tuesday, May 12, 2020

Measles virus

This is the first of a short series of posts in which I give some basic information on viruses that cause serious disease in humans.  It's motivated by the current pandemic with SARS-CoV-2.

The focus is on viruses that are human-specific but may have "jumped" species, the development of live attenuated virus vaccines, and the general nature of host restriction.

In this post, we'll talk about measles.  Measles is commonly known as rubeola (not to be confused with rubella), but also red measles and "English measles".

Morphology

Measles virus is a single-stranded, negative-sense, enveloped RNA virus.  It is a member of the family Paramyxoviridae, genus Morbillivirus.  Below on the left is measles virus; its relative, mumps virus (also a paramyxovirus), is on the right.




Pathogenesis

Typically the first symptoms of measles include high fever of about 4 days duration, a characteristic rash, and what are called the "three C's":  cough, coryza (runny nose) and conjunctivitis.



Here are two pictures showing the characteristic rash.



The rash is called "flat" because
A maculopapular rash is a type of rash characterized by a flat, red area on the skin that is covered with small confluent bumps. It may only appear red in lighter-skinned people. The term "maculopapular" is a compound: macules are small, flat discolored spots on the surface of the skin; and papules are small, raised bumps. It is also described as erythematous, or red.
There is a special pathognomonic sign called Koplik's spots, seen inside the mouth on the cheek next to the molars.  Pathognomonic means it is characteristic enough to make a diagnosis by itself.  However, the spots are transitory and frequently missed.



Epidemiology

Measles is a highly contagious infectious disease.  Nine out of ten people who are not immune and share living space with an infected person will be infected. People are infectious to others from four days before to four days after the start of the rash.

Not only are presymptomatic individuals infectious, but the virus spreads by aerosol, meaning that it can survive in an infectious state even in the very small droplets that waft around and don't fall to the ground within a few minutes.  The particles can stay airborne for hours.  This ability is unusual, and indicates the virus resists inactivation due to drying out.

The CFR (case fatality rate) ranges from 1-3/1000 for a well-nourished, healthy individual, to as much as 10% or more, for other populations.  Vitamin A-deficiency is very problematic, and supplementation is recommended.

According to wikipedia, measles killed 20 percent of Hawaii's population in the 1850s. In 1875, measles killed over 40,000 Fijians, approximately one-third of the population.

Typically the first tissue infected is the lining of the airways, but the virus eventually travels through lymph nodes, infects cells of the immune system, and then moves into the blood causing widespread viremia.

Bacterial pneumonia is one of the common sequelae, and that's what most people die from.  Other problems include ear infections, blindness, severe diarrhea, encephalitis (1/1000) and problems in pregnancy.  In very rare cases (1/1M), measles can reactivate years later to cause SSPE (subacute sclerosing pan-encephalitis).

Host restriction
Measles virus infection is presumed to be sustained through an unbroken chain of human-to-human transmission, and no animal or environmental reservoir is known to exist. However, nonhuman primates can be infected with measles virus and can develop an illness similar to measles in humans with rash, coryza, and conjunctivitis. Many primate species are susceptible to measles virus infection, including Macaca mulatta ... Much of the evidence for the susceptibility of these nonhuman primates comes from laboratory colonies and the use of nonhuman primates as animal models for the study of measles virus pathogenesis.
One of the most interesting aspects of measles epidemiology is that the virus is so infectious, it runs out of hosts if the population is too small.  With a larger population, it comes back every few years as a new crop of susceptible hosts develops.
To provide a sufficient number of new susceptibles through births to maintain measles virus transmission in humans, a population size of several hundred thousand persons with ∼5000–10,000 births per year is required
Surveys of wild populations have sometimes revealed non-human primates with antibodies to measles virus.  It is believed that the virus was spread from humans to one of these animals, followed by limited spread and then die-out.

ref

Vaccine

The first laboratory to grow the virus was that of John Enders and colleagues.  They also were first to culture poliovirus, which led to work on the vaccines by Salk and Sabin.  Enders et al. received the Nobel Prize in 1954 for this work.

The vaccine strain is named for the boy from whom that virus was cultured, Edmonston.

The virus was weakened by successive culture in
- human kidneys
- human placenta
- hen's eggs
- chick embryos

Although significantly weakened by this serial culture,  it still caused rash and fever, sometimes high enough so that children had seizures.

The first thing Hilleman did was give the vaccine together with gamma globulin from people who had recovered from measles.  He then passed Enders' measles vaccine strain through chick embryo cells more than 40 times.

Vaccinated is a biography of Hilleman.  It tells the story of Hilleman obtaining specially-bred chickens that were free of chicken leukemia virus.

The vaccine is highly effective.


Despite significant diversity of virus isolates, Measles virus remains a monotypic virus for which protective immunity is induced by vaccine strains first isolated in the 1950’s.

Origin

The genus Morbillivirus includes similar viruses that infect dogs, cats, whales, seals and cattle.  The disease of cattle is known by its German name:  Rinderpest ("cattle plague").  Of the relatives, the rinderpest virus is the closest to measles.

At NCBI I searched for measles and found 363 genome nucleotide sequences, most of which appear complete.  I just chose one at random, NC_001498, and then got the sequence of its nucleocapsid gene, NP_056918.  A BLAST search gave a large number of hits with other Measles virus isolates down to 97% identity.

Restricting the search to Rinderpest (taxid: 11241), I got numerous hits as well, in the range of 75-80% identity.  But if you look at the alignments, there is a C-terminal region that diverges.  The N-terminal 400 aa (of 524) matches very well.  Restricting the search to 1-400 the matches were more like 88% identical, like this one:



















Here is a phylogenetic tree of Morbilliviruses from this review:



One can often recognize present-day diseases in descriptions from ancient times, but measles is missing from those accounts.  The first systematic description of measles, and its distinction from smallpox and chickenpox, is credited to the Persian physician Rhazes (860–932), who published The Book of Smallpox and Measles.

Analysis of the sequence diversity of viral isolates suggests that the last common ancestor of Measles virus and Rinderpest virus existed around 1000 AD, give or take.  It is also thought that the virus "jumped" from cattle to humans, made possible by the domestication of livestock and the growth of the human population to a level that could support the virus.
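The dating argument is a molecular clock calculation.  Here is a minimal sketch in Python of the simplest (strict clock) version.  The function name and both input numbers are placeholders of mine, chosen only to show the arithmetic; they are not measured values, and real analyses use dated isolates and much more elaborate statistical models.

```python
def divergence_time_years(distance_subs_per_site, rate_subs_per_site_per_year):
    """Strict molecular clock estimate of time since two lineages split.

    Both lineages accumulate substitutions independently after the split,
    so the observed distance between them grows at twice the per-lineage rate.
    """
    if rate_subs_per_site_per_year <= 0:
        raise ValueError("substitution rate must be positive")
    return distance_subs_per_site / (2.0 * rate_subs_per_site_per_year)


# Placeholder inputs (assumptions for illustration, NOT measured values).
example_distance = 0.8   # substitutions per site between the two viruses
example_rate = 4e-4      # substitutions per site per year, per lineage

years_ago = divergence_time_years(example_distance, example_rate)
print(f"Estimated split: about {years_ago:.0f} years before the isolates were sampled")
```

The point is just that a distance and a rate give you a time; the uncertainty in the rate is what produces the "give or take".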

Links

Enders Nobel prize and lecture
Enders biography
Hilleman biography
Hilleman obit

Monday, May 11, 2020

Introduction to animal viruses

A long time ago, in a place far far away, I used to give lectures to medical students about microbial physiology and genetics.  I also gave two lectures introducing the viruses that infect humans:  not much about disease yet, but morphology and replication strategies, and so on.

Here is one figure I used:  a cartoon of what various RNA viruses look like in the EM, drawn to scale.  (It's from Lange).  The morphology is quite diverse.



Our attention is currently focused on the one in the middle of the top row.  Now, there are a lot of properties that viruses have:  is the genome RNA or DNA, single- or double-stranded and so on.  Also the shape of the capsid, and whether or not there is a lipid envelope.

I had a hard time remembering all this stuff (I actually could not), so I made up a picture that was successful.  This is the general organization for RNA viruses.


Here is how I used it.
So the idea is, you remember the order of the different viruses in the table, + sense on top, - sense underneath, with one double-stranded at the end of the second row.

Then you memorize a pattern of active dots for each property that you need to know.  I required them to know which viruses were enveloped, and which had segmented genomes.

There is another thing that's a bit of a detail, but some may find it useful.  There is so much material in lectures, especially to medical students, that it has been described as "trying to drink from a fire hose."  I am very sympathetic to them on this issue.  I adopted the strategy to color-code text on slides:  blue means you must know this, black means it's important and you may need to know this, and gray means it's something I want to talk about but you do not need to know this.  And then another color, like salmon, for the title of a slide to tell what it's about.

Here's another summary slide showing the Arboviruses.  These are "arthropod-borne" viruses (i.e. carried by insects or ticks).

This is the sum total of Coronavirus information (characteristics of the infectious process were taught later, in the context of lung infections).


Finally, here's another cartoon of DNA viruses.


I was pretty proud of myself for coming up with that aid to memory.

I have put links to the two lectures up on Dropbox.  I can't guarantee they'll stay up forever, but we'll see.  Introduction to viruses   Virus systematics

Sunday, May 10, 2020

Electrophoresis

Take a rectangular tank and epoxy in electrodes at opposing ends, like this


Add an appropriate buffer, and then hook the electrodes up to + and - output from a regulated power supply, and you will get a voltage between the two wires that causes a current to flow.  Biological molecules carry electric charge, so they will move too, in an appropriate supporting medium.

Electrodes are made from thin platinum wire.  Since the molecules are negatively charged, they move toward the anode, the positive electrode (red).



Gels

Two materials have traditionally been used for gels:  polyacrylamide and agarose.  Acrylamide is on the left, and its polymerized form is on the right.

 
Acrylamide is cross-linked into a mesh by the inclusion of a small amount of bis-acrylamide (technically, N,N'-methylenebisacrylamide).





You mix a solution of acrylamide and bis-acrylamide plus the appropriate buffer (5-20% is a typical range for acrylamide), then the reaction is started by addition of a small amount of TEMED and ammonium persulfate.  The mixture is poured into a mold (glass plates separated by spacers, with something at the bottom to keep the liquid from running out).
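The arithmetic for hitting a given acrylamide percentage is a simple dilution (C1 x V1 = C2 x V2).  Here is a minimal sketch in Python.  The 30% stock concentration and the 30 mL gel volume are assumptions for illustration, and the function name is made up; use whatever your stock and mold actually are.

```python
def acrylamide_stock_ml(final_percent, total_ml, stock_percent=30.0):
    """Volume of acrylamide stock needed for a gel of the given final percentage.

    Plain dilution: stock_percent * stock_volume = final_percent * total_volume.
    stock_percent=30.0 is an assumed stock concentration, not a requirement.
    """
    if not 0 < final_percent <= stock_percent:
        raise ValueError("final percentage must be positive and no higher than the stock")
    return final_percent * total_ml / stock_percent


gel_volume_ml = 30.0  # assumed volume of the mold
for pct in (5, 10, 20):
    stock = acrylamide_stock_ml(pct, gel_volume_ml)
    rest = gel_volume_ml - stock
    print(f"{pct:>2}% gel: {stock:.1f} mL stock + {rest:.1f} mL buffer and water")
```

The TEMED and ammonium persulfate volumes are small enough that they are ignored here.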

Once it's set, the plug at the bottom is removed.  Electrical continuity is maintained by wedging a piece of sponge into the bottom.

A gel mold contains two glass plates, one of which is notched.  The "ears" of the notched plate have a tendency to get broken, so usually a set comes with an extra notched plate when you buy them.

The other material is agarose.  This is a purified form of the agar that is used for bacteriological plates (Petri dishes).  Agar is a polysaccharide extracted from certain kinds of seaweed.  Agar has been used to solidify desserts for a long time.  It was introduced to Koch's laboratory by the wife of one of his assistants, who knew about it.  He never publicly credited her with the idea.

[ It's amusing that the very first medium used for isolation of single colonies of bacteria was a potato, sliced through.  Appropriate for a German laboratory, I think. ]

Biochemically agarose is a repeating polymer of dimers of galactose plus a galactose derivative.

Agar and agarose have the property that when mixed with water, boiled and then cooled, the material sets into a gel (like jello, but generally stiffer) at around 45°C.  Once set, it can be heated a lot higher without losing its physical properties.  The stiffness depends on the concentration of agarose used.  0.8-1.5% would be usual.
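Those percentages are weight per volume, i.e. grams of agarose per 100 mL of buffer.  A tiny Python sketch of the calculation (the function name and the 50 mL gel volume are just for illustration):

```python
def agarose_grams(percent_w_v, volume_ml):
    """Grams of agarose for a gel, where percent_w_v means grams per 100 mL."""
    if percent_w_v <= 0 or volume_ml <= 0:
        raise ValueError("percentage and volume must be positive")
    return percent_w_v / 100.0 * volume_ml


gel_volume_ml = 50.0  # assumed gel volume
for pct in (0.8, 1.0, 1.5):
    print(f"{pct}% gel in {gel_volume_ml:.0f} mL: weigh out {agarose_grams(pct, gel_volume_ml):.2f} g")
```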

Agarose gel electrophoresis also requires an appropriate buffer (e.g. Tris-acetate).  This type of electrophoresis is extremely convenient.  The gel is non-toxic, easily prepared by boiling, and the gel can be poured flat (see the picture above).  You can't do this with polyacrylamide because oxygen inhibits the polymerization reaction.

Samples are loaded into wells under the surface of the buffer.  The aqueous samples are made dense by addition of glycerol to 10% or so.  Dyes (bromophenol blue and sometimes xylene cyanol) are also added to the samples.  They move at characteristic rates and allow you to visualize the progress of the separation.

Separation in electrophoresis

For DNA at neutral pH, charge is carried by the phosphate groups of the backbone, each of which contributes one negative charge per nucleotide (a terminal phosphate can carry one or two, depending on the exact pH).  This means that the charge to mass ratio is essentially constant for DNA or RNA of different lengths.

The reason that DNA or RNA molecules of different sizes separate is the existence of a retarding force that is greater for larger molecules.  Or maybe it's better to turn that around:  we observe that the distance traveled falls off roughly linearly with the log of the length of the molecule, and infer the existence of a force that depends on length.
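In practice, fragment sizes are usually read off a standard curve built from a ladder of known sizes, on the common empirical assumption that log10(length) decreases roughly linearly with the distance migrated.  Here is a minimal Python sketch of that procedure.  The ladder distances below are made-up numbers for illustration, not measurements from a real gel, and the function names are mine.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def estimate_size_bp(distance_mm, ladder):
    """Estimate fragment length from migration distance.

    ladder is a list of (distance_mm, size_bp) pairs for the marker bands.
    The fit is log10(size) against distance, so the result is only as good
    as the assumption that this relationship is linear over the range used.
    """
    distances = [d for d, _ in ladder]
    log_sizes = [math.log10(bp) for _, bp in ladder]
    slope, intercept = fit_line(distances, log_sizes)
    return 10 ** (slope * distance_mm + intercept)


# Hypothetical ladder: (distance migrated in mm, fragment size in bp).
hypothetical_ladder = [(12.0, 3000), (17.0, 2000), (25.5, 1000), (34.0, 500), (46.0, 200)]

unknown_distance_mm = 30.0
size = estimate_size_bp(unknown_distance_mm, hypothetical_ladder)
print(f"Band at {unknown_distance_mm} mm is roughly {size:.0f} bp")
```

The linear relationship breaks down for very large fragments, so only trust estimates that fall inside the range covered by the ladder.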

Samples for protein gels are typically prepared by boiling a protein mixture in the presence of the detergent SDS (sodium dodecyl sulfate).  The hydrophobic part coats the protein and destroys its secondary structure.  The evenly spaced sulfate groups impart negative charge.  As with DNA, the charge to mass ratio is roughly constant for polypeptides of different lengths.  Separation occurs by means of the size-dependence of the retarding force.

Visualization

The classic method for visualizing protein gels is to stain with a blue dye (Coomassie brilliant blue).  In this picture we can see a protein gel drying after the electrophoresis has been run.  The blue spots are proteins.

DNA gels were often stained with a fluorescent dye such as ethidium bromide.

Ethidium is moderately mutagenic, so substitutes have been developed.

Alternatively if the material is radioactive, you just expose the dried (or even wet) gel to X-ray film.  These days, they have fancy apparatus that records the emitted beta particles without the use of film.  I remember the revolution caused by the introduction of automatic film processors.

Laemmli

To get the best resolution, you want the bands of protein or nucleic acid to be as thin as possible.  Here is a gel with very nice resolution:


The thickness of the bands depends on how much of each protein is present in the sample.

To get a pretty gel (one with nice thin bands), for DNA or RNA the important thing is to have as little sample as possible and to run a thin gel (like 0.4 mm).

For protein gels, there is a trick, invented by Laemmli.  There is a combination of two gels, one on top called the stacking gel, and a larger one below called the running gel.  The system has 3 different buffers.  The upper and lower tank buffers contain glycine as the mobile anion and are at about pH 8.3.  The gels are:

Stacking gel:  3% acrylamide, pH 6.8
Running gel:  5% - 20% acrylamide, pH 8.8

This system compresses a sample which might be almost a centimeter from top to bottom when first loaded, into a set of protein bands much less than one mm thick as they exit the stacking gel.
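The usual textbook explanation for the stacking (not spelled out above) turns on the charge of glycine at the two gel pH values:  at pH 6.8 almost none of it is negatively charged, so it trails behind the proteins, while at pH 8.8 a much larger fraction is charged and it overtakes them.  Here is a rough Python sketch using the Henderson-Hasselbalch equation.  The pKa of 9.6 for the amino group is an approximate textbook value that I am assuming, and the carboxyl group is treated as fully ionized.

```python
GLYCINE_AMINO_PKA = 9.6  # approximate textbook value (an assumption here)


def fraction_anionic(ph, pka=GLYCINE_AMINO_PKA):
    """Fraction of glycine with a deprotonated amino group, i.e. net charge -1.

    Henderson-Hasselbalch: [base]/[acid] = 10 ** (pH - pKa), so the
    deprotonated fraction is 1 / (1 + 10 ** (pKa - pH)).
    """
    return 1.0 / (1.0 + 10.0 ** (pka - ph))


for label, ph in (("stacking gel", 6.8), ("running gel", 8.8)):
    pct = 100.0 * fraction_anionic(ph)
    print(f"{label} (pH {ph}): {pct:.2f}% of glycine carries a net negative charge")
```

The difference between the two gels is close to a hundredfold, which is why the same ion behaves so differently in each.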

One last thing:  a lab running protein gels will have a characteristic smell of sulfur.  That's because a sulfhydryl reagent like beta-mercaptoethanol will be present in the samples to break disulfide bonds in the proteins.  It's minimally dangerous in small quantities, but these days the safety police make you boil your samples in a fume hood.