Python for Bioinformatics: October 2010

Saturday, October 30, 2010

Additive variance for addition and subtraction

I've learned that

"for the sum or difference of independent random variables, variances add"

(source)

Here is a Python simulation with numbers drawn from a normal distribution with mean = 0 and std = 2. As expected, the actual variance observed for a sample is not exactly equal to 4, though it approaches this as N increases. What was a surprise to me is that the statement "variances add" is only approximately true for actual data. (That is, unless I've screwed up somehow). Looking back, I saw this before but ignored it.

Here are trials for various N, showing the mean, variance and standard deviations of the samples, the sums of the actual variances, and the variances of the sum and difference.

[UPDATE: I handled the printout of N's value poorly. For example, in the interpreter:

>>> 10e2
1000.0

for N = 100, when what I actually did in code was 10**2. ]

    mean       var       std

N = 10e2
   -0.009     3.957     1.989  data1
    0.339     3.192     1.787  data2
    0.330     7.149     3.776  total
add
    0.330     7.101     2.665  combo
sub
   -0.348     7.197     2.683  combo
------------------------------
N = 10e3
    0.066     3.727     1.931  data1
    0.009     4.027     2.007  data2
    0.076     7.754     3.937  total
add
    0.076     7.753     2.784  combo
sub
    0.057     7.755     2.785  combo
------------------------------
N = 10e4
   -0.025     4.007     2.002  data1
    0.041     4.089     2.022  data2
    0.015     8.097     4.024  total
add
    0.015     8.087     2.844  combo
sub
   -0.066     8.106     2.847  combo
------------------------------
N = 10e7
    0.000     4.001     2.000  data1
    0.001     3.999     2.000  data2
    0.001     8.000     4.000  total
add
    0.001     8.001     2.829  combo
sub
   -0.000     7.998     2.828  combo
------------------------------

import numpy as np

def show(stats,name=None):
    L = [' %3.3f ' % e for e in stats]
    print ''.join([e.rjust(10) for e in L]),
    if name:  print name
    else:     print 

def get_stats(data):
    funcs = [np.mean,np.var,np.std]
    return np.array([f(data) for f in funcs])

def init(N):
    total = np.zeros(3)
    f = np.random.normal
    rL = list()
    for i in range(2):
        rL.append(f(loc=0,scale=2,size=N))
        stats = get_stats(rL[-1])
        show(stats,'data'+str(i+1))
        total += stats
    show(total,'total')
    return rL
    
def add(A1,A2):  return A1 + A2
def sub(A1,A2):  return A1 - A2
#---------------------------------------
for e in [2,3,4,7]:
    N = 10**e
    print 'N = 10e' + str(e)
    data1, data2 = init(N)
    for f in [add,sub]:
        print f.func_name
        combo = f(data1,data2)
        show(get_stats(combo),'combo')
    print '-'*30

Friday, October 29, 2010

Another head-scratcher

I know about keywords in Python. You can't do this:

>>> and = 3
  File "<stdin>", line 1
    and = 3
      ^
SyntaxError: invalid syntax

But I do like simple names for objects and functions. I acquired this habit from Stroustrup's book on the C++ language, 10 years ago, before I even knew about Python.

However, I must remember not to make the names too familiar. For example, yesterday I rashly named a script new.py. And this single line caused problems that had me scratching my head:

import matplotlib.pyplot as plt

Apparently matplotlib also uses the name new for a module (although I haven't actually located it):

Traceback (most recent call last):
  File "new.py", line 1, in 
    import matplotlib.pyplot as plt
  File "/Library/Python/2.6/site-packages/matplotlib/pyplot.py", line 6, in 
    from matplotlib.figure import Figure, figaspect
  File "/Library/Python/2.6/site-packages/matplotlib/figure.py", line 18, in 
    from axes import Axes, SubplotBase, subplot_class_factory
  File "/Library/Python/2.6/site-packages/matplotlib/axes.py", line 2, in 
    import math, sys, warnings, datetime, new
  File "/Users/telliott_admin/Desktop/new.py", line 1, in 
    import matplotlib.pyplot as plt
AttributeError: 'module' object has no attribute 'pyplot'

[UPDATE:

new.py is not a matplotlib module, it's in the standard library, but it's deprecated. It provides a way to instantiate a new object without calling __init__. I'm not sure what situation you would want to use it for other than what it says in the docs.

There is one use of new in axes.py but I don't understand it:

_subplot_classes = {}
def subplot_class_factory(axes_class=None):
    # This makes a new class that inherits from SubclassBase and the
    # given axes_class (which is assumed to be a subclass of Axes).
    # This is perhaps a little bit roundabout to make a new class on
    # the fly like this, but it means that a new Subplot class does
    # not have to be created for every type of Axes.
    if axes_class is None:
        axes_class = Axes

    new_class = _subplot_classes.get(axes_class)
    if new_class is None:
        new_class = new.classobj("%sSubplot" % (axes_class.__name__),
                                 (SubplotBase, axes_class),
                                 {'_axes_class': axes_class})
        _subplot_classes[axes_class] = new_class

    return new_class

Wednesday, October 27, 2010

More on sampling a distribution

Some time ago I had a post about sampling from a distribution. More specifically, the problem is to generate random samples given a cdf (cumulative distribution function). There have been a couple of useful comments, and I'd like to extend the explanation as well.

Previously, a function was defined that can be used to generate a probability for any value of a random variable x, from a probability density function or distribution (pdf) with a given mean and standard deviation. We remember that for a continuous distribution, the actual probability at any discrete x is zero, since the total number of possible x's is infinite. Technically the definition is that the pdf is the derivative of the cdf, which I misremembered as the cumulative density function. So the cdf(x) is the area under the pdf from negative infinity to x.

One nice thing about this approach is it is then easy to define a sum of weighted distributions.

We get what looks like a smooth curve by plotting a (relatively) large number of points (in this example, 3502). The plotting uses matplotlib (see this post referencing set-up on the Mac). The cdf is computed by simply accumulating values from the pdf. Normalization is usually done by dividing by the total, but the method I showed was just slightly more subtle:

cdf *= dx

.

Unfortunately, it is also wrong! (Sorry). I won't try to explain what I was thinking, but the fact that it gave an accurate value (as shown by the maximum of the cdf being equal to 1) is an accident, and you should instead simply divide by the sum of the values, and moreover, do the operation on the pdf before constructing the cdf:

pdf /= sum(pdf)

This change to the pdf means we need to magnify it before plotting, and really, should provide a different y-axis on the right hand side, with the true values. The left-hand y-axis only is accurate only for the cdf.

The idea for sampling is to generate a random float using np.random.random(), and then ask which of the values in the cdf (which by definition are ordered), first exceeds this value. The indexes resulting from repeating this procedure are concentrated in the steep part of the cdf (the peaks of the pdf), because the probability that a given position in our "discretized" form of the cdf satisfies this relationship is proportional to the slope of the cdf curve (the added vertical distance between a given index and the one previous).

As a reader suggested, an improvement to the code is to recognize that the list we're searching (the cdf) is ordered and so can more efficiently be searched using a binary search. The code is a little tricky to write, so I skipped it last time. And luckily, Python comes with "batteries included" and for this application what we want is the bisect module from the standard library. The example find_le function's docstring says: 'Find rightmost value less than or equal to x'. I just modified this code (which calls bisect.bisect_right) to return the index rather than the value.

We're interested in intervals where the slope is steepest. The original find_first function returned the index of the right-hand value, while this function will return the index of the left-hand value. I suppose either one is fine, but perhaps it would be better to use the midpoint.

There are a few more steps in the code that are a little obscure, including bins with fractional width, and the use of the Counter class to organize the data for the histogram. But this post is getting a bit long so I'll skip them for now.

Modified code:

import math, sys, bisect
import numpy as np
import matplotlib.pyplot as plt
import Counter

def normal(mu,sigma):
    def f(x):
        z = 1.0*(x-mu)/sigma
        e = math.e**(-0.5*z**2)
        C = math.sqrt(2*math.pi)*sigma
        return 1.0*e/C
    return f

p1 = normal(0,2)
p2 = normal(10,1)
p3 = normal(18,0.5)

# sum of weighted normal distributions
@np.vectorize
def p(x):
    return 0.5*p1(x) + 0.25*p2(x) + 0.25*p3(x)

dx = 0.01
xmax = 25    
R = np.arange(-10,xmax+dx,dx)
# dashed lines
plt.plot((R[0],R[-1]),(0,0),color='k',ls=':',lw=2)
plt.plot((R[0],R[-1]),(1,1),color='k',ls=':',lw=2)
plt.plot((R[0],R[-1]),(-0.5,-0.5),color='k',lw=2)

pdf = p(R)
S = sum(pdf)
pdf /= S
plt.plot(R,pdf*(len(R)/10.0),color='r',lw=3)
#print len(R)
S = sum(pdf)
print S
#===========================================
cdf = [pdf[0]]
for e in pdf[1:]:
    cdf.append(cdf[-1] + e)
cdf = np.array(cdf)
#cdf /= S
plt.plot(R,cdf,color='b',lw=3)

ax = plt.axes()
ax.set_xlim(-6,xmax)
ax.set_ylim(-0.55,1.05)
#===========================================
def find_le(n,L):
    'modified from bisect.find_le'
    i = bisect.bisect_right(L,n)
    if i:
        return i
    raise ValueError

samples = list()
width = 0.2
f = 1/width
for i in range(10000):
    n = np.random.random()
    # must adjust to actual range
    value = find_le(n,cdf)*dx - 10.0
    # trick to truncate at fractional values
    samples.append(np.floor(f*value)/f)
#===========================================
c = Counter.Counter(samples)
maxn = c.most_common(1)[0][1]
for k in c:
    n = c[k]
    n = n * 0.45 / maxn
    r = plt.Rectangle((k,-0.5),
                      width=width,height=n,
                      facecolor='magenta')
    ax.add_patch(r)
plt.savefig('example.png')

Thursday, October 21, 2010

Progressive politics

Mark Kleiman had a post this morning in which he says

Evolution is just like global warming: each has such overwhelming scientific evidence behind it that denial of either one is strong evidence that the speaker is either ot-nay oo-tay ight-bray or blinded by some prejudice.

This would be a typical statement from "the reality-based community" but it glosses over a point which should be remembered.

Evolution as a theory is on a par with the theory of atoms. The weight of evidence is literally overwhelming. Imagine that TV-personality like Glenn Beck declared his non-acceptance of the theory of atoms. Obviously, no one would care. So why do the media care when he says the equivalent about evolution? It makes no sense from a reality-based perspective, it's just political theater. The really odd thing about evolution is that the average person seem to believe that the evidence can be evaluated by amateurs like himself.

Climate change is more complicated. The evidence that the earth's temperature is rising is unequivocal, according to the experts. The real question at issue is: what part of that change is due to human activity? And perhaps more important for me, what is the prospect for changes in policy in the U.S. or around the world that would have an impact on our carbon footprint.

I remember seeing a video featuring Steven Chu in which he said that his reading of the experts was that there's about an 80% chance that the change is due to human activity. Unfortunately, I can't find the link now, so here's another video in which he gives an estimate of a 50% chance of a 3°C temperature rise in this century and talks about what that would mean. In short, chaos.

With respect to policy, raw political advantage trumps good policy for the republicans (at least right now), so we will likely not achieve anything on this front, even by mechanisms like "cap and trade" that were originally republican "market-driven" ideas. Also, of the estimated roughly 900 billion tons of coal in the world, a substantial fraction is in China, and I expect them to burn all of that.

Saturday, October 16, 2010

Is Heinlein a Bayesian?

Brad DeLong had a post yesterday about Robert Heinlein, who I read extensively as a teenager and enjoyed tremendously. The subject was The Moon Is A Harsh Mistress, which I thought was a great story. I like Brad's blog a lot, and learn a lot about economic issues from him, but I thought the post so wrong as to make me question whether he is as knowledgable on other topics as he seems to be.

DeLong says "Heinlein [has a] failure to understand the Law of the Reverend Thomas Bayes," and then approvingly quotes another writer who complains that you can't compute the odds for an event and come up with different answers at different times. But of course this is the whole point of Bayes. If you understand nothing else about the issue, you should realize that what Bayes makes possible is to update probabilities as new evidence becomes available.

So I decided to say something in a comment, nicely I thought. I'm quite mystified to find my comment has been deleted a few hours later. I'll shrug my shoulders and go on, but I am shocked, shocked, first, that Brad DeLong is clueless about Bayes, second, that a whole string of commenters has missed the central point, and then finally that a polite comment pointing this out should be thought subversive. Very weird indeed, and disheartening.

[UPDATE: Perhaps it's just a misunderstanding. The original post says:

I have two problems with Mike. One is the figuring the odds for the revolution. I’d have bought it if he did it once. It’s the complex refiguring and odds changing and—no. People complain about the Dust hypothesis in Permutation City that you can’t calculate things out of order, and this is worse. You can’t work out odds of 7 to 1 against and then say they will keep getting worse until they get better.

The statement about "complex refiguring and odds changing and—no" is just wrong. And you can calculate in any order you like.

But it turns out Brad's objection and maybe this poster's is that they read Heinlein as saying " [the odds] will keep getting worse until they get better." You can't keep getting installments of bad news and say things are looking up.

But that's not how I remember the book. Mike knows things, everything public anyway. Some events break the revolutionists' way early, but a thousand small unmentioned things change the balance. And time is running out. Read the book, and judge for yourself. One of Heinlein's best, along with my favorite, Stranger in a Strange Land.]

[UPDATE 2:

I dug into it a little bit and I am convinced that I'm right. The original post is a very nice review of the book. Great story, but lots of not-PC stuff, characteristic Heinlein. However, it is clear that (1) Jo Walton does not understand Bayes (hence the first part of the quote above) and (2) Heinlein does, at least to the extent he knows how to bet on a horse race, and can appropriately update odds when in possession of new information.

I looked through an online version of the book (pdf) and read all the pages where the word "odds" appears, and nearby. For an example showing Heinlein's understanding, you might read the discussion about the comparative advantage of various strategies depending on the selected time frame (p. 57-58).

The sequence that Brad DeLong thinks damning, that the revolutionists' odds of success are improved after the trip to earth, is just unclear. We're never told what Mike has been "thinking" during this time, or much about what's been happening overall. The evidence used to update the odds is not given to us.

The only counterargument that I could find against my interpretation is based on this dialog on p. 110:

As soon as Stu went Earthside, Mike set odds at one in thirteen. I asked him what in hell? “But, Man,” he explained patiently, “it increases risk. That it is necessary risk does not change the fact that risk is increased."

This doesn't make any sense. But I think Heinlein was probably just seduced by the fact that it sounds a good line. And maybe he is thinking about the time-frame dependence.

The whole rest of the book makes perfect sense from a Bayesian perspective.

Now, I'm not defending Heinlein's political philosophy. There is a lot to complain about (just look at the wikipedia entry for this book). But it's still a great story. Three things irritated me (from most to least important):

• the post held up as a model of correctness was itself wrong on the fundamentals
• DeLong wrongly criticized Heinlein as a mathematical simpleton
• he deleted my comment pointing this out

Too bad no one will ever know. ]