Python for Bioinformatics: GC content

Sunday, August 10, 2008

GC content

Before I leave Ch 2 of Deonier behind, I have one last program to show. The goal is similar to previously, we want to look at GC-content in a window sliding across the genome. But we want to make the window smaller, and rather than look at the Chi-squared statistic, we'll just plot %GC. The first program I wrote generates this (when plotted in R):

The statistics (number of intervals with GC-content in the indicated range) indicate many more low GC segments than expected. I have to check this, but I think that for a binomial distribution with p = 0.5, the standard deviation is the square root of n * 0.5 * 0.5 or 5 for 100 trials. So, we'd expect 99% of random trials to have GC > 35%. What we observe is that 1595 / 46397 are < 35%, about 3.4%.

# window and step size
W = 100
results = list()
s = seq[:W]
seq = seq[W:]

start = time.time()
while s:
    results.append(
        str(s.count('G') + s.count('C')))
    s = seq[:W]
    seq = seq[W:]

print '%3.2f' % (time.time() - start)
# 254.33

However, as indicated by the comment, it takes an embarassingly long time to run---almost 4 minutes. This is the way I would normally step through a sequence but I got an idea to make it faster. That is to step through the sequence 1 nt at a time with two pointers separated by 100 nt. If the leading pointer hits G or C, we increment the count, and if the lagging pointer hits G or C we decrease it by one. We start with a GC content at the equilibrium value and save every 100 nt. Notably, it runs much faster than the other code, but the plot we generate is subtly different. Well actually, it's not that subtle---the average looks to be shifted up near 57% and the while the patterns are clearly related, they are not the same. I'm not sure why, at the moment.

# window and step size
W = 100
SZ = len(seq)
results = list()
print SZ

GC = 50
start = time.time()
for i in range(SZ-W):
    j = i + W
    nt = seq[i]
    if nt == 'G' or nt == 'C':
        GC -= 1
    nt = seq[j]
    if nt == 'G' or nt == 'C':
        GC += 1
    if not i % 100:
        results.append(str(GC))
        
print '%3.2f' % (time.time() - start)
# 10.47