Python for Bioinformatics: Deonier Ch 2 Codon Adaptation Index (1)

Saturday, August 9, 2008

Deonier Ch 2 Codon Adaptation Index (1)

The last part of Ch. 2 talks about k-words with k > 2, e.g. codons. About 20 years ago people first noticed that the codons used by highly expressed genes seemed to be a subset of possible ones. I'll discuss why this should be in a later post, and I'll also postpone discussion of a test of the hypothesis

But that is where we are headed, using the genome sequence and array data on expression levels. The first thing is to get the sequences of all the E. coli genes. Rather than look around for this resource or use BioPython I decided to do it (quickly I thought) by hand. The difficulty is that this code is always finicky to write. I went to the page listing bacterial genomes at NCBI and grabbed the genome sequence (NC_000913) in text format, making sure to include the DNA sequence. Then I just ran the script below, which parses all CDS entries, obtains the gene name and sequence, and writes the result to disk. We filter out two complex genes (multiple ORFs), and throw away 85 entries which seem to be mis-annotated. The result contains 4157 gene sequences, which we save in a text file.

# file contains NC_000913 as GenBank format
# has the sequence
import string, sys

data = open('EC.genome.txt','r').read().strip()
locus, rest = data.split('\n',1)
rest, seq = rest.split('ORIGIN')
seq = ''.join([c for c in seq if c in 'acgt'])
print locus[:50]
print len(seq)
#---------------------------------------------
def process(L):
    def getName(s):
        # e.g. '/gene="ileV"'
        return s.split('=')[1].replace('"','')
    def getCoord(s):
        # e.g. 'complement(287628..288386)'
        if '(' in s:
            s = s.split('(',1)[1].replace(')','')
            stop,start = s.split('..')
        else:
            start,stop = s.split('..')
        return start,stop
        
    # multiple segments
    if L[0].count('..') > 1:    return
    first, second = L[0],L[1]
    first = first.split()[1].strip()
    start,stop = getCoord(first)
    g = getName(second.strip())
    # global
    CDSList.append( (g, start, stop) )

# we use CDS b/c we know they are proteins
# we want the CDS coordinates (on same line)
# but we also want the CDS name
# usually on the next line but not always!

def getCDSList(s):
    lines = s.split('\n')
    line = lines.pop(0)
    D = dict()
    while line:
        if CDS in line:
            L = [line]
            L.append(lines.pop(0))
            if gene in L[-1]:
                process(L)
            else:   # split gene
                pass
        line = lines.pop(0)
#---------------------------------------------
def reversecomplement(s):
    tt = string.maketrans('ACGTacgt','TGCAtgca')
    s = string.translate(s,tt)
    L = list(s)
    L.reverse()
    return ''.join(L)
    
def groupSequence(seq,SZ=100):
    L = list()
    s = seq[:SZ]
    seq = seq[SZ:]
    while s:
        L.append(s)
        s = seq[:SZ]
        seq = seq[SZ:]
    return '\n'.join(L)
#---------------------------------------------
CDS = '     CDS  '
gene = '/gene'
CDSList = list()
getCDSList(rest)
results = list()

def show(g,s):    
    print g,s[:12] + '..' + s[-12:]

for t in CDSList:
    g, start, stop = t
    start = int(start)
    stop = int(stop)
    if start < stop:
        s = seq[start-1:stop]
    else:
        s = seq[stop-1:start]
        s = reversecomplement(s)
    s = s.upper()        
    # 85 are apparently mis-annotated
    # 29 of them have names e.g. pcnB
    if not s[:3] in ['ATG','GTG','TTG']\
    or not s[-3:] in ['TAG','TGA','TAA']:
        if not g[0] == 'y':  show(g,s)
    else:
        s = groupSequence(s)
        results.append('>' + g +'\n' + s)
        
print len(results)
FH = open('EC.genes.txt','w')
FH.write('\n\n'.join(results))
FH.close()