Python for Bioinformatics: Kyte-Doolittle hydrophobicity plot in Python

Tuesday, March 30, 2010

Kyte-Doolittle hydrophobicity plot in Python

I had a request the other day to compute a hydrophobicity plot for a protein. The idea (Kyte & Doolittle, PMID 7108955) is to assign a "hydrophobicity" value to each amino acid. We slide a window along a protein and sum the values in the window, then plot the result. The plot shown above suggests the presence of six long hydrophobic stretches in this protein, which could span the membrane and thus mark a transition from "inside the cell" to "outside the cell" (or the reverse).

A   1.8 C   2.5 D  -3.5 E  -3.5
F   2.8 G  -0.4 H  -3.2 I   4.5
K  -3.9 L   3.8 M   1.9 N  -3.5
P  -1.6 Q  -3.5 R  -4.5 S  -0.8
T  -0.7 V   4.2 W  -0.9 Y  -1.3

I found a page at ExPASy that will do this.

It's pretty sophisticated, with 50 or so possible classifications (e.g. alpha-helix and beta-sheet). Since it's a fun (and simple) Python problem, I downloaded their data, as shown above. Most of the code (below) involves parsing the file with the data into a Python dictionary, and reading the file with the protein sequence. The code to scan the sequence is trivial.

As an example, I chose a protein analyzed by my friend Dana Boyd and colleagues, E. coli malG (Boyd 1993 PMID 8419303). They use alkaline phosphatase fusions to check the predictions of the topology model, as shown in this figure from the paper.

The idea with the fusions is that active alkaline phosphatase requires disulfide bonds, which only form in the periplasm (outside the cytoplasm). If we attach (fuse) alkaline phosphatase coding sequence at the points indicated by the arrows, we predict the fusions made at points predicted to be external to the cytoplasm will have high activity, while the internal ones will have low activity. As you can see, the results fit the model.

My plot using matplotlib is shown above.

import string
import numpy as np
import matplotlib.pyplot as plt

D = {'Ala':'A', 'Arg':'R', 'Asn':'N', 'Asp':'D',
     'Cys':'C', 'Gln':'Q', 'Glu':'E', 'Gly':'G',
     'His':'H', 'Ile':'I', 'Leu':'L', 'Lys':'K',
     'Met':'M', 'Phe':'F', 'Pro':'P', 'Ser':'S',
     'Thr':'T', 'Trp':'W', 'Tyr':'Y', 'Val':'V'}
     
fn = 'hydro.data.txt'
FH = open(fn,'r')
data = FH.read()
FH.close()
L = data.strip().split('\n')

HD = dict()
for e in L:
    aa, rest = e.split(':')
    rest = rest.strip()
    n = rest[rest.index('.')-1:]
    n = float(n)
    if '-' in rest:  n *= -1
    HD[D[aa]] = n
    
for i,k in enumerate(sorted(HD.keys())):
    if i and not i % 4:  print 
    print k, str(HD[k]).rjust(5),
print
#=============================================
fn = 'malG.seq.txt'
FH = open(fn,'r')
data = FH.read()
FH.close()
L = [c for c in data if c in string.lowercase]
prot = ''.join(L).upper()
#print prot[:50]
#=============================================
window = 15
delta = 2
rL = list()
for i in range(0,len(prot)-window+1,delta):
    pep = prot[i:i+window]
    L = [HD[aa] for aa in pep]
    rL.append((i, pep, sum(L)))

for e in rL:
    if e[2] > 20:  print e
X = np.arange(len(rL)) * delta
Y = [e[2] for e in rL]

plt.scatter(X,Y,s=40,color='r',zorder=2)
plt.plot(X,Y,'--',lw=1.5,color='k',zorder=1)
plt.savefig('example.png')