The last part of Ch. 2 talks about k-words with k > 2, e.g. codons (k = 3). About 20 years ago people first noticed that the codons used by highly expressed genes seemed to be a subset of the possible ones. I'll discuss why this should be so in a later post, and I'll also postpone discussion of a test of the hypothesis.
But that is where we are headed, using the genome sequence and array data on expression levels. The first thing is to get the sequences of all the E. coli genes. Rather than look around for this resource or use BioPython, I decided to do it by hand (quickly, I thought). The difficulty is that this kind of parsing code is always finicky to write. I went to the page listing bacterial genomes at NCBI and grabbed the E. coli genome record (NC_000913) in GenBank text format, making sure to include the DNA sequence. Then I just ran the script below, which parses all the CDS entries, obtains the gene name and sequence for each, and writes the result to disk. I filter out two complex genes (multiple ORFs) and throw away 85 entries which seem to be mis-annotated. The result contains 4157 gene sequences, which I save in a text file.
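To make the steps concrete, here is a rough, stripped-down sketch of what a hand-written parser of this kind has to do. It is only an illustration and not the script I actually ran. The function names (`read_origin`, `iter_cds`, `extract`, `cds_sequences`) are made up for the example, and so is the plausibility filter (length a multiple of 3, ending in a stop codon), which is just one guess at how a mis-annotated entry might be caught. It also assumes the usual GenBank flat-file layout, with feature keys indented by five spaces, and it simply skips any CDS whose location is not a plain `a..b` or `complement(a..b)`.

```python
import re

STOPS = {"taa", "tag", "tga"}
COMPLEMENT = str.maketrans("acgt", "tgca")


def reverse_complement(seq):
    """Reverse complement of a lowercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def read_origin(text):
    """Return the DNA in the ORIGIN block as one lowercase string."""
    body = text.split("\nORIGIN", 1)[1].split("//", 1)[0]
    # The block holds only position numbers, spaces and bases.
    return "".join(re.findall(r"[a-z]+", body.lower()))


def iter_cds(text):
    """Yield (gene, location) for every CDS in the FEATURES table."""
    table = text.split("\nFEATURES", 1)[1].split("\nORIGIN", 1)[0]
    key = loc = gene = None
    in_location = False
    for line in table.splitlines()[1:]:  # skip rest of the header line
        start = re.match(r" {5}(\S+)\s+(\S+)", line)
        if start:  # a new feature begins here
            if key == "CDS":
                yield gene, loc
            key, loc, gene = start.group(1), start.group(2), None
            in_location = True
            continue
        item = line.strip()
        if item.startswith("/"):  # a qualifier such as /gene="thrL"
            in_location = False
            m = re.match(r'/gene="([^"]+)"', item)
            if m and gene is None:
                gene = m.group(1)
        elif in_location:  # a long location wrapped onto the next line
            loc += item
    if key == "CDS":
        yield gene, loc


def extract(genome, loc):
    """Sequence for a simple location, or None for anything complex."""
    m = re.fullmatch(r"(complement\()?(\d+)\.\.(\d+)(\))?", loc)
    if not m or bool(m.group(1)) != bool(m.group(4)):
        return None  # join(...), partial ends, etc. are skipped
    seq = genome[int(m.group(2)) - 1:int(m.group(3))]  # 1-based, inclusive
    return reverse_complement(seq) if m.group(1) else seq


def cds_sequences(text):
    """Map gene name -> coding sequence for the plausible CDS entries."""
    genome = read_origin(text)
    result = {}
    for gene, loc in iter_cds(text):
        seq = extract(genome, loc)
        if gene is None or seq is None:
            continue
        # Assumed sanity check, not necessarily the one used in the post.
        if len(seq) % 3 or seq[-3:] not in STOPS:
            continue
        result[gene] = seq  # a repeated gene name overwrites the earlier one
    return result
```

With the record saved locally (say as `NC_000913.gb`, a hypothetical filename), something like `genes = cds_sequences(open("NC_000913.gb").read())` would give a dictionary that can then be written out one gene per line. The counts it produces need not match the ones quoted above, since the filtering rules here are guesses.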
# file contains NC_000913 as GenBank format