Thursday, March 3, 2011

Handling large sequence sets (5)

In this post I want to show some comparisons for HITS in genes before and after selection in the murine lung. We're continuing in a series, first post here.

Before we start, I'm sorry if anyone was confused by or spammed by my multiple additions to the third post (here). I had some formatting errors in the first version, and then had to rewrite some "throwaway" filter scripts that I had thrown away. Also, re-running the filters, I realized that the second one had a stupid approach, so I rewrote it completely. I don't think anyone gets this as an RSS feed, but if you did, I guess you stopped that.

I think we're OK now, except for one other error that I made in the previous analysis, which was that I didn't correct the insertion position for the direction of the read. If the insertion looks like this:

GATC[ transposon ]TCGA

Reading to the left we would identify the insertion as nt 4, whereas going to the right it would be nt 5. We want these to be identical. So..for ccw reads we'll add 1 to the insertion position. Of course, if the alignment isn't perfect at the beginning we'll still be off, but I'm not going to worry about that now. The scripts are all on on Dropbox (here), to be run in order:

To get ready for today, I went back and processed all three of the data files (SD1.txt, SD2.txt and SD3.txt). After BLAST analysis and filtering, we have:

> ls -al small*.txt
-rw-r--r-- 1 telliott_admin staff 11605337 Mar 3 10:41 small1.txt
-rw-r--r-- 1 telliott_admin staff 11531846 Mar 3 10:43 small2.txt
-rw-r--r-- 1 telliott_admin staff 5288194 Mar 3 10:43 small3.txt

SD1 and SD2 are independent reads of the library before selection.
SD3 is after selection.

After running the scripts, I have the results in library.genehits.txt and lung.genehits.txt, and I put those up in the scripts folder, so you can just start at this point if you want to follow along. There is also a file with the gene assignments (genes.txt). If you want to run the whole analysis yourself, you'll have to make substitutions for filenames within the scripts and for invocation on the command line. Drop me a line if you need help.

Finally, I wrote an ugly script (also in the folder) that plots a selected gene using matplotlib. HITS from the pre-selection library are in blue, and after selection in the lung in red. The position of the gene of interest is shown by a blue rectangle. There are many fewer sequences in the red sample, so the counts were multiplied by a scaling factor (4) to make them comparable to the blue values. At the top of the post is the rfaF gene, which is required for growth or survival in the lung.



and a control, hemH

I have one more goal for this series, and that is to generate Fig 2 from the paper. (PMID 19805314).