Continuing with the topic from the other day, processing of BLAST output, I was confused about usage of the blast parsers. As defined in
cogent.parse.blast
:and in
cogent.parse.blast_xml
:The first argument is named 'lines', but despite that (as suggested in comments), these functions are looking for a
file object
. That is, what we're going to do (after we do the BLAST, below) is this:L will then contain one item for each query sequence.
That item is a tuple with two elements
• a dictionary with meta-data
• a list of sub-lists
Each sub-list contains the data for an Hsp.
I'll do the test with a different db containing Achromobacter 16S rRNA seqs.
>AF411020.1
>EF672258.1
>EU373389.1
>DQ450530.1
>AJ278451.1
>AF411019.1
>AJ002809.1
You could grab these as described here. First, we have to format the db for BLAST.
This data is actually simple enough that you don't really need a parser. I will do this analysis from the interpreter, but I'm not showing the interpreter prompt '>>> ').
We parse as follows:
Data for the first hit looks like this:
Note that this format does not contain the aligned sequences.
Now, let's do it with XML-formatted ouptut:
Again, L contains one item for each query sequence. That item is a tuple with two elements
• a dictionary with meta-data
• a list of sub-lists
Each sub-list is the data for an Hsp, whose first element is a list of keys. This is a slightly different arrangement than for the first example.
And while the BLAST parsers have an input argument labeled data but want a file object, the Fasta parser has input arg labeled infile and wants a string! (and note: you must do readlines() rather than read()).