Sunday, May 11, 2008

Parsing Pubmed with ElementTree

Continuing with the theme of using stock Python to parse XML from NCBI, I wrote a script that parses Pubmed entries. It's pretty simple, although it took me a while to get started. Here is the first part of the parsing function:

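def parseArticle(article):
    # article is one PubmedArticle element; collect its fields in a dict
    D = dict()
    D['pmid'] = article.findtext(
        'MedlineCitation/PMID')

    # the abbreviated journal title lives under MedlineJournalInfo
    journalinfo = article.find(
        'MedlineCitation/MedlineJournalInfo')
    D['journal'] = journalinfo.findtext('MedlineTA')

    # findtext returns None when the path is absent (e.g. no Volume yet)
    a = article.find(
        'MedlineCitation/Article')
    D['volume'] = a.findtext(
        'Journal/JournalIssue/Volume')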
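
To make the difference between find and findtext concrete, here is a minimal sketch. The XML fragment is a made-up, stripped-down record that reuses the PMID and journal from the printout further down; it is not the real Pubmed entry. It shows that findtext returns None for a missing path (or a default, if one is given), while find returns None for a missing element, so chaining a call onto that result would fail.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed record: there is no Volume element.
SAMPLE = """
<PubmedArticle>
  <MedlineCitation>
    <PMID>18455735</PMID>
    <MedlineJournalInfo>
      <MedlineTA>J Mol Biol</MedlineTA>
    </MedlineJournalInfo>
    <Article>
      <Journal>
        <JournalIssue></JournalIssue>
      </Journal>
    </Article>
  </MedlineCitation>
</PubmedArticle>
"""

article = ET.fromstring(SAMPLE)

pmid = article.findtext('MedlineCitation/PMID')
a = article.find('MedlineCitation/Article')

# Missing path: findtext gives None unless a default is supplied.
volume = a.findtext('Journal/JournalIssue/Volume')
volume_or_blank = a.findtext('Journal/JournalIssue/Volume', default='')

# find returns None for a missing element, so test before using it.
missing = article.find('MedlineCitation/NoSuchElement')

print(pmid, volume, repr(volume_or_blank), missing)
```

This is why the printout can show a volume of None for a record that has not been assigned to an issue yet.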

There is one thing about it that's hackish. If I have a file with a single article, or with multiple articles that I've copied and pasted from Pubmed, there is no proper root element, and ElementTree chokes when I call ElementTree.parse(filename). I catch the exception and handle it by writing the data to a temporary file with a root element added around it. Then I feed the temporary file to ElementTree.
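
The workaround might look roughly like the sketch below. This is not the original script: the function name parse_with_root and the wrapper tag PubmedArticleSet are my assumptions, the exception class is ET.ParseError as in current Python (older versions raised an expat error instead), and it assumes the pasted fragments carry no XML declaration or DOCTYPE, since those cannot sit inside a wrapper element.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def parse_with_root(filename, root_tag='PubmedArticleSet'):
    """Return the root element, adding a wrapper root if parsing fails.

    Hypothetical helper; root_tag is an assumed name for the wrapper.
    """
    try:
        return ET.parse(filename).getroot()
    except ET.ParseError:
        pass  # probably several top-level elements; retry below

    with open(filename, encoding='utf-8') as f:
        data = f.read()

    # Write the fragments into a temporary file inside one root element.
    tmp = tempfile.NamedTemporaryFile(
        mode='w', suffix='.xml', delete=False, encoding='utf-8')
    try:
        with tmp:
            tmp.write('<%s>\n%s\n</%s>\n' % (root_tag, data, root_tag))
        return ET.parse(tmp.name).getroot()
    finally:
        os.remove(tmp.name)  # clean up the temporary file either way
```

With the root in hand, something like root.findall('PubmedArticle') gives the individual records to hand to parseArticle. Wrapping the string and calling ET.fromstring would avoid the temporary file altogether, but the version above stays close to the approach described in the text.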

I did a Pubmed search for my graduate advisor (E.P. Geiduschek) and my post-doctoral mentor (J.R. Roth), copied all 388 records to the clipboard, and then saved the XML from the clipboard to a file. The script handles this input. The first part of the printout is:

title Dissection of the Bacteriophage T4 Late Promoter Complex.
authorList Nechaev S, Geiduschek EP
journal J Mol Biol
year 2008
volume None
pages
pmid 18455735
abstract Activated transcription of the bacteriophage T4 la