Python for Bioinformatics: Parsing image links in Blogger XML

Saturday, October 31, 2009

Parsing image links in Blogger XML

I wrote a script that I explained briefly in a previous post (download here: bloggerScript.py), which uses ElementTree in Python to parse the content of a Blogger XML archive for image links. If you're editing a post, or just viewing the source of a page that contains an image in your browser, you'll see something like:

<a onblur="try ...

The "a" element's name is short for "anchor", that is, a hyperlink. It has two "attributes" (I assume they are named the same in HTML as XML): some JavaScript code and href="http..." (the link to the full-res image), as well as child elements including <img... with its own attribute src="http...". Note that when displayed as content text, this anchor element has the < symbol rather than the escape code &lt;, whereas in the source XML it is in the escaped form. So the code for parsing image links is:
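To make the escaping point concrete, here is a minimal sketch of how ElementTree hands back the unescaped form. The one-element XML string and the example.com URL are made up for illustration and are not taken from a real Blogger archive.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: in the XML source, the HTML of the post is escaped.
xml_source = (
    "<content type='html'>"
    "&lt;a href=&quot;http://example.com/full.png&quot;&gt;pic&lt;/a&gt;"
    "</content>"
)

content = ET.fromstring(xml_source)

# The parser converts &lt;, &gt; and &quot; back to <, > and " in .text,
# so string searches for '<a' work on the parsed content.
print(content.text)
```

This is why the function can search for '<a' directly: by the time the content reaches it, the parser has already undone the escaping.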

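def parseImageLinks(s):
    # First pass: collect the text of every anchor that begins '<a onblur'.
    links = list()
    i = s.find('<a')
    while i != -1:
        # Search for the end-tag starting after i, not from the start of s.
        j = s.find('/a>',i+1)
        link = s[i+1:j]
        if link.startswith('a onblur'):
            links.append(link)
        i = s.find('<a',i+1)
    # Second pass: pull the href value out of each anchor's start-tag.
    retL = list()
    for link in links:
        i = link.find('href')
        j = link.find('>')
        # i+6 skips 'href="'; j-1 drops the closing quote before '>'
        # (this assumes href is the last attribute in the start-tag).
        retL.append(link[i+6:j-1])
    return retL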

We use the string method "find" to get the indexes flanking the substring of interest. A subtle bug that I had in a previous version was that I failed to specify i+1 in the line:

j = s.find('/a>',i+1)

The result was code that appeared to work, but found only the first image if there was more than one. This happens because, without a start argument, find searches from the beginning of the string. On the second time through the loop, i was correctly set to the start of the second element, but j was still the index of the first end-tag, and with j < i, we get an empty string with s[i+1:j].

>>> s = 'abcde'
>>> s[3:2]
''
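Here is a small sketch that reproduces the bug side by side with the fix. The helper extract_anchors and its restart_from_i flag are hypothetical, written only for this comparison, and the sample string is a toy rather than real Blogger markup.

```python
def extract_anchors(s, restart_from_i):
    """Collect the text between each '<a' and the '/a>' end-tag found for it."""
    found = []
    i = s.find('<a')
    while i != -1:
        if restart_from_i:
            # Fixed version: look for the end-tag only after position i.
            j = s.find('/a>', i + 1)
        else:
            # Buggy version: always finds the first end-tag in the string.
            j = s.find('/a>')
        found.append(s[i + 1:j])
        i = s.find('<a', i + 1)
    return found

sample = '<a href="x">one</a> and <a href="y">two</a>'
fixed = extract_anchors(sample, True)
buggy = extract_anchors(sample, False)
print(fixed)
print(buggy)
```

In the buggy run the second entry is an empty string, so any later check such as startswith('a onblur') quietly fails for every anchor after the first.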

The second part of the function finds the href for each link and saves it in a list to return. I end up with a collection containing the title of each post that has at least one image, along with all of its image links. I just saved them in a text file.
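The whole flow might look roughly like the sketch below. It is not the code from bloggerScript.py. It assumes the archive is an Atom feed whose entry elements hold title and content children, it uses a made-up two-post sample, and it swaps in a short regular expression (image_links) as a simplified stand-in for parseImageLinks. The names collect and write_report are also hypothetical.

```python
import io
import re
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"  # assumption: the archive is an Atom feed

# Hypothetical two-post archive; a real export carries many more elements.
SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <title type="text">Post with image</title>
    <content type="html">&lt;a onblur="try {}" href="http://example.com/a.png"&gt;&lt;img src="http://example.com/a_s.png"/&gt;&lt;/a&gt;</content>
  </entry>
  <entry>
    <title type="text">Text only</title>
    <content type="html">No pictures here.</content>
  </entry>
</feed>"""

def image_links(html):
    # Simplified stand-in for parseImageLinks: the href of each '<a onblur' anchor.
    return re.findall(r'<a onblur[^>]*?href="([^"]*)"', html)

def collect(xml_text):
    # Map each post title to its image links, skipping posts without images.
    root = ET.fromstring(xml_text)
    result = {}
    for entry in root.iter(ATOM + "entry"):
        title = entry.findtext(ATOM + "title", default="")
        content = entry.findtext(ATOM + "content", default="")
        links = image_links(content)
        if links:
            result[title] = links
    return result

def write_report(posts, out):
    # One title per line, followed by its links indented underneath.
    for title, links in posts.items():
        out.write(title + "\n")
        for link in links:
            out.write("    " + link + "\n")

posts = collect(SAMPLE)
buffer = io.StringIO()  # stands in for a text file opened for writing
write_report(posts, buffer)
report = buffer.getvalue()
print(report)
```

To write a real file, pass an open file object instead of the StringIO buffer, for example inside a with open(..., "w") block.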