Saturday, October 31, 2009

Parsing image links in Blogger XML



I wrote a script, explained briefly in a previous post (download here: bloggerScript.py), that uses ElementTree in Python to parse the content of a Blogger XML archive for image links. If you're editing a post, or just viewing the source of a page that contains an image in your browser, you'll see something like:

<a onblur="try ...

The "a" element's name is short for "anchor", that is, a hyperlink. It has two attributes (the term is the same in HTML as in XML): onblur, which holds some JavaScript code, and href="http..." (the link to the full-res image). It also has child elements, including <img... with its own attribute src="http...". Note that in the content text as we get it after parsing (ElementTree unescapes it for us), this anchor element has the literal < symbol, whereas in the source XML it appears in the escaped form &lt;. So the code for parsing image links is:


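def parseImageLinks(s):
    # first pass: collect the text of each anchor that wraps an image
    links = list()
    i = s.find('<a')
    while i != -1:
        # look for the end tag only after the current start tag
        j = s.find('/a>',i+1)
        link = s[i+1:j]
        if link.startswith('a onblur'):
            links.append(link)
        i = s.find('<a',i+1)
    # second pass: pull the href value out of each anchor
    retL = list()
    for link in links:
        i = link.find('href')
        j = link.find('>')
        # i+6 skips 'href="'; j-1 drops the closing quote before '>'
        retL.append(link[i+6:j-1])
    return retL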


We use the string method "find" to get the indexes flanking the substring of interest. A subtle bug that I had in a previous version was that I failed to specify i+1 in the line:

j = s.find('/a>',i+1)

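The result was code that apparently worked, but found only the first image if there was more than one. This happens because find without a start argument always searches from the beginning of the string. On the second time through the loop, i was correctly set to the start of the second element, but j was still the position of the first end-tag, and with j < i, we get an empty string from s[i+1:j]. An empty string does not start with 'a onblur', so the second link was silently skipped.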


>>> s = 'abcde'
>>> s[3:2]
''
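
To see the effect on a whole string rather than a toy slice, here is a small sketch that runs the search loop both ways. The helper name find_anchors, its search_from_i flag, and the sample string with its example.com URLs are all made up for this illustration; they are not part of bloggerScript.py.

```python
def find_anchors(s, search_from_i=True):
    """Collect the text between each '<a' and an '/a>' end tag.

    With search_from_i=False the end-tag search always starts at index 0,
    which reproduces the bug described above.
    """
    found = []
    i = s.find('<a')
    while i != -1:
        if search_from_i:
            j = s.find('/a>', i + 1)
        else:
            j = s.find('/a>')  # always the position of the first end tag
        found.append(s[i + 1:j])
        i = s.find('<a', i + 1)
    return found


# Made-up sample with two image anchors (placeholder URLs).
sample = ('<a onblur="x" href="http://example.com/1.png"><img src="s1.png"></a>'
          ' text '
          '<a onblur="x" href="http://example.com/2.png"><img src="s2.png"></a>')

fixed = find_anchors(sample)
buggy = find_anchors(sample, search_from_i=False)

# Count the candidates that pass the 'a onblur' test in each version.
print(len([a for a in fixed if a.startswith('a onblur')]))  # 2
print(len([a for a in buggy if a.startswith('a onblur')]))  # 1
```

In the buggy run the second candidate is the empty string, so it fails the startswith test and is dropped without any error.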


The second part of the function finds the href for each link and saves it in a list to return. I end up with a data structure containing the title of each post that has at least one image, along with all of its image links. I just saved them in a text file.
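
For anyone who wants to try the whole round trip, here is a rough sketch of how the pieces might fit together. It is not the actual bloggerScript.py. It assumes the archive is an Atom feed with title and content elements in each entry, and it uses a tiny made-up feed in place of a real export. The names image_links and image_links.txt, the list-of-tuples layout, and the output format are also just assumptions for the example.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

ATOM = '{http://www.w3.org/2005/Atom}'  # assumed namespace of the export

# Tiny made-up stand-in for an archive; the markup in content is escaped.
archive = '''<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <title>Post with a picture</title>
    <content type="html">&lt;a onblur="x" href="http://example.com/full.png"&gt;&lt;img src="http://example.com/thumb.png"/&gt;&lt;/a&gt;</content>
  </entry>
  <entry>
    <title>Text only</title>
    <content type="html">No images here.</content>
  </entry>
</feed>'''


def image_links(s):
    """Compact restatement of parseImageLinks so the sketch runs alone."""
    links = []
    i = s.find('<a')
    while i != -1:
        j = s.find('/a>', i + 1)
        link = s[i + 1:j]
        if link.startswith('a onblur'):
            h = link.find('href')
            links.append(link[h + 6:link.find('>') - 1])
        i = s.find('<a', i + 1)
    return links


results = []  # (title, [links]) for each post with at least one image
for entry in ET.fromstring(archive).iter(ATOM + 'entry'):
    title = entry.findtext(ATOM + 'title')
    # ElementTree has already turned &lt; back into a literal '<' here.
    content = entry.findtext(ATOM + 'content') or ''
    links = image_links(content)
    if links:
        results.append((title, links))

# Save each title followed by its links, one per line (hypothetical layout).
out_path = os.path.join(tempfile.gettempdir(), 'image_links.txt')
with open(out_path, 'w') as fh:
    for title, links in results:
        fh.write(title + '\n')
        for link in links:
            fh.write('  ' + link + '\n')
```

The text-only post is left out of results because its list of links is empty.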