I wrote a script that I explained briefly in a previous post (download here: bloggerScript.py), which uses ElementTree in Python to parse content from a Blogger XML archive for image links. If you're editing a post, or just viewing the source of a page that contains an image in your browser, you'll see something like:
<a onblur="try ...
The "a" element's name is short for "anchor", that is, a hyperlink. It has two attributes (I assume they are named the same in HTML as in XML): some JavaScript code and href="http..." (the link to the full-res image). It also has child elements, including <img... with its own attribute src="http...". Note that when displayed as content text, this anchor element has the < symbol rather than the escape code &lt;, whereas in the source XML it appears in the escaped form. The code for parsing image links works as follows.
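Here is a minimal sketch of this kind of parsing, not necessarily identical to what is in bloggerScript.py. It assumes the post content has already been pulled out of the archive as an unescaped string s. The function name extract_image_links and the example.com URLs are made up for illustration.

```python
def extract_image_links(s):
    """Collect the href of every <a ...>...</a> element in s that wraps an <img>."""
    links = []
    i = s.find('<a ')
    while i != -1:
        # Look for the end tag *after* position i, not from the start of s.
        j = s.find('/a>', i + 1)
        if j == -1:
            break
        anchor = s[i + 1:j]          # text between the flanking indexes
        if '<img' in anchor:
            k = anchor.find('href="')
            if k != -1:
                k += len('href="')
                end = anchor.find('"', k)
                if end != -1:
                    links.append(anchor[k:end])
        # Move on to the next anchor element, if any.
        i = s.find('<a ', j)
    return links


# Hypothetical post content with two image links and one ordinary link.
sample = (
    '<a onblur="try {}" href="http://example.com/full1.jpg">'
    '<img src="http://example.com/small1.jpg" /></a> some text '
    '<a href="http://example.com/page.html">a plain link</a> '
    '<a href="http://example.com/full2.jpg">'
    '<img src="http://example.com/small2.jpg" /></a>'
)
print(extract_image_links(sample))
```

Plain string searching like this is fragile if the markup varies (for example, single-quoted attributes), but it is enough for the regular markup that Blogger generates around images.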
We use the string method "find" to get the indexes flanking the substring of interest. A subtle bug that I had in a previous version was that I failed to specify i+1 in the line:
j = s.find('/a>',i+1)
The result was code that apparently worked, but found only the first image if there was more than one. This happens because, on the second time through the loop, i was correctly set to the start of the second element, but j was still the index of the first end-tag. With j < i, the slice s[i+1:j] is an empty string.
The second part of the function finds the href for each link and saves it in a list to return. I end up with a data structure containing the title of each post that has at least one image, and all the image links. I just saved them in a text file.
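To make the indexing bug concrete, here is a small sketch that runs the same loop with and without the start argument to find. The helper anchor_bodies and the sample string are hypothetical, written only to show the effect.

```python
def anchor_bodies(s, use_start):
    """Return the slice s[i+1:j] for each anchor element found in s."""
    bodies = []
    i = s.find('<a ')
    while i != -1:
        if use_start:
            j = s.find('/a>', i + 1)   # fixed: search only after i
        else:
            j = s.find('/a>')          # buggy: always finds the first end tag
        if j == -1:
            break
        bodies.append(s[i + 1:j])
        i = s.find('<a ', i + 1)
    return bodies


s = ('<a href="one.jpg"><img src="1.jpg"/></a> '
     '<a href="two.jpg"><img src="2.jpg"/></a>')

buggy = anchor_bodies(s, use_start=False)
fixed = anchor_bodies(s, use_start=True)
print(repr(buggy[1]))   # empty, because j < i on the second pass
print(repr(fixed[1]))   # contains the second href
```

No exception is raised in the buggy case, because a slice whose end index precedes its start index simply yields an empty string. That is why the earlier version appeared to work.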