Last time I talked about using the Google Data API to archive my blog posts. Of course, there is a download link on the blog under the Settings tab, but I didn't like the look of the XML I got from it. However, I realized that (naturally enough) the API gives me exactly the same data. So I stopped with Google Data and started trying to parse what I have. I found Mark Pilgrim's book very useful here, not only because it's so clearly written, but also because it's directly on point.
ElementTree is used to parse the XML. According to the Atom specification, the root element is a "feed", which is qualified with its "namespace." This is its "tag":
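A minimal sketch of getting at that tag with ElementTree (the tiny feed text below is my own stand-in for the real export file, which would be parsed from disk instead):

```python
import xml.etree.ElementTree as ET

# A tiny stand-in for the real Blogger export; the actual file is far larger.
sample = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>My Blog</title>
</feed>"""

root = ET.fromstring(sample)
# ElementTree folds the namespace into the tag, in {namespace}name form:
print(root.tag)  # {http://www.w3.org/2005/Atom}feed
```

Every qualified tag in the file carries that same `{http://www.w3.org/2005/Atom}` prefix, which is why it's convenient to keep the namespace string in a variable.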
The child elements of the feed are varied. These are their tags (minus the namespace):
• link (x 4)
• entry (many)
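One way to tally those child tags while stripping the namespace prefix; again this uses a small invented feed rather than the real export, so the counts differ from the list above:

```python
import xml.etree.ElementTree as ET
from collections import Counter

NS = "{http://www.w3.org/2005/Atom}"

# Stand-in for the export; the real feed has four <link> children
# and many <entry> children.
sample = """<feed xmlns="http://www.w3.org/2005/Atom">
  <id>tag:blogger.com,1999:blog-123</id>
  <title>My Blog</title>
  <link rel="alternate"/>
  <link rel="self"/>
  <entry><id>tag:blogger.com,1999:blog-123.post-456</id></entry>
</feed>"""

feed = ET.fromstring(sample)
# Iterating over an element yields its direct children.
tags = Counter(child.tag.replace(NS, "") for child in feed)
print(tags)
```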
You might think that once you got down to an entry it would be one of the blog posts. But nope, it's still metadata... Each of these early entries has a child (id) whose text value is a long prefix followed by BLOG_PUBLISHING_MODE or BLOG_NAME and so on.
The first authentic post is entry number 58. I distinguish the real entries from the metadata by testing whether the last character of the id is a digit, as in:
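A minimal sketch of that digit test; the sample ids below are my own, modeled on Blogger's id format, not copied from the actual export:

```python
import xml.etree.ElementTree as ET

NS = "{http://www.w3.org/2005/Atom}"

# One settings entry and one real post (ids are illustrative).
sample = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry><id>tag:blogger.com,1999:blog-123.settings.BLOG_NAME</id></entry>
  <entry><id>tag:blogger.com,1999:blog-123.post-7214</id></entry>
</feed>"""

feed = ET.fromstring(sample)
posts = [e for e in feed.findall(NS + "entry")
         if e.find(NS + "id").text[-1].isdigit()]
print(len(posts))  # 1 — only the real post's id ends in a digit
```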
An item, whether it's an entry or metadata or a child of an entry, may have attributes, stored in a Python dictionary. It may also have a text value, or not. I wrote a script to look through all this stuff (bloggerScript.py).
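A small sketch of both pieces, the attribute dictionary and the optional text value; the `<link>` and `<title>` elements here are invented examples:

```python
import xml.etree.ElementTree as ET

sample = """<feed xmlns="http://www.w3.org/2005/Atom">
  <link rel="alternate" type="text/html" href="http://example.blogspot.com/"/>
  <title>My Blog</title>
</feed>"""

feed = ET.fromstring(sample)
link, title = list(feed)

print(link.attrib)   # attributes come back as a plain dict
print(link.text)     # None — an empty <link/> has no text value
print(title.text)    # My Blog
```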
This post is long enough that I think I'll quit here and talk about parsing out the URL for each of my images another time. Here is the output for one element:
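bloggerScript.py itself isn't reproduced here, but a sketch of dumping those three pieces of information for one element might look like this; the `describe` helper and the sample `<link>` are my own, not taken from the script:

```python
import xml.etree.ElementTree as ET

def describe(elem):
    # The three things an element can carry: its tag, its attribute
    # dictionary, and its (possibly absent) text value.
    return {"tag": elem.tag, "attrib": elem.attrib, "text": elem.text}

# An invented self link, shaped like the ones in a Blogger feed.
sample = """<link xmlns="http://www.w3.org/2005/Atom" rel="self"
  type="application/atom+xml"
  href="http://example.blogspot.com/feeds/posts/default"/>"""

info = describe(ET.fromstring(sample))
for key, value in info.items():
    print(key, ":", value)
```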