Python for Bioinformatics: More about XML

Friday, October 30, 2009

More about XML

This post is in the category of "Google helps Tom get organized." It is so elementary that if you know anything you'd be much better off reading Mark's book Dive Into Python. But, if you know just enough to be dangerous, and are trying to figure out XML for Blogger, you've come to the right place.

XML consists of layers of elements, arranged with a single element as the root, and then various child elements below that, similar to HTML. How is it different? I'm not really sure. I found this on the web:

HTML is about displaying information, while XML is about carrying information

Does that help? No, me neither...

[UPDATE: XML uses tags <stuff>...</stuff> to organize data. The tags have to be properly "nested": the end-tag for an element can't appear until the end-tags of all of its child elements have appeared, or the document wouldn't be well-formed XML. And that's why all of the formatting tags for the HTML that you see in the content of a blog post, when it is stored as XML, have been changed to escape sequences, like < converted to &lt; and > converted to &gt;.]
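Here is a minimal sketch of both points, assuming Python 3 and the standard-library ElementTree parser. The tiny documents are made up for illustration, and escape() from xml.sax.saxutils is just one convenient way to do the escaping, not necessarily what Blogger uses.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Properly nested: every child closes before its parent does.
good = "<feed><entry><title>Hello</title></entry></feed>"
root = ET.fromstring(good)

# Badly nested: </entry> appears before </title>, so the parser rejects it.
bad = "<feed><entry><title>Hello</entry></title></feed>"
try:
    ET.fromstring(bad)
    well_formed = True
except ET.ParseError:
    well_formed = False

# HTML markup stored as element text must be escaped so that it
# is not mistaken for XML tags.
html = "<b>bold</b>"
escaped = escape(html)  # '&lt;b&gt;bold&lt;/b&gt;'
content = ET.fromstring("<content>" + escaped + "</content>")

print(well_formed)    # False
print(escaped)        # &lt;b&gt;bold&lt;/b&gt;
# The parser undoes the escaping when it hands back the text,
# and the element has no child elements at all.
print(content.text)   # <b>bold</b>
print(len(content))   # 0
```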

XML may be formatted (indented) to show the structure explicitly, as in PubMed XML, or not, as in Google Data XML. A formatted version of the data for a blog would look like this:


<feed>
  <child1></child1>
  <child2></child2>
</feed>

Here there is a root element named <feed> with various child elements. The document closes with the end-tag (</feed>) for the root element. In the case of Blogger, the child elements are (at least in my case): id, updated, title, link (4 of them), author, generator, and then a bunch of <entry> elements. The first 40 or so entry elements are metadata rather than posts. It would be nice if they were named differently, but that's how it is.
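To see that layout from Python, a rough sketch like the following lists the children of the root. The feed here is a made-up, heavily simplified stand-in (no namespace, only a few children), not real Blogger output.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# A made-up, simplified stand-in for a Blogger feed (assumption: no namespace).
sample = """
<feed>
  <id>example-blog</id>
  <title>Example</title>
  <entry><title>a settings entry</title></entry>
  <entry><title>a real post</title></entry>
</feed>
""".strip()

root = ET.fromstring(sample)

# Iterating over an element yields its direct children in document order.
tags = [child.tag for child in root]
counts = Counter(tags)

print(root.tag)         # feed
print(tags)             # ['id', 'title', 'entry', 'entry']
print(counts["entry"])  # 2
```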

Any element can have (or not have) child elements, each marked with a start-tag and an end-tag, and those children may themselves be complex. Elements can also have attributes: one or more name/value pairs, written inside the start-tag:


<entry>
  <id>tag:blogger.com,1999:blog-8953369623923024563.post-2518535825745482316</id>
  <published>2009-09-08T05:34:00.000-04:00</published>
  <updated>2009-10-01T10:18:10.585-04:00</updated>
  <category 
    scheme='http://schemas.google.com/g/2005#kind' 
    term='http://schemas.google.com/blogger/2008/kind#post'/>

Here, category is a child of entry, and it has two attributes, scheme and term, which ElementTree makes available as a Python dictionary after parsing. If an element has no attributes, the dictionary is empty. Notice the subtle forward slash in "#post'/>". That makes category an empty-element (self-closing) tag, which is shorthand for a start-tag followed immediately by the full end-tag </category>.
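A small sketch of reading those attributes, assuming Python 3. The XML is a trimmed copy of the entry above, with a closing </entry> added so that it parses on its own. No namespace is declared in this fragment, so the plain tag names work.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the entry above; </entry> is added so the fragment parses.
xml_text = """<entry>
  <id>tag:blogger.com,1999:blog-8953369623923024563.post-2518535825745482316</id>
  <category
    scheme='http://schemas.google.com/g/2005#kind'
    term='http://schemas.google.com/blogger/2008/kind#post'/>
</entry>"""

entry = ET.fromstring(xml_text)
category = entry.find("category")

# attrib is an ordinary dictionary of attribute name -> value.
print(category.attrib["term"])    # http://schemas.google.com/blogger/2008/kind#post
print(category.get("scheme"))     # http://schemas.google.com/g/2005#kind

# An element without attributes has an empty dictionary.
print(entry.find("id").attrib)    # {}

# A self-closing element has no text content.
print(category.text)              # None
```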

Finally, an element may have a text value, called its content.

Using ElementTree

It helps a lot to know what you're looking for. If you don't, you can find out by perusing the XML. In the case of ElementTree, we traverse the tree structure by doing:


item.getchildren()

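which returns a list of the child elements (item is the variable name I've given to this particular element). Note that getchildren() was deprecated and then removed in Python 3.9; list(item), or simply looping over item, gives the same children.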
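A quick sketch of walking the children by iteration, which works on current Python versions. The one-line entry is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up entry with three children, just for illustration.
item = ET.fromstring(
    "<entry><id>1</id><title>Post</title><content>text</content></entry>"
)

# An element behaves like a sequence of its direct children.
children = list(item)
child_tags = [child.tag for child in item]

print(len(item))     # 3
print(child_tags)    # ['id', 'title', 'content']
```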

Attributes are found by looking in the attribute dictionary, obtained with

item.attrib

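Child elements can also be searched for by name. Of the following two methods, find() returns the first child element named title (or None if there isn't one), and findall() returns a list of all child elements named link. Both look only at the next level down; any sub-levels are opaque.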


item.find(t + 'title')
item.findall(t + 'link')

The variable t above is a string we need to add to the search text, and it's explained below. The content of an element is obtained like this:


item.text

Notice the difference between methods such as find(), which are called with the call operator (), and attributes such as attrib and text, which are not.
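Putting find(), findall() and text together, a minimal sketch might look like this. The entry is made up for illustration (the URLs and author name are placeholders), and t holds the Atom namespace string that the search text needs.

```python
import xml.etree.ElementTree as ET

# Namespace string that has to go in front of each tag name.
t = "{http://www.w3.org/2005/Atom}"

# Made-up entry; URLs and author name are placeholders.
sample = """<entry xmlns='http://www.w3.org/2005/Atom'>
  <title>First post</title>
  <link rel='self' href='http://example.com/a'/>
  <link rel='alternate' href='http://example.com/b'/>
  <author><name>Example Author</name></author>
</entry>"""

item = ET.fromstring(sample)

title = item.find(t + "title")      # first matching child, or None
links = item.findall(t + "link")    # list of all matching children

print(title.text)                   # First post
print(len(links))                   # 2
print(links[1].attrib["rel"])       # alternate

# name is a grandchild, so a search from item does not see it...
print(item.find(t + "name"))        # None
# ...but a search from its parent does.
print(item.find(t + "author").find(t + "name").text)   # Example Author
```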

One confusing thing about the Blogger data is that it is in the "Atom" format, whose elements all belong to the Atom namespace. ElementTree therefore reports the tag of an element, obtained with

item.tag

with the namespace in front of it, in this case:

{http://www.w3.org/2005/Atom}

The braces form doesn't appear as such in the XML, where the namespace is declared once (typically as an xmlns attribute on the root element) rather than repeated on every tag. So, for example, if you grab bloggerScript.py referenced in the previous post, and do:

print(root.tag)

The output is not 'feed' but

{http://www.w3.org/2005/Atom}feed

Furthermore, and most important, if you search as we did with "title" above, you must put the namespace in front of the tag name in the search string. So t has the value:

t = '{http://www.w3.org/2005/Atom}'
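Here is a small sketch of what goes wrong without the namespace and two ways to supply it, assuming Python 3. The one-line feed is made up for illustration. The prefix atom in the second approach is an arbitrary name I picked; it only has to match the key in the dictionary passed to find().

```python
import xml.etree.ElementTree as ET

t = "{http://www.w3.org/2005/Atom}"

# Made-up, minimal feed that declares the Atom namespace on the root.
sample = "<feed xmlns='http://www.w3.org/2005/Atom'><title>Example blog</title></feed>"
root = ET.fromstring(sample)

print(root.tag)                       # {http://www.w3.org/2005/Atom}feed

# Searching for the bare name finds nothing.
print(root.find("title"))             # None

# Approach 1: put the namespace string in front of the name.
print(root.find(t + "title").text)    # Example blog

# Approach 2: map a prefix of your choosing to the namespace.
ns = {"atom": "http://www.w3.org/2005/Atom"}
print(root.find("atom:title", ns).text)   # Example blog

# Stripping the namespace to get the plain tag name back.
local = root.tag[len(t):] if root.tag.startswith(t) else root.tag
print(local)                          # feed
```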

Is that clear? One more thing, within the content for a post obtained by:

c = item.find(t + 'content')

and then:

c.text

there can be found the links embedded in the post. There don't seem to be child elements for the content element: getchildren() returns an empty list for it, because the post's HTML is stored there as escaped text rather than as XML elements. The links have to be parsed by hand. But it's easy; more about that next time.
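In the meantime, here is one possible sketch of pulling the links out of c.text. It is not parsing by hand: it leans on HTMLParser from the Python 3 standard library. The LinkCollector class, the sample entry and its URLs are all made up for illustration.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: collects the href of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the start-tag.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


t = "{http://www.w3.org/2005/Atom}"

# Made-up entry; the HTML inside content is escaped, as in the real feed.
sample = """<entry xmlns='http://www.w3.org/2005/Atom'>
  <content type='html'>See &lt;a href="http://example.com/one"&gt;one&lt;/a&gt; and &lt;a href="http://example.com/two"&gt;two&lt;/a&gt;.</content>
</entry>"""

item = ET.fromstring(sample)
c = item.find(t + "content")

print(len(c))    # 0, the content element has no child elements

# c.text is plain HTML again, because the XML parser unescapes it.
collector = LinkCollector()
collector.feed(c.text)
collector.close()

print(collector.links)   # ['http://example.com/one', 'http://example.com/two']
```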