Python for Bioinformatics: Google Data API

Wednesday, October 28, 2009

Google Data API - baby steps

The Google Data API is designed to make it easy to get data from Google and use it in your application. I did not find it easy. In fact, I really haven't figured it out yet.

What I want to do is archive my web log, eventually in an automated way. I'd like to have HTML for the individual posts, preferably human-readable HTML similar to what I originally typed in the editor window. The Atom feed would be perfect since the format is compact. Then I could parse it somehow and go grab the original images from Blogger at full resolution. This is important because I haven't kept copies of those images. They're usually just screenshots that I typically discard.
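As a rough sketch of the "parse it somehow" step: assuming a post body is ordinary HTML with img tags, the image URLs can be pulled out with the HTML parser in Python's standard library. The class name, the function name, and the sample markup (including its example.com URLs) are made up for illustration, and this is Python 3 syntax.

```python
from html.parser import HTMLParser


class ImageCollector(HTMLParser):
    """Collect the src attribute of every <img> tag that is fed in."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == 'img':
            src = dict(attrs).get('src')
            if src:
                self.sources.append(src)


def image_urls(post_html):
    """Return the image URLs found in one post's HTML, in document order."""
    collector = ImageCollector()
    collector.feed(post_html)
    collector.close()
    return collector.sources


# hypothetical post body, not taken from the blog
sample = ('<p>Text</p><a href="http://example.com/full/shot.png">'
          '<img src="http://example.com/small/shot.png" /></a>')
print(image_urls(sample))  # ['http://example.com/small/shot.png']
```

Downloading each URL, and working out which one is the full-resolution copy, is left out here.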

Perhaps I would try to organize everything into directories, say individual posts grouped by month or by topic, and maybe change the image links so they point to the local copies of the images. Ideas:

• Export from Blogger: the XML isn't displayed properly by Safari

• the Atom feed looks nice, but I haven't figured out how to get all the posts at once. This looks like it works, but then you see it's been broken into two pages:

http://telliott99.blogspot.com/feeds/posts/default?max-results=500

• The standard URL does work

http://telliott99.blogspot.com/search?max-results=500

I can save either of these from Safari as a Web Archive (though that has changed: it is actually in a special Apple format). I can also grab the source and save it as HTML, but it's pretty ugly HTML.
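One possible way around the two-page problem with the Atom feed: GData feeds take a start-index parameter (counting from 1) in addition to max-results, so the whole feed can be requested as a series of pages. Here is a minimal sketch that only builds the URLs. The function name, the page size of 25, and the total of 60 posts are assumptions for illustration, not values from my blog. This is Python 3 syntax.

```python
from urllib.parse import urlencode

FEED = 'http://telliott99.blogspot.com/feeds/posts/default'


def page_urls(total, page_size, feed=FEED):
    """Return one feed URL per page; start-index is 1-based."""
    urls = []
    for start in range(1, total + 1, page_size):
        query = urlencode([('start-index', start),
                           ('max-results', page_size)])
        urls.append(feed + '?' + query)
    return urls


# hypothetical numbers: 60 posts fetched 25 at a time gives 3 pages
for url in page_urls(60, 25):
    print(url)
```

The total has to come from somewhere, for example from the number of entries reported by a first request, so in practice this would be a loop that stops when a page comes back empty.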

Google Data API

This API is designed to make it easy to get data from Google and use it in your application. I did not find it easy but I think I got it working a little bit---baby steps. I would love to find tutorials for this stuff.

I grabbed the Python client library for the Blogger data API and installed as usual. I ran:

./tests/run_data_tests.py

and everything looked fine. I didn't run the sample BloggerExample.py because the version they have requires a log-in and I didn't want to chance screwing up the blog. Digging into the source to try to change that, I just got lost. Eventually I found an introductory video on YouTube, but it doesn't go far enough. What I learned from the video is how to do this:

import gdata.blogger.service

# an unauthenticated client is enough for reading a public feed
client = gdata.blogger.service.BloggerService()

def test1():
    base = 'http://telliott99.blogspot.com/'
    base += 'feeds/posts/default?'
    url = base + 'max-results=5'
    #url = base + 'max-results=500'
    feed = client.Get(url)
    #print '# entries', len(feed.entry)

    # show title, id, labels and permalink for the first two posts
    for entry in feed.entry[:2]:
        print entry.title.text
        print entry.id.text
        for c in entry.category:
            print c.term
        # it's the last link that matters
        li = entry.link[-1]
        print li.title
        print li.href

test1()

After a deprecation warning for use of the sha module, I get:

Hidden Markov Models (1)
tag:blogger.com,1999:blog-8953369623923024563.post-2952608856988019032
bioinformatics
Hidden Markov Models (1)
http://telliott99.blogspot.com/2009/10/hidden-markov-models-1.html

The way of the program
tag:blogger.com,1999:blog-8953369623923024563.post-5554856078488748140
thinking aloud
The way of the program
http://telliott99.blogspot.com/2009/10/way-of-program.html

I suppose the numbers are IDs for the blog and for the individual posts.
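If that guess is right, the two numbers can be pulled apart with a regular expression. The pattern below is inferred only from the two ids printed above, so treat it as an assumption and not as a documented format. The function name is made up for illustration, and this is Python 3 syntax.

```python
import re

# assumed layout: tag:blogger.com,1999:blog-<blog id>.post-<post id>
ID_PATTERN = re.compile(r'^tag:blogger\.com,1999:blog-(\d+)\.post-(\d+)$')


def split_entry_id(entry_id):
    """Return (blog_id, post_id) as strings, or None if the id doesn't fit."""
    m = ID_PATTERN.match(entry_id)
    if m is None:
        return None
    return m.group(1), m.group(2)


eid = 'tag:blogger.com,1999:blog-8953369623923024563.post-2952608856988019032'
print(split_entry_id(eid))
# ('8953369623923024563', '2952608856988019032')
```

Both entries above give the same first number, which fits the idea that it identifies the blog.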

So now we need to go further. After looking more carefully (and patiently) at the instructions, I see that the URL I'm supposed to request is this:

http://www.blogger.com/feeds/profileID/blogs

where profileID comes from the URL shown when I view my profile.

def test2():
    base = 'http://www.blogger.com/feeds/'
    profileID = '01151844786921735933'
    url = base + profileID + '/blogs'
    feed = client.Get(url)
    print len(feed.entry)
    
test2()

This just prints the number of blogs I have!

By reading more in the instructions, I finally got some real data:

def test3():
    base = 'http://www.blogger.com/feeds/'
    blogID = '8953369623923024563'
    url = base + blogID
    url += '/posts/default?max-results=500'
    feed = client.Get(url)
    print '# entries', len(feed.entry)
    e = feed.entry[0]
    print e.title.text
    print e.id.text
    #print e.content.ToString()
    print dir(e)
 
test3()

Output from the dir call includes: 'author', 'category', 'content', 'contributor', 'control', 'extension_attributes', 'extension_elements', 'id', 'link', 'published', 'rights', 'source', 'summary', 'text', 'title', 'updated'.
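Most of those names match the child elements of an Atom entry, so the same fields can presumably be read straight from the XML with the standard library, without the client library. A minimal sketch, assuming the feed uses the standard Atom namespace. The sample document is hand-written to resemble the output above (it is not real server output, and the example.com link is a placeholder), and the function name is made up. It picks the link marked rel="alternate" as the permalink instead of relying on the link's position. This is Python 3 syntax.

```python
import xml.etree.ElementTree as ET

ATOM = '{http://www.w3.org/2005/Atom}'

# hand-written sample, modeled on the output shown earlier
SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <id>tag:blogger.com,1999:blog-8953369623923024563.post-2952608856988019032</id>
    <title type="text">Hidden Markov Models (1)</title>
    <category term="bioinformatics"/>
    <link rel="self" href="http://example.com/placeholder"/>
    <link rel="alternate" type="text/html"
          href="http://telliott99.blogspot.com/2009/10/hidden-markov-models-1.html"/>
  </entry>
</feed>"""


def summarize(xml_text):
    """Return a list of dicts with title, id, tags and permalink per entry."""
    root = ET.fromstring(xml_text)
    posts = []
    for entry in root.iter(ATOM + 'entry'):
        permalink = None
        for link in entry.findall(ATOM + 'link'):
            if link.get('rel') == 'alternate':
                permalink = link.get('href')
        posts.append({
            'title': entry.findtext(ATOM + 'title'),
            'id': entry.findtext(ATOM + 'id'),
            'tags': [c.get('term') for c in entry.findall(ATOM + 'category')],
            'permalink': permalink,
        })
    return posts


for post in summarize(SAMPLE):
    print(post['title'], post['tags'], post['permalink'])
```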

What I need to do:

• Figure out the URL to send to request a particular entry
• Figure out how to work with the XML data format I'm getting
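For the first item, my best guess from the pattern of the feed URLs above is that a single entry lives at the posts feed URL followed by the post id. This is an unverified assumption, and the helper name is made up. The ids are the ones printed earlier. This is Python 3 syntax.

```python
def entry_url(blog_id, post_id):
    """Build the (assumed) feed URL for one post."""
    base = 'http://www.blogger.com/feeds/'
    return base + blog_id + '/posts/default/' + post_id


url = entry_url('8953369623923024563', '2952608856988019032')
print(url)
# http://www.blogger.com/feeds/8953369623923024563/posts/default/2952608856988019032
```

If the guess holds, the result could be passed to client.Get in the same way as the feed URLs.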