Python for Bioinformatics: Pubmed speaks XML

Thursday, October 22, 2009

At times in the past, I've used Python for querying GenBank and PubMed. The most reliable way to do that involves XML. I don't have the link any more, but I remember reading something like this at NCBI: "we reserve the right to do whatever the heck we want with our HTML; your code will probably break and we don't care. You should use XML instead."

So, if I'm ever going to connect Instant Cocoa and bioinformatics, I have to learn how to parse XML. I posted previously (for example, here) about using Python and ElementTree for this. Now, I want to try it in Cocoa.

I wrote a class called "Reader." (This is a preliminary attempt, the result of a few hours of work.) Reader implements three of the methods declared in the NSXMLParserDelegate protocol:
parser:didStartElement:namespaceURI:qualifiedName:attributes:
parser:foundCharacters:
parser:didEndElement:namespaceURI:qualifiedName:

The idea is that when your class receives the first delegate message, the argument to didStartElement: is a string holding the tag of the element the parser is about to read. At this point you should initialize an empty NSMutableString to accumulate the strings that the second method will feed you. In my tests the abstract arrived as a single chunk of text, but there are no guarantees: the parser may deliver an element's text in several pieces, so append each piece rather than overwrite. When the entire element has been parsed and its end tag encountered, the third method is called.
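
Since I've done this in Python before, here is a rough Python analogue of the same three-callback pattern, using the standard library's xml.sax rather than Cocoa. It is only a sketch: the attribute names (current_key, current_value, results), the helper parse_record, and the sample record are made up for illustration, and this is not the Objective-C implementation of Reader.

```python
import xml.sax


class Reader(xml.sax.ContentHandler):
    """Collects the text of each terminal element, keyed by tag name."""

    def __init__(self):
        super().__init__()
        self.current_key = None   # tag of the element being read
        self.current_value = []   # text chunks received so far
        self.results = {}         # tag -> accumulated text

    def startElement(self, name, attrs):
        # Plays the role of parser:didStartElement:...: start a fresh buffer.
        self.current_key = name
        self.current_value = []

    def characters(self, content):
        # Plays the role of parser:foundCharacters:. It may be called more
        # than once per element, so append instead of assigning.
        self.current_value.append(content)

    def endElement(self, name):
        # Plays the role of parser:didEndElement:...: the element is complete.
        text = "".join(self.current_value).strip()
        if name == self.current_key and text:
            self.results[name] = text
        self.current_key = None
        self.current_value = []


def parse_record(xml_text):
    """Parse an XML string and return the collected tag -> text mapping."""
    handler = Reader()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.results


# Made-up record, not real PubMed output.
SAMPLE = (
    "<Article>"
    "<ArticleTitle>A made-up title</ArticleTitle>"
    "<AbstractText>Parsing &amp; more parsing.</AbstractText>"
    "</Article>"
)

if __name__ == "__main__":
    print(parse_record(SAMPLE))
```

The entity in the sample abstract is there on purpose: a parser is free to report the text around it as separate chunks, which is why the buffer is joined only when the end tag arrives.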

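One problem we will need to deal with is duplicate tags. For example, in a PubMed record for a paper with multiple authors, you will receive "LastName" once for each of them. These have to be kept straight; if the values are stored in a dictionary keyed on the tag alone, each author overwrites the previous one.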
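
One way to keep duplicates apart is to count how often each tag has been seen and fold that count into the key. The Python sketch below (xml.sax again, not the Cocoa code) shows the idea. The "LastName.1" key format, the class name CountingReader, and the author names in the sample are assumptions made up for this example.

```python
import xml.sax
from collections import defaultdict


class CountingReader(xml.sax.ContentHandler):
    """Stores repeated tags under numbered keys such as 'LastName.2'."""

    def __init__(self):
        super().__init__()
        self.buffer = []                    # text chunks of the current element
        self.key_counts = defaultdict(int)  # tag -> times seen so far
        self.results = {}                   # numbered key -> text

    def startElement(self, name, attrs):
        self.buffer = []

    def characters(self, content):
        self.buffer.append(content)

    def endElement(self, name):
        text = "".join(self.buffer).strip()
        self.buffer = []
        if not text:
            return  # container element, nothing to store
        self.key_counts[name] += 1
        self.results[f"{name}.{self.key_counts[name]}"] = text


def parse_authors(xml_text):
    """Parse an XML string and return the numbered key -> text mapping."""
    handler = CountingReader()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.results


# Made-up author list, not a real PubMed record.
AUTHORS = (
    "<AuthorList>"
    "<Author><LastName>Smith</LastName><ForeName>Ann</ForeName></Author>"
    "<Author><LastName>Jones</LastName><ForeName>Bob</ForeName></Author>"
    "</AuthorList>"
)

if __name__ == "__main__":
    for key, value in parse_authors(AUTHORS).items():
        print(key, value)
```

Because LastName and ForeName are counted independently, matching suffixes identify the same author, at least as long as every author has both fields.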

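Also, of course, you have to know what you're looking for. And if you want to keep more information (like the path that leads to a particular node), you'll have to track it yourself. At least the way I've initialized the parser, I'm not receiving anything for the other arguments like qualifiedName and attributes. For namespaceURI and qualifiedName that is expected, since NSXMLParser only fills them in when namespace processing is turned on with setShouldProcessNamespaces:. In practice, you will only receive meaningful data for terminal nodes.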
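
Keeping track of the path yourself usually comes down to a stack: push the tag when an element starts, pop it when the element ends. Here is a minimal Python sketch of that bookkeeping with xml.sax (an analogue, not the Cocoa code), which also records any attributes on the element. The class name PathReader, the slash-separated path format, and the PMID value in the sample are placeholders invented for illustration.

```python
import xml.sax


class PathReader(xml.sax.ContentHandler):
    """Records (path, attributes, text) for every element that has text."""

    def __init__(self):
        super().__init__()
        self.stack = []    # (tag, attributes) of open elements, outermost first
        self.buffer = []   # text chunks of the current element
        self.results = []  # list of (path, attributes, text) tuples

    def startElement(self, name, attrs):
        self.stack.append((name, dict(attrs.items())))
        self.buffer = []

    def characters(self, content):
        self.buffer.append(content)

    def endElement(self, name):
        text = "".join(self.buffer).strip()
        self.buffer = []
        # Build the path before popping, so it includes this element.
        path = "/".join(tag for tag, _ in self.stack)
        _, attributes = self.stack.pop()
        if text:
            self.results.append((path, attributes, text))


def parse_paths(xml_text):
    """Parse an XML string and return the recorded tuples in document order."""
    handler = PathReader()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.results


# Made-up record; the PMID is a placeholder, not a real identifier.
RECORD = (
    "<PubmedArticle><MedlineCitation>"
    '<PMID Version="1">12345678</PMID>'
    "<Article><ArticleTitle>A made-up title</ArticleTitle></Article>"
    "</MedlineCitation></PubmedArticle>"
)

if __name__ == "__main__":
    for entry in parse_paths(RECORD):
        print(entry)
```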

Here is the code for Reader.h:

#import <Foundation/Foundation.h>

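// Declaring conformance lets a Reader be assigned as the parser's delegate
// without a compiler warning.
@interface Reader : NSObject <NSXMLParserDelegate>
{
    NSString *cKey;                  // tag of the element currently being read
    NSMutableString *cValue;         // accumulates the text of that element
    NSMutableDictionary *mD;         // dictionary of parsed entries
    NSMutableDictionary *keyCountD;  // how many times each key has been seen
}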
- (void)read:(NSURL *)xmlURL;
- (void)reportAll;
- (NSMutableArray *)sortedKeys;
- (NSMutableDictionary *)processEntries;

// NSXMLParserDelegate methods implemented by Reader
- (void)parser:(NSXMLParser *)parser 
    didStartElement:(NSString *)elementName 
    namespaceURI:(NSString *)namespaceURI 
    qualifiedName:(NSString *)qName 
    attributes:(NSDictionary *)attributeDict;
    
- (void)parser:(NSXMLParser *)parser 
    foundCharacters:(NSString *)s;

- (void)parser:(NSXMLParser *)parser 
    didEndElement:(NSString *)elementName 
    namespaceURI:(NSString *)namespaceURI 
    qualifiedName:(NSString *)qName;
    
@property (retain) NSMutableDictionary *mD;
@property (retain) NSMutableDictionary *keyCountD;

@end