Python for Bioinformatics: XML Reader (2)

Thursday, October 22, 2009

Here we convert a GenBank record (received as XML) into a plist holding just the values we're interested in, like the example above. The first post in the series is here.

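Below is the code for roughly the first half of the Reader class. We call it by sending it the message read:(NSURL *)xmlURL. Parsing itself is straightforward: NSXMLParser calls our delegate methods as it meets each start tag, run of characters, and end tag. But how should we deal with duplicate element names?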

When I get parser:didStartElement:..., I check whether the element name is already a key in the dictionary. If it is, the key changes from, say, "FirstName" to "FirstNamei", where i counts the elements of that name seen so far, including the current one. This only happens for elements that occur more than once. So the abstract sits under its own key ("Abstract" or "AbstractText"), but with multiple authors "FirstName" is set to "multiples" and the actual values live under "FirstName1", "FirstName2", etc. Simple, but effective.
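Since the renaming rule is easier to follow when you can run it, here is a minimal Python sketch of the same bookkeeping. It is a restatement for illustration, not part of the Reader class: the function name add_element and the sample names are made up, values and counts stand in for mD and keyCountD, and the value is stored immediately rather than waiting for the end tag as the Objective-C code does.

```python
def add_element(values, counts, name, text):
    """Store text under name; repeated names become name1, name2, ..."""
    if name in values:
        n = counts[name]          # how many we have seen so far
        if n == 1:
            # first duplicate: move the original value to name1
            # and mark the plain key as holding multiples
            values[name + "1"] = values[name]
            values[name] = "multiples"
        n += 1
        key = name + str(n)
    else:
        n = 1
        key = name
    counts[name] = n              # don't forget to save the count
    values[key] = text
    return key


values, counts = {}, {}
for name, text in [("AbstractText", "An abstract."),
                   ("FirstName", "Ann"),
                   ("FirstName", "Bob"),
                   ("FirstName", "Cy")]:
    add_element(values, counts, name, text)

print(values)
# {'AbstractText': 'An abstract.', 'FirstName': 'multiples',
#  'FirstName1': 'Ann', 'FirstName2': 'Bob', 'FirstName3': 'Cy'}
```

Note that a name seen only once keeps its plain key, so code reading the plist has to check for the "multiples" marker before looking for the numbered keys.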

Dealing with GenBank sure is fun. For example, one record has "Forename" as the key, while another has "FirstName". Why in the world would they do that?

#import <Foundation/Foundation.h>
#import "Reader.h"

@implementation Reader
@synthesize mD;
@synthesize keyCountD;

- (id)init {
    self = [super init];
    if (nil == self) return nil;
    [self setMD:[NSMutableDictionary
        dictionaryWithCapacity:20]];
    [self setKeyCountD:[NSMutableDictionary
        dictionaryWithCapacity:5]];
    return self;
}

- (void)read:(NSURL *)xmlURL{
    NSXMLParser *parser;
    parser = [[NSXMLParser alloc] 
        initWithContentsOfURL:xmlURL];
    [parser setDelegate:self];
    [parser 
        setShouldResolveExternalEntities:YES]; 
    BOOL success = [parser parse];
    if (success) { NSLog(@"parse succeeded"); }
    else {
        // report why the parse failed
        NSLog(@"parse failed: %@",
            [parser parserError]);
    }
}

- (void)parser:(NSXMLParser *)parser 
    didStartElement:(NSString *)elementName 
    namespaceURI:(NSString *)namespaceURI 
    qualifiedName:(NSString *)qName 
    attributes:(NSDictionary *)attributeDict {
    //NSLog(@"didStart %@ %@ %@ %@",
        //elementName,namespaceURI,qName,attributeDict);
    NSString *k = elementName;  
    int n;  
    // we check for multiple values (a direct lookup,
    // rather than building and scanning allKeys)
    if ([mD objectForKey:k] != nil) {
        // how many have we seen
        n = [[keyCountD 
            objectForKey:k] intValue];  
        // special for the first dup item seen
        if (n == 1) {
            // fetch what's in the dict
            id obj = [mD objectForKey:k];
            // get new key for this first value and set obj
            cKey = [elementName stringByAppendingString:
                    [NSString stringWithFormat:@"%i",n]];
            [mD setObject:obj forKey:cKey];
            // mark the original value as multiple
            [mD setObject:@"multiples" forKey:k];
        } 
        // now get a value for the current key and count
        n = n + 1;
        cKey = [k stringByAppendingString:
                [NSString stringWithFormat:@"%i",n]];
    }
    else {
        n = 1;
        cKey = k;
    }
    // don't forget to save the count
    [keyCountD setObject:[NSNumber numberWithInt:n]
                  forKey:k];                
    cValue = [NSMutableString 
        stringWithString:@""];
}

- (void)parser:(NSXMLParser *)parser 
    foundCharacters:(NSString *)s {
    //NSLog(@"found: %@", s);
    [cValue appendString:s];
}

// NSXMLParser sends this for every closing tag,
// not just for leaves. A container such as 'Author'
// has no text of its own; cKey and cValue still
// refer to the last leaf, which is stored again.
- (void)parser:(NSXMLParser *)parser 
    didEndElement:(NSString *)elementName 
    namespaceURI:(NSString *)namespaceURI 
    qualifiedName:(NSString *)qName {
    
    //NSLog(@"didEnd %@", elementName);
    NSCharacterSet *cs = [NSCharacterSet 
        characterSetWithCharactersInString:@" \n"];
    NSString *s2;
    if (cKey != nil) {
        s2 = [cValue 
            stringByTrimmingCharactersInSet:cs];
        [mD setObject:s2 forKey:cKey];
        //NSLog(@"object %@ for key %@", cValue, cKey);
    }
}