Thursday, October 22, 2009

XML Reader (2)



Here we convert a Genbank record (received as XML) into a plist containing just the values we're interested in, like the example above. The first post in the series is here.

Below is the code for about the first half of the Reader class. We call it by sending it the message read:(NSURL *)xmlURL. Parsing is very straightforward. But how to deal with duplicates?

When I get parser:didStartElement:..., I check whether the element name is already a key in the dictionary. If it is, the key changes from, say, "FirstName" to "FirstNamei", where i is the number of these elements seen so far, including the current one. This only happens for elements that occur more than once. So the abstract sits under its own key ("Abstract" or "AbstractText"), but for multiple authors "FirstName" is set to "multiples" and the values go under "FirstName1", "FirstName2", etc. Simple, but effective.
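
To see the renaming scheme on its own, here is a minimal sketch that pulls it out of the parser callbacks. The helper KeyForElement, its dict and counts arguments, and the sample names are all made up for illustration and are not part of the Reader class; it tracks duplicates through the count dictionary alone, which is a slight simplification of the listing below.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper (not part of Reader): returns the key under which the
// next value for `name` should be stored, updating `dict` and `counts`.
static NSString *KeyForElement(NSString *name,
                               NSMutableDictionary *dict,
                               NSMutableDictionary *counts) {
    // objectForKey: returns nil for an unseen name, so n starts at 0
    NSInteger n = [[counts objectForKey:name] integerValue];
    NSString *key = name;
    if (n >= 1) {
        if (n == 1) {
            // first duplicate: move the original value to "<name>1"
            id first = [dict objectForKey:name];
            if (first != nil) {
                [dict setObject:first
                         forKey:[name stringByAppendingString:@"1"]];
            }
            // and leave a marker under the plain name
            [dict setObject:@"multiples" forKey:name];
        }
        key = [NSString stringWithFormat:@"%@%ld", name, (long)(n + 1)];
    }
    [counts setObject:[NSNumber numberWithInteger:n + 1] forKey:name];
    return key;
}

int main(void) {
    @autoreleasepool {
        NSMutableDictionary *dict = [NSMutableDictionary dictionary];
        NSMutableDictionary *counts = [NSMutableDictionary dictionary];
        // made-up element names and values, in document order
        NSArray *names = [NSArray arrayWithObjects:
            @"Abstract", @"FirstName", @"FirstName", @"FirstName", nil];
        NSArray *values = [NSArray arrayWithObjects:
            @"Some abstract text", @"Ann", @"Bob", @"Cy", nil];
        for (NSUInteger i = 0; i < [names count]; i++) {
            NSString *key = KeyForElement([names objectAtIndex:i],
                                          dict, counts);
            [dict setObject:[values objectAtIndex:i] forKey:key];
        }
        NSLog(@"%@", dict);
    }
    return 0;
}
```

With this input the dictionary should end up with "Abstract" under its own key, "FirstName" set to "multiples", and the three names under "FirstName1" through "FirstName3".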

Dealing with Genbank sure is fun. For example, one record has "Forename" as the key, while another has "FirstName". Why in the world would they do that?


#import <Foundation/Foundation.h>
#import "Reader.h"

@implementation Reader
@synthesize mD;
@synthesize keyCountD;

- (id)init {
    self = [super init];
    if (nil == self) return nil;
    [self setMD:[NSMutableDictionary dictionaryWithCapacity:20]];
    [self setKeyCountD:[NSMutableDictionary dictionaryWithCapacity:5]];
    return self;
}

- (void)read:(NSURL *)xmlURL {
    NSXMLParser *parser = [[NSXMLParser alloc] initWithContentsOfURL:xmlURL];
    [parser setDelegate:self];
    [parser setShouldResolveExternalEntities:YES];
    BOOL success = [parser parse];
    // on failure, log the parser's own error rather than a bare "no"
    if (success) { NSLog(@"parse succeeded"); }
    else { NSLog(@"parse failed: %@", [parser parserError]); }
}

- (void)parser:(NSXMLParser *)parser
    didStartElement:(NSString *)elementName
    namespaceURI:(NSString *)namespaceURI
    qualifiedName:(NSString *)qName
    attributes:(NSDictionary *)attributeDict {
    //NSLog(@"didStart %@ %@ %@ %@",
    //elementName,namespaceURI,qName,attributeDict);
    NSString *k = elementName;
    int n;
    // check whether a value is already stored under this element name
    if ([mD objectForKey:k] != nil) {
        // how many have we seen
        n = [[keyCountD objectForKey:k] intValue];
        // special for the first dup item seen
        if (n == 1) {
            // fetch what's in the dict
            id obj = [mD objectForKey:k];
            // store that first value under its numbered key, e.g. "FirstName1"
            cKey = [NSString stringWithFormat:@"%@%i", k, n];
            [mD setObject:obj forKey:cKey];
            // mark the original key as having multiple values
            [mD setObject:@"multiples" forKey:k];
        }
        // now get the numbered key for the current element and update the count
        n = n + 1;
        cKey = [NSString stringWithFormat:@"%@%i", k, n];
    }
    else {
        n = 1;
        cKey = k;
    }
    // don't forget to save the count
    [keyCountD setObject:[NSNumber numberWithInt:n] forKey:k];
    // start a fresh buffer for this element's text
    cValue = [NSMutableString string];
}

- (void)parser:(NSXMLParser *)parser
    foundCharacters:(NSString *)s {
    //NSLog(@"found: %@", s);
    // the text of one element may arrive in several pieces, so accumulate
    [cValue appendString:s];
}

// the parser sends this for every closing tag, not just a terminal leaf,
// but in practice only a leaf gets a value stored under its own key
// e.g. key 'Author' has no value to set
// but only for new articles!
- (void)parser:(NSXMLParser *)parser
    didEndElement:(NSString *)elementName
    namespaceURI:(NSString *)namespaceURI
    qualifiedName:(NSString *)qName {

    //NSLog(@"didEnd %@", elementName);
    // trim all leading and trailing whitespace, including tabs and newlines
    NSCharacterSet *cs = [NSCharacterSet whitespaceAndNewlineCharacterSet];
    if (cKey != nil) {
        NSString *s2 = [cValue stringByTrimmingCharactersInSet:cs];
        [mD setObject:s2 forKey:cKey];
        //NSLog(@"object %@ for key %@", cValue, cKey);
    }
}