Python for Bioinformatics: simple C example 4: file read

Sunday, January 2, 2011

This post describes basic use of fscanf to read data in from a file. A detailed manual page about the function is here.

We must first allocate storage of the appropriate type for the data we expect (at least, if we wish to save it for further manipulation). So far I've used char, int and double data.

A difficulty is that we usually don't know how much data we will read from the file. For the sites program (here), when we read scores or counts, the file contains an integer as the first value and that specifies how much data is present. But in general we don't know, so we'll have to allocate storage as we go.
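As a rough illustration of the count-first layout, here is a minimal sketch. The helper name read_counted_ints and the exact file layout (a count followed by that many whitespace-separated integers) are assumptions made for this example, not the sites program's actual code.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: reads "n v1 v2 ... vn" from fp.
   Returns a malloc'd array of n ints (the caller frees it) and stores
   n in *n_out, or returns NULL if the count or any value is missing. */
int *read_counted_ints(FILE *fp, int *n_out) {
    assert(fp != NULL && n_out != NULL);
    int n;
    if (fscanf(fp, "%d", &n) != 1 || n <= 0) {
        return NULL;                      /* no usable count at the start */
    }
    int *values = malloc((size_t)n * sizeof *values);
    if (values == NULL) {
        return NULL;                      /* not enough memory */
    }
    for (int i = 0; i < n; i++) {
        if (fscanf(fp, "%d", &values[i]) != 1) {
            free(values);                 /* file is shorter than promised */
            return NULL;
        }
    }
    *n_out = n;
    return values;
}
```

Because the count comes first, a single malloc of exactly the right size is enough. The caller would open the file with fopen, pass the FILE pointer in, and free the returned array when done.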

fscanf takes a pointer to the storage as an argument. We can either pass a pointer directly (for example, a char array or a malloc'd buffer), or pass the address of a variable of the correct type using the & operator. There are lots of options for fscanf, including the ability to read data in larger chunks, to read data of different types (in a specified order), or to skip certain characters, but I'm not going to worry about those complications here.
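To make the two ways of supplying storage concrete, here is a small sketch that reads three values of different types in one call. The Record struct, the read_record name and the "label count score" line layout are all hypothetical, chosen only for illustration.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical record layout, assumed for illustration:
   a label (at most 15 characters), an int and a double, separated by
   whitespace, for example "geneA 12 0.75". */
typedef struct {
    char label[16];
    int count;
    double score;
} Record;

/* Returns 1 if all three fields were read, 0 otherwise. */
int read_record(FILE *fp, Record *r) {
    assert(fp != NULL && r != NULL);
    /* r->label is an array, so it decays to a pointer to its first
       element and is passed directly, with no & needed.
       r->count and r->score are plain variables, so we pass their
       addresses with &. The width in %15s leaves room for the '\0'. */
    int got = fscanf(fp, "%15s %d %lf", r->label, &r->count, &r->score);
    return got == 3;
}
```

The return value of fscanf is the number of items it assigned, so comparing it with the number of conversions in the format string is a simple way to detect a malformed line.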

In the first example, we don't save the data, just read it and echo to stdout. The file 'lorem.txt' is the default source, but an alternate can be specified on the command line.

example1.c:

#include <stdio.h>
#include <stdlib.h>

int main(int argc, const char* argv[]) {
   const char *ifn;
   if (argc > 1) { ifn = argv[1]; }
   else          { ifn = "lorem.txt"; }
   FILE *ifp = fopen(ifn,"r");
   if (ifp == NULL) {
       printf("open file failed with:  %s\n", ifn);
       exit(EXIT_FAILURE);
   }
   char c;
   int result;
   /* fscanf returns the number of items assigned, or EOF at end of file */
   result = fscanf(ifp,"%c",&c);
   if (result == EOF) {
       printf("file read failed with:  %s\n", ifn);
       exit(EXIT_FAILURE);
   }
   int count = 0;
   while (result != EOF) {
       /* start a new line at the first space after 40 characters */
       if ((count > 40) && (c==' ')) {
           printf("\n");
           count = 0;
       }
       else {
           printf("%c", c);
           count++;
       }
       result = fscanf(ifp,"%c",&c);
   }
   printf("\n");
   fclose(ifp);
   return 0;
}

output:

$ gcc example1.c -o test
$ ./test
Lorem ipsum dolor sit amet, consectetur adipisicing
elit, sed do eiusmod tempor incididunt ut
labore et dolore magna aliqua. Ut enim ad
minim veniam, quis nostrud exercitation ullamco
laboris nisi ut aliquip ex ea commodo consequat.
Duis aute irure dolor in reprehenderit in
voluptate velit esse cillum dolore eu fugiat
nulla pariatur. Excepteur sint occaecat cupidatat
non proident, sunt in culpa qui officia deserunt
mollit anim id est laborum.

We read integers (or floats) by changing the format string and the type of the variable passed to fscanf (%d for an int, %f for a float, %lf for a double):

   int value;
   int result;
   result = fscanf(ifp,"%d",&value);
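Extending that snippet into a loop, a minimal sketch for reading every integer in a file might look like the following. The helper name sum_ints is hypothetical, and the input is assumed to be whitespace-separated integers.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: adds up whitespace-separated integers from fp.
   Returns how many values were read and stores their total in *sum_out. */
int sum_ints(FILE *fp, long *sum_out) {
    assert(fp != NULL && sum_out != NULL);
    int value;
    int n = 0;
    long sum = 0;
    /* fscanf returns 1 for each int converted, 0 if the next token is
       not a number, and EOF at end of file; testing == 1 stops the loop
       in both of the latter cases. */
    while (fscanf(fp, "%d", &value) == 1) {
        sum += value;
        n++;
    }
    *sum_out = sum;
    return n;
}
```

Testing for == 1 rather than != EOF matters with %d: if the file contains a stray non-numeric token, fscanf returns 0 without consuming it, and a loop that only checks for EOF would never terminate.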

In the third example, we're reading the DNA sequence of E. coli (more than 4 million nucleotides). A rather crude approach is to just allocate an array of sufficient size (N = 5000000 works). A more flexible and less wasteful implementation would appropriately scale up the memory usage as needed. We substitute the following for the second half of the code above (starting at int count = 0;), reading just the first 240 nt:

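   int N = 240;
   /* room for N characters plus a terminating '\0' */
   char *buffer = (char *) malloc (N+1);
   if (buffer == NULL) {
       printf("not enough memory\n");
       exit(EXIT_FAILURE);
   }
   int i = 0;
   /* c already holds the first character, read by the fscanf call above */
   buffer[i++] = c;
   /* advance i only after a successful read, so a failed read is not counted */
   while ((i < N) && (fscanf(ifp,"%c", &buffer[i]) == 1)) {
       i++;
   }
   buffer[i] = '\0';
   int count = i;
   for (i=0; i < count; i++) {
       /* start a new line every 60 characters */
       if ((i) && (!(i%60))) {
           printf("\n");
       }
       printf("%c", buffer[i]);
   }
   printf("\n");
   free(buffer);
   fclose(ifp);
   return 0;
}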

output:

$ gcc example3.c -o test
$ ./test ECsequence.txt
agcttttcattctgactgcaacgggcaatatgtctctgtgtggattaaaaaaagagtgtc
tgatagcagcttctgaactggttacctgccgtgagtaaattaaaattttattgacttagg
tcactaaatactttaaccaatataggcatagcgcacagacagataaaaattacagagtac
acaacatccatgaaacgcattagcaccaccattaccaccaccatcaccattaccacaggt

Of course, the appropriate source file must exist for this to work.
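For the more flexible approach mentioned earlier (scaling up memory as needed rather than guessing N), one common pattern is to double the buffer with realloc whenever it fills. The sketch below is only one way to do it; the name read_sequence, the starting capacity of 64 and the choice to skip whitespace are assumptions for this example.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: reads every non-whitespace character from fp into
   a buffer that grows as needed. Returns a '\0'-terminated malloc'd
   string (the caller frees it) and stores its length in *len_out,
   or returns NULL if memory runs out. */
char *read_sequence(FILE *fp, size_t *len_out) {
    assert(fp != NULL && len_out != NULL);
    size_t cap = 64;                      /* arbitrary starting capacity */
    size_t len = 0;
    char *buf = malloc(cap);
    if (buf == NULL) {
        return NULL;
    }
    char c;
    /* the leading space in " %c" skips any whitespace, including newlines */
    while (fscanf(fp, " %c", &c) == 1) {
        if (len + 1 >= cap) {             /* keep one byte for the '\0' */
            size_t new_cap = cap * 2;
            char *tmp = realloc(buf, new_cap);
            if (tmp == NULL) {
                free(buf);                /* realloc failed: old block is still ours */
                return NULL;
            }
            buf = tmp;
            cap = new_cap;
        }
        buf[len++] = c;
    }
    buf[len] = '\0';
    *len_out = len;
    return buf;
}
```

Doubling keeps the number of realloc calls small (about 17 doublings from 64 bytes would cover a 5 million character sequence), at the cost of possibly holding up to twice the memory actually needed.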

Zipped project files on Dropbox (here)