Python for Bioinformatics: simple C example 1

Tuesday, December 28, 2010

This post is in the category of notes to myself. Unless you're a complete newcomer to programming, it probably won't help you. But I hope it will help me to remember the essentials.

The smallest fundamental data type in C is the char. A char, whether signed or unsigned, occupies one byte, which is 8 bits on essentially every current platform. We can assign to a char variable using either a single-quoted character constant like 'a' or a numeric constant, whether decimal, octal (leading 0) or hexadecimal (leading 0x):

char c;
c = 'a';
c = 97;
c = 0141;
c = 0x61;
printf("%c\n", c);

The call to printf (declared in stdio.h) prints a no matter which of the 4 assignments came last, because on an ASCII system all four store the same value, 97.

For the example I'm showing here, the #include statements we need are:

#include <stdio.h>
#include <string.h>
#include <math.h>

These give access to the printf, strlen and pow functions, which are declared in these .h files and defined elsewhere.

Our approach to examining the individual bits of a char is to use a combination of the bitwise AND operator (&) and a bit-shift operation, as shown in the following code fragment:

void print_byte(unsigned char x) {
    int i, value = 0;
    char A[8];
    // read lowest to highest
    for (i = 0; i < 8; i++) {
        if (x & 01) { A[i] = '1'; }
        else        { A[i] = '0'; }
        x >>= 1;
    }

We examine the least significant bit of x by doing a bitwise AND using x and the octal constant 01 (00000001 in binary). We get each bit in turn by doing a rightward bit-shift operation, and assign the result directly to x. Since this approach reads the bits from right to left, I decided to just save the characters '0' and '1' to an array and then print them in reverse order.

Another approach would be to test the high-order bit and print it directly. To do that, you would use the test if (x & 0x80) and a leftward bit-shift, x <<= 1.

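    // value was already declared (and set to 0) at the top of the function;
    // declaring it a second time here would not compile
    // print highest to lowest
    for (i = 7; i >= 0; i--) {
        printf("%c", A[i]);
        if (A[i] == '1') {
            value += (int) pow(2, i);   // 2^i is the weight of bit i
        }
    }
    printf("  %3d  ", value);
}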

We use the pow (power or exponentiation) function from the math library to add up the weight 2^i of each bit that is set, which gives the decimal value of the byte.

A second function applies the first one successively to the chars of a C string variable, defined as shown below, in main.

The variable that we'll print is a pointer to char, or char *, that is, a variable which when dereferenced gives us back a char. Here it points to the first element of a string literal, which is really an array of chars, written in double quotes and ended by a termination marker. In C you normally need to know how many values are present in an array in order not to do something stupid, but in the case of a string the convention is to terminate with the '\0' character, and the compiler adds it to a string literal automatically. The code for print_word is:

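void print_word(const char *s) {
    size_t j, n = strlen(s);    // compute the length once, as an unsigned size
    char z;
    printf("word: %s\n", s);
    for (j = 0; j < n; j++) {
        z = s[j];               // copy one char out of the string
        if (z != '\n') {
            printf("%c ", z);
            print_byte(z);
            printf(" %p  \n", (void *)&s[j]);   // %p expects a void *
        }
    }
}

int main(void) {
    // a string literal already ends with '\0', so we don't write it out
    const char *greeting = "Hello world!";
    printf("greeting:  %s\n", greeting);
    print_word(greeting);
    return 0;
}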

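We combine these four sections of code in a file test.c and compile and run it (on Linux, you may need to add -lm to the gcc command to link the math library that provides pow):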

$ gcc test.c -o test
$ ./test
greeting:  Hello world!
word: Hello world!
H 01001000   72   0x100000e96  
e 01100101  101   0x100000e97  
l 01101100  108   0x100000e98  
l 01101100  108   0x100000e99  
o 01101111  111   0x100000e9a  
  00100000   32   0x100000e9b  
w 01110111  119   0x100000e9c  
o 01101111  111   0x100000e9d  
r 01110010  114   0x100000e9e  
l 01101100  108   0x100000e9f  
d 01100100  100   0x100000ea0  
! 00100001   33   0x100000ea1

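For each character of the string we print the binary representation, the decimal equivalent, and the address of that char in memory. The address comes from the %p format code applied to &s[j]. Since %p formally expects a void *, the portable form of the call is:

printf(" %p  \n", (void *)&s[j]);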

So we can see that the individual chars of the string are laid out in successive bytes in memory.

This is all quite fundamental. The parts I have trouble remembering when I come back to C after a long absence are:

- the distinction between single and double quotes
- one can access the individual chars of a string and assign them to chars
- the formatting codes (% plus c, s, i, f, p etc.)
- remembering to #include
- remembering to add parentheses around this test: if (A[i] == '1')
- remembering to add braces around the entire group of conditional statements in:
if (x) { a; b; c; }

My usage of braces may look non-standard, but I think it makes sense. I always open them at the end of a line, and close them at a position lined up with the first character of the line where they started (except when the expression is really short).

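You'll notice const and unsigned modifiers in the code above. They're not essential for this example, but const documents that print_word won't modify the string, and unsigned keeps the right shift in print_byte free of sign-bit surprises.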

I think that's it. In the next post, I'll develop this into a multi-file project and introduce the make tool.