Numerology

I’m technically on break from school now, although I’m still putting the final polish on one last project. But things have slowed down enough that I can bore you with technical crap again.

There’s an interesting section of TPOP wherein the authors discuss studying the “numerology” of a bug; that is, look for patterns in the data instead of just looking at the code. I’ve had a lot of practice with this idea lately when trying to decipher kernel crashes with nothing but an Oops message to go on (these being much easier than infrequent lockups with no output at all!). It’s quite useful for userspace as well, if you are using a language where memory corruption is a fact of life. To show why, here’s a lengthy anecdote about how I recently used the power of man ascii to find and fix a memory corruption bug in about five minutes while racing to meet a project deadline.

First, some background: in the kernel, linked lists are implemented by embedding the next and previous pointers directly into the linked object, by way of the struct list_node. For example:

struct list_node {
    struct list_node *next;
    struct list_node *prev;
};

/* one item in a list */
struct item {
    int id;
    struct list_node list;
};

/* some other object that has a list of items */
struct bar {
    struct list_node item_list_head;
};

This is a curious construction, but once you get used to it, it’s quite handy — you can embed an object in multiple simultaneous lists without having to worry about the lifetime of void pointers, and without having to malloc container structures all the time. The list code is object-agnostic, operating only on the list_node structures, but of course you need a way to get from a struct list_node back to the object that contains it. For that, there is the container_of macro and friends. These are just C macros that, given the address of the struct list_node, the variable name of the structure, and the type of the item in which the struct is embedded, finds the start of the containing structure with a little pointer arithmetic.

I had implemented a similar set of list routines for a userspace program and started getting weird crashes, so I fired up gdb to see something like the following:

gdb > p *item
$1 = {id = 134269966, item_list = {next = 0x804b008, prev = 0x804b008},
      global_list = {next = 0x72657375, prev = 0x40313030},
      src = "machine.org", '00' <repeats 69 times>}

Had I taken a minute to look at this in detail, I would have noticed that the value of id was quite higher than it should be, and that the leading part of src was truncated. But what jumped out at me immediately was that global_list.next and global_list.prev were obviously wrong pointer values. In Linux on x86, heap addresses are usually above 0x0804b000, and stack addresses are below 0xbfffc000. Clearly, the global_list pointers don’t meet these ranges. However, the item_list.next and item_list.prev addresses looked fine. So either the global list pointers weren’t being initialized, or they were being corrupted somehow. Sometimes that kind of corruption happens if the list pointers aren’t updated properly, causing the code to follow a next pointer into random memory, but then it would be unlikely that the item_list pointers and src had meaningful values. Plus, I unit-tested the list routines and was pretty confident they were working correctly.

My initial theory was that something was overwriting the global_list struct, and the byte values looked to be in the ascii range. So I turned to the ascii(7) man page, which gives you a nice little table of hex ordinal to character value mappings. Looking at the bytes in turn gives: 0x72 = ‘r’, 0x65 = ‘e’, 0x73 = ‘s’, 0x75 = ‘u’, 0x40 = ‘@’, 0x31 = ‘1’, 0x30 = ‘0’, 0x30 = ‘0’. Since x86 is little-endian, this works out to the string “user001@”, which should have been the first part of src. The bug must be where the code filled in the src string. However, that code looked fine. I then decided that the pointer to the entire structure must be offset somehow. Sure enough, when I looked at the surrounding code, I saw something like this:

    struct list_node *n;
    struct item *item;

    for (n = list_start(&item_list_head); n;
         n = list_next(&item_list_head, n))
    {
        item = container_of(n, struct item, global_list);
        /* strncpy and other stuff ... */
    }

Oops. I had passed the wrong list name to container_of(), with the effect that the item pointer was actually 8 bytes prior to the real start of the structure. As a result, the string copy overwrote the pointer values of just one of the list structures, a bug which didn’t actually have a negative effect until later.

There are morals a-plenty, including “macros are evil” and “this is why templates were invented.” Yet, I’ve seen Linus use this technique more than once when others were stuck on a kernel memory corruption bug, so I’ll go with “Kernighan and Pike are always right[1].”

[1] Except, I’m still not sold on using capital letters in Go. Perhaps in time I can be convinced.

2 Replies to “Numerology”

  1. Yep, that too. In fact reading the ascii values in this case is somewhat contrived, though it is in fact what I did first just because I was looking for the excuse to do so.

Comments are closed.