Debugging Memory Errors in C/C++

This page describes a few key techniques I've learned for debugging programs that are suspected of containing memory errors. Principally, this means using memory after it has been freed, and writing beyond the end of an array. Memory leaks are considered briefly at the end.

It's of course rather presumptuous to even write these up, since so much has already been written. I'm not intending to write the be-all and end-all article, just to write up a few of the techniques I use, since I recently had the opportunity to help a friend debug such an error. There are also some links at the end to other resources.

Note that I'm only interested here in memory errors that trash part of the heap. Overwriting the stack may be a cracker's favorite technique, but when it happens in front of the programmer it's usually very easy to track down.

Why are memory errors hard to debug?

The first thing to understand about memory errors is why they're different from other bugs. I claim the main reason they are harder to debug is that they are fragile. By fragile, I mean the bug will often only show up under certain conditions, and that attempts to isolate the bug by changing the program or its input often mask its effects. Since the programmer is then forced to find the needle in the haystack, and cannot use techniques to cut down on the size of the haystack, locating the cause of the problem is very difficult.

Consequently, the first priority when tracking down suspected memory errors is to make the bug more robust. There is a bug in your code, but you need to do something so that the bug's effects cannot be masked by other actions of the program.

Making the bug more robust

I know of two main techniques for reducing the fragility of a memory bug:

Don't re-use memory.
Put empty space between memory blocks.

Why do these techniques help? First, by not re-using memory, we can eliminate temporal dependencies between the bug and the surrounding program. That is, if memory is not re-used, then it no longer matters in what order the relevant blocks are allocated and deallocated.

Second, by putting empty space between blocks, writing past the end of one block (or before its beginning) won't corrupt another. Thus, we break spatial dependencies involving the bug. The space between the blocks should be filled with a known value, and the space should be periodically checked (at least when free is called on that block) to see if the known value has been changed.

With temporal and spatial dependencies reduced, it's less likely that a change to the program or its input will disturb the evidence of the bug's presence.

Of course, your machine must have enough spare memory to run the experiment. But, by making the bug more robust, we can now cut down on the input size! Thus in the end using more space in the short term can lead to using less space in the final, minimized input test case.

The above two techniques are easily implemented in any debug heap implementation. I've modified Doug Lea's malloc to implement the features; my modified version is here: malloc.c, ckheap.h. To compile with the debug features described, set the preprocessor variables DEBUG and DEBUG_HEAP. But of course you can use any implementation, and the debug versions can simply be wrappers around the real malloc.
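To make the wrapper idea concrete, here is a minimal sketch of a debug allocator layered on the real malloc. It is not the code from the modified malloc.c mentioned above; the names debug_malloc, debug_free and debug_count_trashed, the 128-byte zone size, and the 16-byte header are all assumptions chosen for illustration. It pads each block with zones filled with 0xAA, checks the zones when the block is released, and never hands the memory back for re-use.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* All of these constants are illustrative assumptions. */
#define HEADER_SIZE 16    /* room for the stored size; assumed to preserve alignment */
#define ZONE_SIZE   128   /* padding on each side of the user's bytes */
#define ZONE_FILL   0xAA  /* known value written into the padding */

/* Block layout: [header: size][left zone][user bytes][right zone] */

void *debug_malloc(size_t n)
{
    const size_t overhead = HEADER_SIZE + 2 * ZONE_SIZE;
    if (n > SIZE_MAX - overhead)
        return NULL;                          /* size computation would overflow */

    unsigned char *raw = malloc(overhead + n);
    if (raw == NULL)
        return NULL;

    memcpy(raw, &n, sizeof n);                /* remember the user size */
    memset(raw + HEADER_SIZE, ZONE_FILL, ZONE_SIZE);                 /* left zone */
    memset(raw + HEADER_SIZE + ZONE_SIZE + n, ZONE_FILL, ZONE_SIZE); /* right zone */
    return raw + HEADER_SIZE + ZONE_SIZE;
}

/* Count the zone bytes around p that no longer hold ZONE_FILL. */
size_t debug_count_trashed(const void *p)
{
    const unsigned char *user = p;
    const unsigned char *left = user - ZONE_SIZE;
    size_t n;
    memcpy(&n, left - HEADER_SIZE, sizeof n);
    const unsigned char *right = user + n;

    size_t trashed = 0;
    for (size_t i = 0; i < ZONE_SIZE; i++) {
        if (left[i] != ZONE_FILL)
            trashed++;
        if (right[i] != ZONE_FILL)
            trashed++;
    }
    return trashed;
}

void debug_free(void *p)
{
    if (p == NULL)
        return;
    /* Check the zones at least when the block is released. */
    assert(debug_count_trashed(p) == 0 && "allocated zone trashed");
    /* Deliberately no call to free(): the block is never re-used. */
}
```

With this sketch, writing p[10] into a block obtained from debug_malloc(10) lands in the right zone instead of in a neighboring block, and debug_count_trashed(p) then reports one trashed byte. The cost is that memory use only grows, which is the trade-off discussed above.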

Using hardware watchpoints

Intel-compatible x86 processors include debug registers capable of watching up to four addresses. Whenever a read or write to any of the watched addresses happens, the program traps, and the debugger gets control. The debug registers offer a powerful way to find out what line of code is overwriting a given byte, once you know which byte is being overwritten.

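In gdb, the notation for using hardware watchpoints is a little odd, because gdb likes to think of its input as a C expression. The watch command stops when a write changes the watched value; rwatch and awatch stop on reads, or on reads and writes. If you want to stop when address 0xABCDEF is written, then at the gdb prompt type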

  (gdb) watch *((int*)0xABCDEF)

One difficulty is that you can't begin watching an address until the memory it refers to has been mapped (requested from the operating system for use by the program). The usual solution is to step through the program at a rather coarse granularity (skipping over most function calls) until you find a point in time where the address is mapped but has not yet been trashed. Add the watchpoint, then let the program run until the address is written.

An example

Suppose I have a program with a suspected memory error. I compile it with the debug malloc.c, and when I run it I see:

  $ ./tmalloc
  trashed 1 bytes
  tmalloc: malloc.c:1591: checkZones: Assertion `!"right allocated zone trashed"' failed.
  Aborted

I first run the program in the debugger to find the offending address:

  (gdb) run
  Starting program: /home/scott/wrk/cplr/smbase/tmalloc
  trashed 1 bytes
  tmalloc: malloc.c:1591: checkZones: Assertion `!"right allocated zone trashed"' failed.

  Program received signal SIGABRT, Aborted.
  0x400539f1 in __kill () from /lib/libc.so.6
  (gdb) up
  #1  0x400536d4 in raise (sig=6) at ../sysdeps/posix/raise.c:27
  27      ../sysdeps/posix/raise.c: No such file or directory.
  (gdb) up
  #2  0x40054e31 in abort () at ../sysdeps/generic/abort.c:88
  88      ../sysdeps/generic/abort.c: No such file or directory.
  (gdb) up
  #3  0x4004dfd2 in __assert_fail () at assert.c:60
  60      assert.c: No such file or directory.
  (gdb) up
  #4  0x8048d55 in checkZones (p=0x8050838 "\016\001", bytes=270)
      at malloc.c:1591
  (gdb) print p[bytes-1-i]
  $1 = 7 '\a'                 <----- trashed! should be 0xAA
  (gdb) print p+bytes-1-i
  $2 = (unsigned char *) 0x80508c6 "\a", '\252' <repeats 127 times>
  (gdb)                  ^^^^^^^^^
                         this is the trashed address

Now I restart the program and attempt to set a hardware watchpoint:

  (gdb) break main
  Breakpoint 1 at 0x8048b91: file tmalloc.c, line 81.
  (gdb) run
  The program being debugged has been started already.
  Start it from the beginning? (y or n) y

  Starting program: /home/scott/wrk/cplr/smbase/tmalloc

  Breakpoint 1, main () at tmalloc.c:81
  (gdb) watch *((int*)0x80508c6)
  Cannot access memory at address 0x80508c6
  (gdb)

Ok, the memory isn't mapped yet. Single-stepping through main a few times, I find a place where I can insert the watchpoint but the memory in question hasn't yet been trashed. When I then continue the program, the debugger next stops at the bug.

  (gdb) watch *((int*)0x80508c6)
  Hardware watchpoint 3: *(int *) 134547654
  (gdb) c
  Continuing.
  Hardware watchpoint 3: *(int *) 134547654

  Old value = -1431655766
  New value = -1431655929
  offEnd () at tmalloc.c:33
  (gdb) print /x -1431655766
  $1 = 0xaaaaaaaa              <--- what it should be
  (gdb) print /x -1431655929
  $2 = 0xaaaaaa07              <--- what it became after trashing
  (gdb) list
  28
  29      void offEnd()
  30      {
  31        char *p = malloc(10);
  32        p[10] = 7;    // oops       <--- the bug
  33        free(p);
  34      }
  35
  36      void offEndCheck()
  37      {
  (gdb)

In this small program the bug would have been obvious upon inspection, but the technique of course generalizes to cases that are much more complicated.

Dangling references

As mentioned above, a debug heap shouldn't re-use memory. Going one step further, my debug malloc.c overwrites free()'d memory with another known pattern (but does not actually free it). Then, if the program continues to use the memory the mistake will become clear, especially if it tries to interpret the values it finds as pointers (they'll segfault). Double-deallocation is also easy to identify with this scheme.
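The scheme can be sketched in a few lines of C. This is an illustration under assumptions, not the code from malloc.c: the names poison_malloc and poison_free, the 0xDD fill value, and the header layout are all hypothetical. Each block carries a small header recording its size and whether it is live; releasing a block fills it with the pattern and marks it freed, but never returns it to the real heap.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define FREED_FILL 0xDD   /* hypothetical pattern written over released memory */

enum { BLOCK_LIVE = 0x4C495645, BLOCK_FREED = 0x44454144 };

/* Header placed in front of every block. The union with max_align_t
   keeps the user pointer aligned like one returned by malloc. */
typedef union {
    struct {
        size_t size;      /* number of user bytes */
        unsigned state;   /* BLOCK_LIVE or BLOCK_FREED */
    } info;
    max_align_t align;
} PoisonHeader;

void *poison_malloc(size_t n)
{
    if (n > SIZE_MAX - sizeof(PoisonHeader))
        return NULL;                       /* size computation would overflow */

    PoisonHeader *h = malloc(sizeof *h + n);
    if (h == NULL)
        return NULL;
    h->info.size = n;
    h->info.state = BLOCK_LIVE;
    return h + 1;                          /* user bytes follow the header */
}

/* Returns 0 on success, -1 if p had already been released (double free). */
int poison_free(void *p)
{
    if (p == NULL)
        return 0;

    PoisonHeader *h = (PoisonHeader *)p - 1;
    /* Anything else means p did not come from poison_malloc,
       or the header itself has been overwritten. */
    assert(h->info.state == BLOCK_LIVE || h->info.state == BLOCK_FREED);

    if (h->info.state == BLOCK_FREED)
        return -1;

    h->info.state = BLOCK_FREED;
    memset(p, FREED_FILL, h->info.size);   /* overwrite, but keep the memory */
    return 0;                              /* never calls free() */
}
```

Because the memory is never really released, a second poison_free on the same pointer can safely inspect the header and report the double deallocation. Any code that keeps reading the block sees bytes of 0xDD, which stand out in a debugger and make poor pointers.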

Memory leaks

I usually debug memory leaks by printing statistics about calls to malloc and free before and after certain sections of code. If there are more calls to malloc, but the code isn't supposed to be creating long-lived data, then that points to a potential problem. This doesn't easily generalize to long-running programs, but if the program can be broken into units and the leak properties of each unit checked in isolation, most leaks can be found relatively easily.
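A minimal sketch of that bookkeeping follows, assuming the program can be made to call wrappers instead of malloc and free directly. The names counted_malloc, counted_free, live_blocks and leaked_by are hypothetical, as are the two demonstration units; the statistics in malloc.c may be organized differently.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

static unsigned long n_malloc_calls = 0;   /* successful allocations */
static unsigned long n_free_calls = 0;     /* releases of non-NULL pointers */

void *counted_malloc(size_t n)
{
    void *p = malloc(n);
    if (p != NULL)
        n_malloc_calls++;
    return p;
}

void counted_free(void *p)
{
    if (p == NULL)
        return;
    assert(n_free_calls < n_malloc_calls && "more frees than mallocs");
    n_free_calls++;
    free(p);
}

/* Blocks currently allocated through the wrappers. */
long live_blocks(void)
{
    return (long)n_malloc_calls - (long)n_free_calls;
}

/* Run one unit of the program and report how many blocks it left behind. */
long leaked_by(void (*unit)(void))
{
    long before = live_blocks();
    unit();
    long delta = live_blocks() - before;
    if (delta != 0)
        fprintf(stderr, "unit left %ld block(s) allocated\n", delta);
    return delta;
}

/* Two made-up units for demonstration. */
void *forgotten = NULL;   /* kept only so the demonstration can clean up */

void tidy_unit(void)
{
    char *p = counted_malloc(32);
    counted_free(p);
}

void leaky_unit(void)
{
    forgotten = counted_malloc(32);   /* allocated but never released */
}
```

Calling leaked_by(tidy_unit) reports nothing, while leaked_by(leaky_unit) reports one block left allocated. A nonzero result is only a hint: a unit that is supposed to create long-lived data will also show a positive count.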

Conclusion

The C and C++ languages are much-maligned for lack of memory safety, but too often this is seen as a greater problem than it really is (setting security issues aside for the moment). Debugging memory errors requires a different approach than debugging other kinds of errors, but with a little practice they can actually be easier and faster to find, simply because the same techniques (and tools!) can be used over and over.

Some links

I'm not the first or last to write about methods for debugging memory errors. Here are some links to other people who aren't the first or last either (actually, only the first link really matches this description).

Debugging Tools for Dynamic Storage Allocation and Memory Management: Ben Zorn's long list of tools people have written to help debug memory errors.
Doug Lea's malloc: Doug Lea's implementation of malloc.
malloc.c: My modified version of Doug Lea's malloc, version 2.7.0. I've added:
- -DDEBUG_HEAP: don't re-use memory, put empty zones on both sides of allocated space, overwrite deallocated space
- Statistics to track the number of calls to malloc and free.
- A heap walker interface.
- -DTRACE_MALLOC_CALLS: print a message to stderr on every malloc and free
The above malloc.c also needs the header ckheap.h. That's an oversight I plan to correct, but in the meantime this should be enough to compile malloc.c.
gdb: The GNU debugger. The de-facto standard on Linux, for better or worse.
Rational: The makers of Purify, one of the best-known tools for finding memory errors. Purify doesn't require recompiling the program, which certainly has its advantages, but as such it is limited in the ways it can make memory bugs more robust. I think sometimes people reach for a heavyweight solution like Purify when a simple debug heap would be faster and easier.
CCured: I'd be remiss if I didn't mention CCured, a research project I've done quite a bit of work on. CCured instruments the entire program so it can catch a wide variety of bugs, in a way that is sound: if CCured does not report a problem, then no problem occurred during that run of the program. I can't recommend it as the first solution to reach for during debugging, since it takes a fair bit of time and effort to get a program working under CCured. But in the long run, if you can use CCured, it provides a level of assurance well beyond that of any other current technique.