Previous Entry Index Next Entry

Zzz: June 6, 2008

<-- -->

Symptom 2

Many kinds of software have characteristic failure modes. When I worked at a game company, one sign of a flaky program was magenta suddenly showing up everywhere, because we had the zero index mapped to magenta, and errors in pointer and address calculations would leave the video hardware pointing at blank areas of memory, which had been wiped to all-zeroes to make debugging more predictable. Unix programs don't usually do this because the system uses memory as it goes, and a reference to unused memory is caught by the operating system.

At my current job, I periodically spend my evenings looking at stack traces and memory dumps from crashes, and have gotten far, far too good at teasing out the information that I need. One telltale sign, though, tends to escape me until I realize, hours later, that I might have inferred the nature of the problem much more quickly.

So here's a crash dump. Some STL code is dancing around one of its data structures, hits a bad pointer and crashes. Sometimes I wish there was something more constructive to be done than just bail out when that happens; printf() these days will tolerate null strings, and if offered some feature to efficiently determine if a given pointer reference is legal, I for one would not say no. Back in the real world, I notice that rather than some pseudo-random large value, the bad pointer is merely 2.

It used to be that programs started at some quite low address, and frequently the very lowest addresses were valid, which made such problems manifest quite differently, but in the fullness of time it became clear that if you accidentally stored a shoe size or some such thing where a pointer should be, it was better to get nailed with an illegal memory reference than to have the program try to interpret a shoe size as an employee name, a street address, or a picture of a pretty flower, so these days the lower addresses are all marked as out of bounds, so any valid pointer address will be a big one.

Many programs will tend to have, more or less by accident, a favourite number, one that you see again and again. Say for instance that you have a database of various different products, and there's a numeric code for shoes, hats, jackets, and so on, and say that you wind up doing amazing business in jackets but not so much in other things. Eventually the database is going to be full of the number for "jacket", and any program operating on that database will see it a lot. As it happens, our program's favourite number is 2. As with the example of jackets I just gave, this has nothing to do with the program itself, but is instead an artifact of how people use it most of the time.

When you see the program's favourite number appearing where it shouldn't, and you've debugged the same problem enough times, seeing it in a crash dump instills a certain suspicion that it got there because the program was working with input data when things went wrong. And it is not much of a walk from there (suspicion being intellectual and preferring to go about on foot) to considering that perhaps something had gone wrong with the input data itself. At this point the jacket example starts to look a bit weird, since bad input isn't supposed to corrupt a database, but then, bad input isn't supposed to mess up any normal program, so I'm sticking with it. If you find your database filled with a hundred customers all named Mr. Jacket, living at number jacket on Jacket Street, in the city of Jacket, Jacket, reachable by phone at (JAC) KET-JACK, extension ETJA, then you may with some justice suspect that someone tried to enter a big sale of a few hundred different jackets in entirely the wrong way.

So it is with our program and the number 2, but so far I haven't twigged to the possible connection with something amiss in the outside world, and have only realized what happened after a few hours of gruelling detective work, fishing information out of memory dumps and assembly code. I hope the next time I get called in on a four-alarm emergency that is just a customer making some common mistake, that I remember a little earlier that a pointer value of 2 is not merely wrong, but a clue all by itself.

Previous Entry Index Next Entry