C++, Memory Corruption and Crashes! Oh my!

A story travelling through the wild lands of embedded C++, following mystery crashes, memory leaks, uninitialized pointers and compiler bugs

Once upon a time, I graduated from university with not a trace of C++ knowledge under my belt, and promptly got a job writing C++. The codebase ran piggy-backed on a Sierra Wireless SL6087 modem, which gave us roughly 0.5-second time slices where we could run our code before we needed to hand back control to the modem and wait for it to call our code’s entry point again. This, naturally, gave rise to a lot of state machines. We had about 2MB of RAM to play with, a 2G modem which we could send AT commands to, various I2C, UART and SPI interfaces, and a few GPIOs we could wiggle. The UART was connected to a little printer, and the SPI to another microcontroller (a venerable Motorola MCS08GT32) which handled a keypad, some buttons, power, etc.
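
For flavour, those state machines looked roughly like the sketch below (the names, states and entry point are invented for illustration, not the real code): each subsystem did a small, bounded amount of work per tick and then returned, so control always went back to the modem well within the time slice.

// A minimal sketch of the time-sliced state machine pattern.
enum PrintState { PRINT_IDLE, PRINT_HEADER, PRINT_BODY, PRINT_CUT };

class PrinterStateMachine
{
public:
    PrinterStateMachine() : m_state(PRINT_IDLE) {}

    // Called once per time slice: do a little work, then return so
    // control can go back to the modem firmware.
    void tick()
    {
        switch (m_state)
        {
        case PRINT_IDLE:                          break;
        case PRINT_HEADER: m_state = PRINT_BODY;  break; // e.g. push a few bytes to the UART
        case PRINT_BODY:   m_state = PRINT_CUT;   break;
        case PRINT_CUT:    m_state = PRINT_IDLE;  break;
        }
    }

private:
    PrintState m_state;
};

// The entry point the modem calls at the start of every ~0.5s slice.
void appEntryPoint()
{
    static PrinterStateMachine printer;
    printer.tick();
}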

At the time I joined, the product had a significant uptime deficiency. The initial contractors who had written the codebase were more fluent in Java than C++, and this led to a number of Java-isms that didn’t work particularly well in a constrained memory environment, such as missing deletes for allocated memory. There had already been a lot of work on improving the uptime, which had gone from ~20 minutes to about 2 hours before the device would fail, usually with an OOM, and do a hard reset. This was particularly annoying if it happened while the user was doing something. As a result, a lot of work had been done in making sure that a crash at any point in time resulted in certain states not being re-run (e.g. the deduction from the customer’s account balance), while other things always reran (e.g. printing out a receipt with a redeemable voucher code).
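
A contrived example of the kind of Java-ism we kept finding (not the actual code): an object newed up per transaction and then simply dropped, which a garbage collector would quietly clean up but our 2MB heap would not.

#include <string>

struct Receipt
{
    std::string header;
    std::string body;
};

void handleTransaction()
{
    Receipt* receipt = new Receipt();   // fine in Java, a leak in C++
    receipt->header = "THANK YOU FOR YOUR PURCHASE";
    // ... format and print the receipt ...
    // no delete: a few thousand transactions later, the heap is exhausted
}

The fix was usually either an explicit delete at the end of the function or, better still, just putting the object on the stack.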

We scratched our heads for a good while, and eventually received the suggestion from Paul Butcher (who was on the board of directors) to give NVWA a try (it was hosted on SourceForge back then). This was a bunch of macros and some tracking code, which overrode all instances of new, malloc/alloc, delete and free. It then did its own memory tracking on top of the normal allocator, and had a special function which would tell you which line of code had allocated every single chunk of memory currently in use. The overhead was pretty high, but this quickly let us track down a few remaining memory leaks (which came, invariably, from unclear contracts in APIs between different classes on who owned memory…) and increase the uptime to, on average, about 8 hours.
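
I won’t reproduce NVWA here, but the core trick behind this kind of tracker is easy to sketch. The following is a simplified illustration of the general technique rather than NVWA’s actual implementation, written in the pre-C++11 style that toolchain would have needed: a macro smuggles __FILE__ and __LINE__ into operator new, and a fixed-size table remembers every live allocation so you can dump it on demand.

#include <cstdio>
#include <cstdlib>
#include <new>

struct AllocRecord
{
    void*       ptr;
    std::size_t size;
    const char* file;
    int         line;
};

static const int kMaxAllocs = 4096;
static AllocRecord g_allocs[kMaxAllocs];   // fixed table so the tracker never allocates

// Placement-style overload that records where each allocation came from.
void* operator new(std::size_t size, const char* file, int line)
{
    void* p = std::malloc(size);
    if (!p) throw std::bad_alloc();
    for (int i = 0; i < kMaxAllocs; ++i)
        if (g_allocs[i].ptr == 0)
        {
            g_allocs[i].ptr  = p;
            g_allocs[i].size = size;
            g_allocs[i].file = file;
            g_allocs[i].line = line;
            break;
        }
    return p;
}

// Untracked allocations (e.g. from inside the standard library) still go
// through malloc so that the replacement delete below can free them.
void* operator new(std::size_t size) throw(std::bad_alloc)
{
    void* p = std::malloc(size);
    if (!p) throw std::bad_alloc();
    return p;
}

// Replacement for the normal delete: forget the record, then free.
void operator delete(void* p) throw()
{
    for (int i = 0; i < kMaxAllocs; ++i)
        if (g_allocs[i].ptr == p) { g_allocs[i].ptr = 0; break; }
    std::free(p);
}

// Matching form so the compiler can clean up if a constructor throws.
void operator delete(void* p, const char*, int) throw() { operator delete(p); }

// In the real thing this macro lives in a header included everywhere,
// so every plain `new Foo()` records its file and line.
#define new new(__FILE__, __LINE__)

// Dump everything still alive: each entry is a long-lived object or a leak.
void dumpAllocations()
{
    for (int i = 0; i < kMaxAllocs; ++i)
        if (g_allocs[i].ptr != 0)
            std::printf("%lu bytes allocated at %s:%d\n",
                        (unsigned long)g_allocs[i].size,
                        g_allocs[i].file, g_allocs[i].line);
}

The real library also handles the array forms, placement new and thread safety, none of which this sketch attempts.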

Now, what was different about the new uptime was that it seemed to be completely random. Sometimes it would crash within a minute of startup, and sometimes it would run for days. We had hundreds of devices in the field, which meant that after a software rollout we could get stats pretty quickly, but really this pointed to a different failure mode from what we had before. During lengthy investigations we found one instance of something like:

void internalApi(APIStruct* callInfo);

void doThing()
{
    APIStruct* info;    // never initialized, so this points at whatever garbage was on the stack
    internalApi(info);
}

Fixing this to be

void internalApi(APIStruct* callInfo);

void doThing()
{
    APIStruct info;     // a real object on the stack, so &info below is valid
    internalApi(&info);
}

made us feel a lot better, but didn’t have a noticeable impact on uptime. (It turned out that this was so rarely called that it probably didn’t matter).

Then came the bad times. As we continued rolling out new code, we saw the memory corruption shift around to different places. At some point it touched something very dear to us (customer account balances), which meant any memory corruption could turn into a round of Who Wants To Be A Millionaire for our end-users. Obviously, not where we wanted to be. We started writing various ProtectedInteger and ProtectedLong classes that kept duplicates of their values and checked them on every read and write. We instrumented our code to death. And, somehow, the problem kept eluding us. It was around this point that we noticed that turning on the NVWA builds would cause the crashes to occur much more frequently. It was also around this time that I discovered toolchains.
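
The idea behind those classes was roughly the following (a reconstruction from memory with a hypothetical corruption handler, not the original code): keep a second copy of the value and refuse to carry on if the two ever disagree.

#include <cstdio>
#include <cstdlib>

class ProtectedLong
{
public:
    explicit ProtectedLong(long value = 0) : m_value(value), m_shadow(value) {}

    void set(long value)
    {
        check();            // catch corruption that happened since the last access
        m_value  = value;
        m_shadow = value;
    }

    long get() const
    {
        check();
        return m_value;
    }

private:
    void check() const
    {
        if (m_value != m_shadow)
        {
            // Hypothetical handler: in practice you would log everything you
            // can and hard-reset rather than carry on with a corrupt balance.
            std::fprintf(stderr, "ProtectedLong corruption detected!\n");
            std::abort();
        }
    }

    long m_value;
    long m_shadow;   // duplicate kept only so stray writes can be detected
};

Keeping the balance in one of these meant a stray write tripped the check on the next access, which at least told us that corruption had happened, even if it still didn’t tell us where it came from.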

Desperately, I spent a weekend upgrading the toolchain from the “ARM_GCC_ELF”, which was based on some ancient GCC 3.something, up to the newest “ARM_EABI_GCC”, based on GCC 4.something. Along the way the new compiler surfaced plenty of warnings, and even errors, that needed to be squashed, but on Monday, when I finally had a working build, suddenly our uptime was as long as the device’s battery stayed charged. I think from that build we even got to several months on a device that didn’t have internet and so never got any upgrades. But, from that point forward, we knew that any crash was due to something we’d done, and I don’t think we had an issue with crashing after that. So yeah, it was a compiler bug, it seems.

Debugging crashes in a large codebase is really hard. From this experience, and comparing it with later adventures, it is much easier to keep a small codebase crash-free and then grow it incrementally, making sure crashes aren’t introduced along the way.

Of course, 1 year later the SL6087 got an end-of-life notification and we redesigned the whole product from scratch. It probably would have been better to do that sooner, but you never know that at the time.