[[!tag programming obnam btree]]

I've recently spent five days debugging a problem in obnam, my backup program. It turned out to be a bug in the B-tree implementation I wrote for obnam. The process was pretty long, and it might be instructive to debrief myself.

The bug manifested itself when I ran a backup test that was larger than any I had run before: one terabyte of live data. The symptom was a key lookup problem in a B-tree.

Good: the problem was entirely deterministic, and easy to reproduce.

Bad: it took hours to reproduce it.

My first step was to fiddle with the obnam parameters to see if I could make it fail faster, but in the same way. I got the time down to half an hour.

After this, I started fiddling with the code to see if I could figure out what was happening, and perhaps find a quick fix for the bug. After some hours, I realized I was being stupid. I was making essentially random changes to the code, instead of hunting down the actual cause. I had no idea what was going wrong, so trying to fix it was an act of utter stupidity. So I went back to where I had started from.

The next step was to run "obnam fsck" on the failing backup store. This did not find any problems. The fsck I've written is a bit simplistic: it merely checks that every generation is found for every client in the backup store, and that within every generation each file's metadata and data chunks are found. I will improve fsck in the future, but right now it was not helpful.
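
To give a flavour of the kind of check fsck does, here is a minimal sketch. The store, client, and generation objects and their methods are invented for illustration; they are not obnam's real API.

    def simplistic_fsck(store):
        # Walk every client, every generation, and every file, and check
        # that the metadata and all referenced data chunks can be found.
        errors = []
        for client in store.list_clients():
            for gen in client.list_generations():
                for filename in gen.list_files():
                    if gen.get_metadata(filename) is None:
                        errors.append('missing metadata for %s' % filename)
                    for chunk_id in gen.get_chunk_ids(filename):
                        if not store.chunk_exists(chunk_id):
                            errors.append('missing chunk %s for %s' %
                                          (chunk_id, filename))
        return errors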

I did not have a tool to verify that the actual B-trees were OK. So I wrote one. It, too, is quite simplistic, but at least it proved that the key was still present in the on-disk version of the B-tree.
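
The tool boils down to walking the on-disk nodes by hand and checking that the key can still be reached from the root. Here is a simplified sketch of that lookup; the node layout and the load_node helper are made up for illustration and do not match the real on-disk format.

    def find_key_on_disk(load_node, root_id, key):
        # Descend from the root, at each index node following the child
        # whose key range covers the key we are looking for. Index nodes
        # are assumed to hold a sorted list of (first key, child id) pairs.
        node = load_node(root_id)
        while node['kind'] == 'index':
            child_id = None
            for first_key, candidate_id in node['children']:
                if first_key <= key:
                    child_id = candidate_id
            if child_id is None:
                return None  # key is smaller than anything in the tree
            node = load_node(child_id)
        # Leaf node: the key is either here or nowhere.
        return node['values'].get(key)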

At this point, I did a code review of the B-tree code. It had been some months since I'd touched the code, and hence the review brought up a bunch of minor issues: outdated comments, badly formatted code, stupid variable names, etc. One or two buglets, too, but nothing relevant to the case of the missing key.

At this point I made a mistake: I concluded that the error was not in the B-tree code, but in the obnam code using the B-trees. I spent the next couple of days running the test case, adding more logging, and analyzing the results. Because of that mistake, I ignored indicators that would have led me faster into the depths of the B-tree code; without it, I might have saved a day or two.

Eventually I narrowed down the problem to a single B-tree insertion call in the obnam code. The call inserted a new value for a given key into one tree, and the value in another tree changed as well. Quite mysterious.

My B-trees are an implementation of the Rodeh B-trees. All updates are done using copy-on-write, and this allows efficient cloning of trees. However, as an optimization, a node that is used by only one tree at a time can be modified in-place. This speeds things up a lot when the same tree is updated many times.
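
As a minimal sketch of the idea, with invented names and a much simplified node store: cloning a tree shares all nodes and touches only the root's reference count, and a node with a reference count of one may be updated in place.

    class Forest(object):

        '''Keep per-node reference counts for a set of related trees.'''

        def __init__(self):
            self.refcounts = {}  # node id -> number of trees/parents using it

        def clone_tree(self, root_id):
            # A clone initially shares every node with the original tree,
            # so only the shared root gains a reference.
            self.refcounts[root_id] += 1
            return root_id

        def can_modify_in_place(self, node_id):
            # The in-place optimization: if only one tree refers to this
            # node, there is nobody else to preserve the old contents for.
            return self.refcounts[node_id] == 1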

In obnam, each B-tree represents a backup generation. The specific bug I was hunting was that the key I use for storing the unique identifier for a generation changed in two trees at the same time.

Now that I have found and fixed the problem, it is clear that I should have realized at once what was happening. I didn't, so I spent another couple of days adding logging statements to the B-tree insertion code and tracing where things went wrong.

Since B-trees are updated quite heavily by obnam, the log files grew quite big, and the test runs grew quite long. The last run I did resulted in a 94 gigabyte log file, and lasted over 12 hours. less had trouble browsing the log file, probably due to the rather long lines generated by dumps of the B-trees. I ended up splitting the large file into quite a number of small ones, using csplit.

After some more analysis, I figured out the problem: the code that shadowed a B-tree node, i.e., implemented the copy-on-write modification, failed to update the reference counts of the children of a shadowed index node.

The Rodeh B-trees use lazy reference counts: when an index node is shadowed, only the immediate children get reference count updates. This turns out to work quite well, and saves a massive number of updates. Unfortunately, my code was too lazy, and did not update any reference counts.
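
Under the same simplified model as the sketch above, and assuming each node carries node_id and child_ids attributes, the corrected shadowing step looks roughly like this. The point is the loop at the end: copying a shared index node gives every immediate child one more parent, so each child's reference count goes up by one, while grandchildren are left alone. My buggy version skipped that loop.

    import copy

    def shadow_index_node(forest, node, new_node_id):
        # Return a version of the index node that this tree may modify.
        if forest.can_modify_in_place(node.node_id):
            return node
        # The node is shared with another tree: work on a private copy.
        new_node = copy.deepcopy(node)
        new_node.node_id = new_node_id
        forest.refcounts[new_node_id] = 1
        forest.refcounts[node.node_id] -= 1
        # Lazy reference counting: each immediate child now has one more
        # parent (the copy), so bump its count; deeper descendants are
        # not touched until they themselves get shadowed.
        for child_id in new_node.child_ids:
            forest.refcounts[child_id] += 1
        return new_node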

After I had come this far, it only took a couple of minutes to write a unit test to catch the problem, fix the code, run the unit tests successfully, and commit.
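
The unit test I actually wrote is against the real B-tree classes, but a toy version using the Forest and shadow_index_node sketches above shows the shape of it: shadowing a shared index node must leave each immediate child with one more reference.

    import unittest

    class FakeNode(object):
        # Minimal stand-in for an index node, for this sketch only.
        def __init__(self, node_id, child_ids):
            self.node_id = node_id
            self.child_ids = child_ids

    class ShadowRefcountTest(unittest.TestCase):

        def test_shadowing_shared_node_bumps_child_refcounts(self):
            forest = Forest()
            forest.refcounts = {'root': 2, 'a': 1, 'b': 1}
            node = FakeNode('root', ['a', 'b'])
            shadow_index_node(forest, node, 'root2')
            self.assertEqual(forest.refcounts['a'], 2)
            self.assertEqual(forest.refcounts['b'], 2)
            self.assertEqual(forest.refcounts['root'], 1)
            self.assertEqual(forest.refcounts['root2'], 1)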

Then I ran the big test case again, without the extra logging I had added, and the code worked fine.

At least it worked fine for about 24 hours until it ate up all memory in the VM I was running it on, but that's a separate problem.

What can I do to prevent this kind of bug happening in the future? Unit testing didn't catch it, and I have 100% statement coverage with unit tests (not counting 19 excluded statements). Code review by other programmers would surely help. I should perhaps attract others to the project.

Please excuse the length of this blog post. It is a distillation of roughly ten A4 pages of notes I made during this debugging process. (Always make notes when debugging. Your brain can't remember all the details.)