On the cost analysis of dependencies

[[!tag programming obnam]]

A question that I'm asked repeatedly recently is why I chose not to use an existing library for serialising data structures in Obnam. This blog post is the answer.

Obnam is a backup program, and it needs to store various data about files. This includes stat(2) information about each file in the live data, as well as data Obnam needs to keep track of everything. At run-time, Obnam keeps this data in memory data structures, such as Python dicts. For storage, these data structures need to be converted, serialised, to and from streams of bytes.

The are a variety of libraries for doing this, designed for different purposes and with their own constraints and pitfalls. Python's standard library comes with the cPickle library, for example, but its serialisation format is not guaranteed to be compatible with any other version of Python.

For Obnam, I need something that will last a long time. I do not want to have to deal with a library changing its serialisation format, as that would mean either that Obnam can't handle old backups, or that I need to start maintaining the old version of the library.

A way to look at this is that any dependencies your software have a cost, and that cost should be smaller than the benefit you get from them.

For example, Obnam depends on the paramiko library to implement the SSH protocol. This library has some cost, and I've run into one or two bugs in it that have been rather unfortunate. However, the benefit it brings is huge: I don't have to implement SSH myself. I'm happy to have Obnam depend on paramiko.

For the serialisation thing, I wrote my own library, after a small about of research into existing ones. Research time is a cost, too.

Mine is somewhat Obnam specific, in that it can make some assumptions about the data to be serialised, and this allows a simpler library. A generic library would have to handle a number of special cases that mine can ignore.

It took me less than an hour to write this twice. I first wrote a quick prototype and a little microbenchmark to see if my approach would be feasible. Then I deleted that code, and started from scratch, TDD style, to make sure the code was reliable. The cost of writing my own serialisation code was less than the cost of finding, let alone evaluating existing libraries.

It may be that my own library turns out to be inadequate. Then, and only then, is when I start researching other libraries. Until then, I'll avoid the cost of research to find a suitable library, the cost of learning the chosen one, the cost of integrating it into Obnam, the cost to porters of Obnam of dealing with a new dependency, and the risk of the library changing in ways that are unsuitable for Obnam.

Obviously, writing your own code has costs, too. Designing and implementing a library is a cost, as is maintaining it (debugging, changes in requirements, etc).

Write your own or use existing code? It's a cost/benefit analysis. There's no clear one answer that's always correct.