[[!tag debian linux baserock]]

Executive summary

Debian should consider:

  • less fine-grained software packaging for overall simplification of the system and its development
  • less flexibility at the lower levels of the software stack, again for overall simplicity
  • reducing friction in the development workflow, especially by making it possible to branch and merge at the whole system level

Introduction

I've taken a job with Codethink, as part of a team to develop a new embedded Linux system, called Baserock. Baserock is not a distribution as such, but deals with many of the same issues. I thought it might be interesting to write out my new thoughts related to this, since they might be useful for proper distributions to think about too.

I'll hasten to add that many of these thoughts are not originally mine. I shamelessly adopt any idea that I think is good. The core ideas for Baserock come from Rob Taylor, Daniel Silverstone, and Paul Sherwood, my colleagues and bosses at Codethink. At the recent GNOME Summit in Montreal, I was greatly influenced by Colin Walters.

This is also not an advertisement for Baserock, but since much of my current thinking comes from that project, I'll discuss things in the context of Baserock.

Finally, I'm writing this to express my personal opinion and thoughts. I'm not speaking for anyone else, least of all Codethink.

On the package abstraction

Baserock abandons the idea of packages for all individual programs. In the early 1990s, when Linux was new, and the number of packages in a distribution could be counted on two hands in binary, packages were a great idea. It was feasible to know at least something about every piece of software, and pick the exact set of software to install on a machine.

It is still feasible to do so, but only in quite restricted circumstances. For example, picking the packages for a DNS server, an NFS server, or a mail server by hand, without using meta packages or tasks (in Debian terminology), is still quite possible. On embedded devices, there's also usually only a handful of programs installed, and the people doing the install can be expected to understand all of them and decide which ones to install on each specific device.

However, those are the exceptions, and they're getting rarer. For most people, manually picking software to install is much too tedious. In Debian, we realised this many years ago, and developed meta packages (whose only purpose is to depend on other packages) and tasks (which solve the same problem, but differently). These make it possible for a user to say "I want to have the GNOME desktop environment", without having to find every piece that belongs in GNOME and install each one separately.

For much of the past decade, computers have had sufficient hard disk space that it is no longer necessary to be quite so picky about what to install. A new cheap laptop will now typically come with at least 250 gigabytes of disk space. An expensive one, with an SSD, will have at least 128 gigabytes. A fairly complete desktop install uses less than ten gigabytes, so there's rarely a need to pick and choose between the various components.

From a usability point of view, choosing from a list of a dozen or two options is much easier than from a list of thirty-five thousand (the number of Debian packages as I write this).

This is one reason why Baserock won't be using the traditional package abstraction. Instead, we'll collect programs into larger collections, called strata, which form some kind of logical or functional whole. So there'll be one stratum for the core userland software: init, shell, libc, perhaps a bit more. There'll be another for a development environment, one for a desktop environment, etc.
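
To make the idea a bit more concrete, here is a rough sketch of what a stratum description might look like, written as plain Python data. The format, field names, components, and repositories here are invented for illustration; they are not the actual Baserock format.

    # Illustrative only: a stratum groups whole upstream projects into one
    # logical unit, instead of one package per program. The fields and
    # component names here are made up for the example.

    CORE_STRATUM = {
        "name": "core-userland",
        "description": "Minimal userland: init, shell, libc.",
        "components": [
            {"name": "busybox", "repo": "git://git.example.org/busybox.git", "ref": "master"},
            {"name": "eglibc", "repo": "git://git.example.org/eglibc.git", "ref": "2.13-branch"},
            {"name": "sysvinit", "repo": "git://git.example.org/sysvinit.git", "ref": "master"},
        ],
    }

    DEVEL_STRATUM = {
        "name": "development",
        "description": "Toolchain and build tools, layered on core-userland.",
        "build_depends": ["core-userland"],
        "components": [
            {"name": "gcc", "repo": "git://git.example.org/gcc.git", "ref": "gcc-4_6-branch"},
            {"name": "make", "repo": "git://git.example.org/make.git", "ref": "master"},
        ],
    }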

Another, equally important, reason to move beyond packages is the problems caused by the combinatorial explosion of packages and versions. Colin Walters talks about this very well. When every system has a fairly unique set of packages, and versions of them, it becomes much harder to ensure that software works well for everyone, that upgrades work well, and that any problems get solved. When the number of possible package combinations is small, getting the interactions between various packages right is easier, QA has a much easier time testing all upgrade paths, and manual test coverage improves a lot when everyone is testing the same versions. Even debugging gets easier, when everyone can easily run the same versions.

Grouping software into bigger collections does reduce flexibility in what gets installed. In some cases this is important: very constrained embedded devices, for example, still need to be very picky about what software gets installed. However, even for them, the price of flash storage is low enough that it might not matter much anymore. The benefit of a simpler overall system may well outweigh the benefit of fine-grained software choice.

Everything in git

In Baserock, we'll be building everything from source in git. It will not be possible to build anything unless the source is committed. This will allow us to track, for each binary blob we produce, the precise sources that were used to build it.

We will also try to achieve something a bit more ambitious: anything that affects any bit in the final system image can be traced to files committed to git. This means also tracking all configuration settings for the build, and the whole build environment, in git.
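
One way to picture this, as a sketch rather than the actual mechanism: if every input to a build is a git commit, then a built artifact can be identified by a hash over exactly those commits.

    # Sketch: when sources, build configuration, and build environment all
    # live in git, a build artifact can be named by a hash of the commits it
    # was built from. Illustrative only, not Baserock's actual scheme.

    import hashlib

    def artifact_id(source_commits, config_commit, environment_commit):
        """Return a stable identifier for a build whose inputs are all git commits.

        source_commits maps component name to commit SHA-1; the other two
        arguments are the commit SHA-1s of the build configuration and build
        environment repositories.
        """
        h = hashlib.sha256()
        for name in sorted(source_commits):
            h.update(f"{name}={source_commits[name]}\n".encode())
        h.update(f"config={config_commit}\n".encode())
        h.update(f"environment={environment_commit}\n".encode())
        return h.hexdigest()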

This is important for us so that we can reproduce an image used in the field. When a customer is deploying a specific image, and needs it to be changed, we want the new image to differ from the previous version as little as possible. This requires that we can re-create the original image, from source, bit for bit, so that when we make the actual change, only that change affects the image.

We will make it easy to branch and merge not just individual projects, but the whole system. This will make it easy to do large changes to the system, such as transitioning to a new version of GNOME, or the toolchain. Currently, in Debian, such large changes need to be serialised, so that they do not affect each other. It is easy, for example, for a GNOME transition to be broken by a toolchain transition.

Branching and merging has long been considered the best available solution for concurrent development within a project. With Baserock, we want to have that for the whole system. Our build servers will be able to build images for each branch, without requiring massive hardware investment: any software that is shared between branches only gets built once.
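
Here is a sketch of why this is cheap on the build servers: if build results are cached by the commit they were built from, a component that is identical on two branches is built only once, and every other branch simply gets a cache hit. The cache location and naming here are assumptions for illustration.

    # Illustrative cache lookup: branches share build results for any
    # component whose source commit (and thus cache key) is the same.

    import os
    import shutil

    CACHE_DIR = "/srv/build-cache"   # hypothetical location

    def build_component(name, commit, do_build):
        """Build `name` at `commit`, reusing a cached artifact when one exists.

        `do_build` is a callable that performs the real build and returns the
        path of the resulting artifact.
        """
        cached = os.path.join(CACHE_DIR, f"{name}-{commit}.tar.gz")
        if os.path.exists(cached):
            return cached                # some other branch already built this
        artifact = do_build(name, commit)
        shutil.copy(artifact, cached)
        return cached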

Launchpad PPAs and similar solutions provide many of the benefits of branching and merging on the system level. However, they're much more work than "git checkout -b gnome-3.4-transition". I believe that the git based approach will make concurrent development much more efficient. Ask me in a year if I was right.

Git, git, and only git

There are a lot of version control systems in active use. For the sake of simplicity, we'll use only git. When an upstream project uses something else, we'll import their code into git. Luckily, there are tools for that. The import and updates to it will be fully automatic, of course.

Git is not my favorite version control system, but it's clearly the winner. Everything else will eventually fade away into obscurity. Or that's what we think. If it turns out that we're wrong about that, we'll switch to something else. However, we do not intend to have to deal with more than one at a time. Life's too short to use all possible tools at the same time.

Tracking upstreams closely

We will track upstream version control repositories, and we will have an automatic mechanism for building our binaries directly from git. This will, we hope, make it easy to follow upstream development closely: when, say, GNOME developers make commits, we want to be able to generate a new system image which includes those changes the same day, if not within minutes, rather than waiting days, weeks, or months.

This kind of closeness is greatly enhanced by having everything in version control. When upstream commits changes to their version control system, we'll mirror them automatically, and this then triggers a new system image build. When upstream makes changes that do not work, we can easily create a branch from any earlier commit, and build images off that branch.
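
As a minimal sketch of that loop, a watcher could poll the upstream mirrors and kick off an image build whenever a branch head moves. A real setup would more likely use commit hooks instead of polling, and the repository named here is invented.

    # Poll upstream git repositories and trigger a system image build when a
    # branch head changes. Purely illustrative; hooks would avoid the polling.

    import subprocess
    import time

    UPSTREAMS = {
        "gnome-shell": "git://git.example.org/gnome-shell.git",   # made up
    }

    def head_of(url, branch="master"):
        """Return the commit that `branch` currently points at in `url`."""
        out = subprocess.check_output(["git", "ls-remote", url, branch])
        return out.split()[0].decode()

    def trigger_image_build(component, commit):
        print(f"building new system image with {component} at {commit}")
        # ... hand the job over to the build service here ...

    def watch(interval=300):
        seen = {}
        while True:
            for component, url in UPSTREAMS.items():
                commit = head_of(url)
                if seen.get(component) != commit:
                    seen[component] = commit
                    trigger_image_build(component, commit)
            time.sleep(interval)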

This will, we hope, also make it simpler to make changes, and give them back to upstream. Whenever we change anything, it'll be done in a branch, and we'll have system images available to test the change. So not only will upstream be able to easily get the change from our git repository, they'll also be able to easily verify, on a running system, that the change fixes the problem.

Automatic testing, maybe even test driven development

We will be automatically building system images from git commits for Baserock. This will potentially result in a very large number of images. We can't possibly test all of them manually, so we will implement some kind of automatic testing. The details of that are still under discussion.
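
One possible shape for such a test, purely as a sketch: boot each freshly built image under qemu and treat a login prompt appearing on the serial console as a passing smoke test. The qemu invocation and the prompt string are assumptions for illustration.

    # Hypothetical smoke test: boot the image, watch the serial console, and
    # pass if a login prompt appears before the timeout. A real harness would
    # enforce the timeout more strictly and run proper test suites afterwards.

    import subprocess
    import time

    def smoke_test(image_path, timeout=120):
        qemu = subprocess.Popen(
            ["qemu-system-x86_64", "-m", "512", "-nographic",
             "-drive", f"file={image_path},format=raw"],
            stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
        deadline = time.time() + timeout
        buf = b""
        passed = False
        try:
            while time.time() < deadline:
                chunk = qemu.stdout.read1(1024)   # at most one underlying read
                if not chunk:
                    break                          # qemu exited early
                buf += chunk
                if b"login:" in buf:
                    passed = True
                    break
        finally:
            qemu.kill()
        return passed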

I hope to be able to start adding some test driven development to Baserock systems. In other words, when we are requested to make changes to the system, I want the specification to be provided as executable tests. This will probably be impossible in real life, but I can hope.

I've talked about doing the same thing for Debian, but it's much harder to push through such changes in an established, large project.

Solving the install, upgrade, and downgrade problems

All mainstream Linux distributions are based on packages, and they all, pretty much, do installations and upgrades by unpacking packages onto a running system, and then maybe running some scripts from the packages. This works well for a completely idle system, but not so well on systems that are in active use. Colin Walters again talks about this.

For installation of new software, the problem is that someone or something may invoke it before it is fully configured by the package's maintainer script. For example, a web application might unpack in such a way that the web server notices it, and a web browser may request the web app to run before it is configured to connect to the right database. Or a GUI program may unpack a .desktop file before the executable or its data files are unpacked, and a user may notice the program in their menu and start it, resulting in an error.

Upgrades suffer from additional problems. Software that gets upgraded may be running during the upgrade. Should the package manager replace the software's data files with new versions, which may be in a format that the old program does not understand? Or install new plugins that will cause the old version of the program to segfault? If the package manager does that, users may experience turbulence without having put on their seat belts. If it doesn't do that, it can't install the package, or it needs to wait, perhaps for a very long time, for a safe time to do the upgrade.

These problems have usually been either ignored, or solved by using package specific hacks. For example, plugins might be stored in a directory that embeds the program's version number, ensuring that the old version won't see the new plugins. Some people would like to apply installs and upgrades only at shutdown or bootup, but that has other problems.

None of the hacks solve the downgrade problem. The package managers can replace a package with an older version, and often this works well. However, in many cases, any package maintainer scripts won't be able to deal with downgrades. For example, they might convert data files to a new format or name or location upon upgrades, but won't try to undo that if the package gets downgraded. Given the combinatorial explosion of package versions, it's perhaps just as well that they don't try.

For Baserock, we absolutely need to have downgrades. We need to be able to go back to a previous version of the system if an upgrade fails. Traditionally, this has been done by providing a "factory reset", where the current version of the system gets replaced with whatever version was installed in the factory. We want that, but we also want to be able to choose other versions, not just the factory one. If a device is running version X, and upgrades to X+1, but that version turns out to be a dud, we want to be able to go back to X, rather than all the way back to the factory version.

The approach we'll be taking with Baserock relies on btrfs subvolumes and snapshots. Each version of the system will be installed in a separate subvolume, which gets cloned from the previous version, using copy-on-write to conserve space. We'll make the bootloader able to choose a version of the system to boot, and (waving hands here) add some logic to automatically revert to the previous version when necessary. We expect this to work better and more reliably than the current package-based approach.
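
As a rough sketch of the snapshot part (the bootloader integration is the hand-waving part, and is left out here), each version is a subvolume, and a new version starts life as a copy-on-write snapshot of the previous one. The paths are assumptions for illustration.

    # Each system version is a btrfs subvolume; creating version N+1 is a
    # copy-on-write snapshot of version N, and rolling back means booting the
    # old subvolume again. Paths are invented for the example.

    import subprocess

    SYSTEMS = "/systems"   # hypothetical mount point, one subvolume per version

    def create_version(previous, new):
        """Create a new system version as a snapshot of an existing one."""
        subprocess.check_call(
            ["btrfs", "subvolume", "snapshot",
             f"{SYSTEMS}/{previous}", f"{SYSTEMS}/{new}"])

    def delete_version(version):
        """Remove a version that's no longer wanted, e.g. a failed upgrade."""
        subprocess.check_call(
            ["btrfs", "subvolume", "delete", f"{SYSTEMS}/{version}"])

    # Upgrading from X to X+1: create_version("X", "X+1"), unpack the new
    # build into /systems/X+1, and point the bootloader at it. If X+1 turns
    # out to be a dud: boot X again and delete_version("X+1").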

Making choices

Debian is pretty bad at making choices. Almost always, when faced with a need to choose between alternative solutions for the same problem, we choose all of them. For example, we support pretty much every init implementation, various implementations of /bin/sh, and we even have at least three entirely different kernels.

Sometimes this non-choice is a good thing. Our users may need features that only one of the kernels supports, for example. And we certainly need to be able to provide both mysql and postgresql, since various software we want to provide to our users needs one and won't work with the other.

At other times, the inability to choose causes trouble. Do we really need to support more than one implementation of /bin/sh? By supporting both dash and bash for that, we double the load on testing and QA, and introduce yet another variable to deal with in any debugging situation involving shell scripts.

Especially for core components of the system, it makes sense to limit the flexibility of users to pick and choose. Combinatorial explosion déjà vu. Every binary choice doubles the number of possible combinations that need to be tested and supported and checked during debugging.

Flexibility begets complexity, complexity begets problems.

This is less of a problem at the upper levels of the software stack. At the very top level, it doesn't really matter if there are many choices. If a user can freely choose between vi and Emacs, this does not add complexity at the system level, since nothing else is affected by that choice. However, if we were to add a choice between glibc, eglibc, and uClibc for the system C library, then everything else in the system would need to be tested three times rather than once.

Reducing the friction coefficient for system development

Currently, a Debian developer takes upstream code, adds packaging, perhaps adds some patches (using one of several methods), builds a binary package, tests it, uploads it, and waits for the build daemons and the package archive and user-testers to report any problems.

That's quite a number of steps to go through for the simple act of adding a new program to Debian, or updating it to a new version. Some of it can be automated, but there are still hoops to jump through.

Friction does not prevent you from getting stuff done, but the more friction there is, the more energy you have to spend to get it done. Friction slows down the crucial hack-build-test cycle of software development, and that hurts productivity a lot. Every time a developer has to jump through any hoops, or wait for anything, he slows down.

It is, of course, not just a matter of the number of steps. Debian requires a source package to be uploaded with the binary package. Many, if not most, packages in Debian are maintained using version control systems. Having to generate a source package and then wait for it to be uploaded is unnecessary work. The build daemon could get the source from version control directly. With signed commits, this is as safe as uploading a tarball.
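
As a hedged sketch of what that could look like, a build daemon might clone the packaging repository and check a GPG signature before building anything. The repository URL and tag naming here are invented, and a setup based on signed commits rather than signed tags would look much the same.

    # Sketch: fetch source straight from version control and refuse to build
    # unless the signature checks out. Illustrative only.

    import subprocess
    import tempfile

    def fetch_and_verify(repo_url, tag):
        """Clone `repo_url`, verify the GPG signature on `tag`, and return the
        working tree path. Raises CalledProcessError if anything fails."""
        workdir = tempfile.mkdtemp(prefix="build-")
        subprocess.check_call(["git", "clone", "--branch", tag, repo_url, workdir])
        subprocess.check_call(["git", "verify-tag", tag], cwd=workdir)
        return workdir

    # e.g. fetch_and_verify("git://git.example.org/hello.git", "debian/2.8-1")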

The above examples are specific to maintaining a single package. The friction that really hurts Debian is the friction of making large-scale changes, or changes that affect many packages. I've already mentioned the difficulty of making large transitions above. Another case is making policy changes, and then implementing them. An excellent example of that in Debian is the policy change to use /usr/share/doc for documentation, instead of /usr/doc. This took us many years to do. We are, I think, perhaps a little better at such things now, but even so, it is something that should not take more than a few days to implement, rather than half a decade.

On the future of distributions

Occasionally, people say things like "distributions are not needed", or that "distributions are an unnecessary buffer between upstream developers and users". Some even claim that there should only be one distribution. I disagree.

A common view of a Linux distribution is that it takes some source provided by upstream, compiles that, adds an installer, and gives all of that to the users. This view is too simplistic.

The important part of developing a distribution is choosing the upstream projects and their versions wisely, and then integrating them into a whole system that works well. The integration part is particularly important. Many upstreams are not even aware of each other, nor should they need to be, even if their software may need to interact. For example, not every developer of HTTP servers should need to be aware of every web application, or vice versa. (If they had to be, it'd be a combinatorial explosion that'd ruin everything, again.)

Instead, someone needs to set a policy of how web apps and web servers interface, what their common interface is, and what files should be put where, for web apps to work out of the box, with minimal fuss for the users. That's part of the integration work that goes into a Linux distribution. For Debian, such decisions are recorded in the Policy Manual and its various sub-policies.

Further, distributions provide quality assurance, particularly at the system level. It's not realistic to expect most upstream projects to do that. It's a whole different skillset and approach that is needed to develop a system, rather than just a single component.

Distributions also provide user support, security support, longer term support than many upstreams, and port software to a much wider range of architectures and platforms than most upstreams actively care about, have access to, or even know about. In some cases, these are things that can and should be done in collaboration with upstreams; if nothing else, portability fixes should be given back to upstreams.

So I do think distributions have a bright future, but the way they're working will need to change.