[[!tag debian presentation]]

This is an essay form of a talk I gave today at the Cambridge Mini-debconf. The talk was videoed, so it will presumably show up in the Debconf video archive eventually.

Abstract

Debian has a long and illustrious history. However, some of the things we do perhaps no longer make as much sense as they used to do. It's a new millennium, and we might find better ways of doing things. Things that used to be difficult might now be easy, if we dare look at things from a fresh perspective. I have been doing that at work, for the Baserock project, and this talk is a compilation of observations based on what I've learnt, concentrating on things that affect the development workflow of package maintainers.

Introduction and background

I have been a Debian developer since August, 1996 (modulo a couple of retirements), and have used it a little bit longer. I have done a variety of things for Debian, from maintaining PGP 2 packages to writing piuparts to blathering excessively on the mailing lists.

My day job is to develop the Baserock system at Codethink. Baserock is a set of tools and workflows for developing embedded and appliance Linux systems. If you squint, it looks a bit like a source-based Linux distribution. I have worked on Baserock since September, 2011.

Some of Baserock's design has been influenced by my experience with Debian. With Baserock I have the chance to fix all the things that are wrong in Debian, and this talk is me giving back to Debian by pointing out some of the things I feel should be fixed. I don't have solutions for these problems: this is a bug report, not a patch. It's also perhaps a bit of a rant.

I am specifically concentrating here on technical and tooling issues that affect the development process of Debian. I am excluding social issues.

I am also not trying to get Debian to switch to Baserock. Baserock is targeting embedded and appliance systems, and makes simplifying assumptions based on those targets, which Debian does not get to do. I am pointing out problems, and I am outlining solutions as implemented in Baserock, when I think the concept carries over well to Debian.

Build tools should be intelligent, packaging should be dumb

In Debian, the fundamental tool for compiling the upstream code and assembling a binary package is dpkg-buildpackage. It uses debian/rules, a Makefile whose targets have specific names and semantics. By invoking the right targets in the right order, dpkg-buildpackage gets the source package to build the binary package.
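As a rough illustration, a hand-written debian/rules might look something like this; the target names are the well-known ones from Policy, but the recipe bodies here are only a sketch and leave out most of what a real package needs:

    #!/usr/bin/make -f
    # Illustrative sketch only. dpkg-buildpackage drives the build by
    # invoking targets with these well-known names. (Recipe lines must be
    # indented with a tab.)

    build:
    	./configure --prefix=/usr
    	$(MAKE)

    binary: build
    	$(MAKE) DESTDIR=$(CURDIR)/debian/tmp install
    	# ... assemble the .deb from debian/tmp, with debhelper or dpkg-deb ...

    clean:
    	[ ! -f Makefile ] || $(MAKE) distclean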

On the one hand, this is a nice design, because it abstracts away the large variety of upstream build systems into one API for dpkg-buildpackage to use. On the other hand, it puts all the intelligence for how packages are built into the packages.

Making packaging intelligent, rather than the build tool, means packagers need to do more work, and there's more work to do when, say, the Debian policy changes, or when there are other changes that affect a large number of packages. If packaging is intelligent, then every package needs changes. If the build tool is intelligent, then you change the build tool and re-build everything.

In Baserock we put as much intelligence as we can into the Morph tool, which drives the build process. It turns out that today, unlike in 1995, most upstream projects use one of a handful of well-known build systems: autotools, cpan, cmake, Python distutils, etc. With just a little extra logic in Morph we avoid having any delta against upstream in Baserock. This doesn't work for quite all upstream projects, of course, but we've spent the 2% (two, not twenty) of effort that solves 80% of the problem.

In recent years, the dh approach to packaging has allowed a lot of packages to get by with only a minimal, 3-line debian/rules file. This is excellent. Wouldn't it be nice if even that wasn't needed? It'd save thousands of files in source packages across the archive, at least. It would be easy to do: if the file is missing, dpkg-buildpackage could assume the 3-line version by default.
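For reference, the minimal dh-style debian/rules in question is just this (the last line is indented with a tab):

    #!/usr/bin/make -f
    %:
    	dh $@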

Getting rid of a single file is, of course, not a particularly big win. The big win is the change in mindset: rather than dealing with every new issue in development by adding yet more fields to debian/control and more optional, competing tooling outside the core toolset, if you improve the tools everyone uses, then everyone's packages get better.

The goal should, in my opinion, be that for the large number of packages where upstream uses a well-known, well-behaved build system, and uses it in a reasonably sensible way, the Debian source package should not require anything added to make the package build. There will still be a need to add some stuff, such as the debian/copyright file, to make a good Debian package, but just getting the package to build should require nothing extra. (Side note: wouldn't it be nice if there was a well-known, widely used way to mark up copyright information so that debian/copyright could be constructed and updated automatically?)

Configuration file handling on upgrades: add ucf to dpkg already

In the 1990s, dpkg had excellent handling of configuration files and merging local changes with changes from the new package version, but it was excellent only because it tried to do that at all, and mostly nothing else did. It hasn't changed much since then, and it's not excellent on any absolute scale.

We have the ucf tool, which can do a better job, but it has to be added to each package that wants to use it. Why don't we make dpkg smarter instead? If ucf is not good enough to be merged wholesale into dpkg, let's write something better.

Making every package maintainer use ucf manually is just wasteful. This is not the kind of thing that should be done for each package separately: the package manager should be smart so that packaging can be stupid.
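To make the waste concrete, here is roughly the boilerplate a package using ucf carries in its maintainer scripts; the package and file names are made up for the example:

    #!/bin/sh
    # postinst sketch (mypkg and its paths are illustrative)
    set -e
    if [ "$1" = "configure" ]; then
        ucf /usr/share/mypkg/mypkg.conf /etc/mypkg/mypkg.conf
        ucfr mypkg /etc/mypkg/mypkg.conf
    fi

    #!/bin/sh
    # postrm sketch: undo the registration when the package is purged
    set -e
    if [ "$1" = "purge" ]; then
        ucf --purge /etc/mypkg/mypkg.conf
        ucfr --purge mypkg /etc/mypkg/mypkg.conf
    fi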

The goal should be that dpkg is smart enough in its configuration file handling that having the package do it is a very rare special case.

Clean building shouldn't be hard

The basic tool for building a package is dpkg-buildpackage. It is ever so slightly cumbersome to use, so there are some wrappers, most importantly debuild. However, if you're making a build intended to be uploaded to the Debian archive, you should be doing a clean build. This means having to learn, configure, and use yet more tools.

A clean build is important: security updates, development, debugging, quality control, porting, and user support become more difficult if we don't know how a package was built, and can't reproduce the build. It gets harder to keep build dependencies correct, making it harder for everyone to build things.

Luckily, Debian has solved the clean build problem. Unluckily, it has solved it multiple times, creating the problem of having to choose the clean building approach you want to use. The default way of building is not clean, so then you have to remember to use the non-standard way. You also get to spend time maintaining the clean build environments yourself, since that doesn't seem to be fully automated. None of this is hard, as such, but it's extra friction in the development workflow.

The primary approaches for clean building in Debian seem to be pbuilder, cowbuilder, and sbuild. I happen to use pbuilder myself, because it's what's been around the longest, but I make no claim of having made an informed choice.
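For example, a typical pbuilder workflow looks roughly like this; the package name is made up, and the point is that creating and updating the chroot tarball is a separate, manual chore:

    # one-time setup of the base chroot tarball
    sudo pbuilder create --distribution sid

    # periodic manual maintenance of that tarball
    sudo pbuilder update

    # the actual clean build of a source package
    sudo pbuilder build mypackage_1.0-1.dsc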

That is part of the problem here: why should I have to spend the effort to become informed to make a choice well? Why is the default way of building not clean? Don't say performance: Morph sets up a clean staging area in seconds, and does not offer you a choice of not doing so. It's a chroot, with everything but the build tree hardlinked from cached, unpacked build dependencies, and protected using read-only bind mounts.

What's more, this approach avoids having to maintain the pbuilder base tarballs or sbuild chroots manually. It's all automatic, and up to date, for every build. Furthermore, the staging area contains only the specified build dependencies, and not anything else, meaning a build fails if a build dependency is missing, rather than succeeding because it happens to be in the default set of packages the build tool installs.
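A very rough shell sketch of the idea, not Morph's actual code: unpack build dependencies once into a cache, hardlink them into a throwaway staging directory, and protect them with read-only bind mounts.

    # Sketch only; the paths are illustrative and error handling is omitted.
    mkdir -p /srv/staging
    cp -al /srv/cache/unpacked-build-deps/. /srv/staging/   # hardlinks, not copies
    mount --bind /srv/staging /srv/staging
    mount -o remount,ro,bind /srv/staging
    # ... chroot into /srv/staging and build, with only the declared
    # build dependencies present ...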

Mass building shouldn't be hard

Suppose you want to try out a large-scale change in Debian. It might be trying out a new version of GCC, or using llvm's clang as the default C compiler, or updating glibc, or doing a large library transition such as a new version of GTK+, or trying a new dpkg version, or even something more exploratory such as changing the default optimisation flags for all packages to see what breaks.

All of these will require you to at least rebuild all affected packages. Ideally you'd test the built packages as well, but let's concentrate on the building for now.

Here's what you do. You make the changes you want to try out, and build those packages. You create a dedicated APT repository, and upload your packages there. You configure your build environment to favour that APT repository. You make a list of all reverse build dependencies of the packages you changed. You write a script to build all of those, preferably so that if you change A and rebuild B, you also rebuild C, which build-depends on B. Each rebuilt package you also upload to your APT repository. You keep track of the build log, and success or failure, of each build.
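A rough sketch of the glue this involves; build-rdeps (from devscripts) and pdebuild are real tools, but the package name, paths, and the upload step are only illustrative:

    # list reverse build-dependencies of the changed package, then trim the
    # output down to one package name per line
    build-rdeps mychangedlib > rdeps.txt

    # rebuild each of them in a clean environment, keeping logs
    while read -r pkg; do
        apt-get source "$pkg"
        ( cd "$pkg"-*/ && pdebuild ) > "logs/$pkg.log" 2>&1 \
            || echo "$pkg" >> failed.txt
        # upload the resulting packages to the dedicated APT repository,
        # e.g. with dput, so that later builds can use them
    done < rdeps.txt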

The people in Debian who do this kind of stuff regularly presumably have tools for doing it. It shouldn't be a rare, special thing, though. If my package has reverse build dependencies, I should at least consider test-building them when I'm making changes. Otherwise, it might take years until the reverse build dependencies are rebuilt, and the problem is only found then, making it harder to fix.

To be fair, building a lot of packages takes a lot of resources. It's not feasible to rebuild everything in Debian every time there's any change to, say, eglibc. However, it's feasible to do it, for large subsets of the archive, without huge hardware investments.

One VCS to rule them all

In 1995 there was really only one relevant version control system: CVS. It was not a great tool. In 2000, another contender existed: Subversion. It fixed some problems in CVS, but still wasn't a great tool. In 2005, there was a great upheaval and distributed version control systems started to become mainstream. There were a large handful of them. In 2010, it was becoming pretty clear that git had won. It's ugly, but it's powerful.

I'm not going to debate the relative merits of different version control systems. Until recently, I was a Bazaar boy, and all of my personal projects were kept in Bazaar. (I have recently switched everything to git.)

There are, however, strong benefits from everyone using the same system. Developers don't need to learn a dozen version control systems. Tools that operate on many repositories are easier to write and maintain. Workflows become simpler if one system can be assumed.

Debian has a strong historical tendency to choose every option. This is sometimes a good thing, and sometimes a bad thing. For keeping source packages in version control, I believe it to be a bad thing. The status quo is that a Debian source package may not be in version control at all, or it might be in any version control system.

This is acceptable when everyone only ever needs to maintain their own packages. However, in a distribution the size of Debian, that is not the case. NMUs, security support, archive-wide transitions, and other situations arise in which I might need to change your packages, or you might need to change mine.

We try to work around this by having a complicated source package format, using quilt to maintain patches to upstream semi-manually in a debian/patches directory. This is an awkward workflow. It's a workflow that trips up those who are not used to it. (I know quilt is a patch management system, not a version control system. I think git does it much better anyway.)

It would be oh so much easier if everyone kept their source packages in the same, real version control system. I don't even care what it is, as long as it is powerful enough to handle the use cases we have.

Imagine a world where every Debian source package is kept, for argument's sake, in git, and everyone also uses the same layout and roughly the same workflow to maintain it. What would this mean?

It would mean that if you want to inspect the history of your package, you know how to do that. If you want to merge in some bugfix from upstream code, you know how to do that, without having to figure out which of the several source package formats is in use.

It would make feasible the development of more powerful, higher-level tooling. For example, it would allow Debian to have what we call system branches in Baserock. In Debian we have stable, testing, unstable, and experimental. We may get something like Ubuntu's PPAs, or perhaps an improved version of those. These are very poor versions of system branches, just like quilt is a poor way to manage patches and deltas against upstream. For example, you can upload an experimental version of gcc to experimental, but then nobody else can upload another experimental version. You can set up your own PPA for this, but you'll still be affected by all the uploads to unstable while you're working.

A Baserock system branch is a branch of the entire system, or the entire distribution in the case of Debian. It is isolated from other system branches. A branch in an individual repository is a well-known concept. A system branch is conceptually like branching every repository in the distribution at once. The actual implementation is more efficient, of course.
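Conceptually, and only conceptually, it is as if you ran something like this across a checkout of every source repository in the distribution:

    # Purely illustrative: no tool literally does this, but the effect of a
    # system branch is as if every repository got the same branch at once.
    for repo in */; do
        ( cd "$repo" && git checkout -b my-experiment )
    done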

This would be possible to implement without standardising on one version control system, but it would be much harder, and it would have to live with the lowest common denominator for features. CVS and Subversion, for example, don't really do merges, whereas Bazaar, Mercurial, and git do. Possible does not mean feasible.

Any work you do in a system branch is isolated. Your work doesn't affect others, and theirs doesn't affect yours, until a merge happens. This is a simple, but very powerful tool.

Cheap system branches and powerful merging make it possible to do experiments safely, with little fuss. Combine that with being able to build everything cleanly and quickly, and we get into a situation where there's no need to make technical decisions based on arguments on mailing lists; instead, they can be made by looking at working code.

I don't know how this could be implemented in Debian, but think about it. If Debian could have this, it might make many archive-scale changes easier.

debian/rules clean: really?

One of the silliest things we require of packages is that they have a debian/rules clean rule that cleans up after a build perfectly so that we can do repeated builds in the same source tree.

Let's just use git clean -fdx instead.

This is a problem that is superbly well suited for automation. There is no point whatsoever in making packagers do any manual work for this.

Large scale semi-mechanical changes require too much effort

About a decade ago, we decided to follow a new version of the Filesystem Hierarchy Standard and transition from /usr/doc to /usr/share/doc. This was an almost entirely mechanical change: in many cases, a mere rebuild would fix it, and in almost every other case it was just a minor tweak to the packaging. A one-line change. It took us seven years to do this.

Seven years. Think about it.

In a recent discussion about building R binary data files from source at package build time it was suggested that we take 2-3 release cycles to get this done. That's four to six years. Think about it.

These are not isolated cases. Every time we need to make a change that affects more than a small handful of packages, it becomes a major undertaking. Most of the time, all the people involved are agreeable to the change, and welcome it. The change takes a long time because it requires co-ordinating a large number of people. Sometimes people are just busy. Sometimes they've left the project, but haven't properly orphaned their packages. Waiting for a timeout for answers about packages drags the process out even longer.

Mechanical or semi-mechanical changes, which are very easy and very quick to do, should not take years. They should take an evening, no more.

There's no end of changes we might want to do like this. In 2005 and 2006 I filed a few hundred bugs from failed piuparts runs. They're still being fixed, even when the fix is simple, such as adding a postrm script to remove, when the package is purged, a configuration file created by postinst, or starting the service with invoke-rc.d rather than running an init.d script directly. Mostly one-line changes.
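For instance, the postrm fix mentioned above is typically no more than this (the package and file names are made up):

    #!/bin/sh
    # postrm sketch: remove, on purge, a configuration file that postinst
    # generated, so that nothing is left behind.
    set -e
    if [ "$1" = "purge" ]; then
        rm -f /etc/mypkg/generated.conf
    fi

    # and in maintainer scripts, use
    #     invoke-rc.d mypkg restart
    # instead of running /etc/init.d/mypkg restart directly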

There are more mechanical changes that might happen. For example, changing the value of the VCS-Browser field when the Debian version control server changes domain names.

It's not just about simple, mechanical changes, either. Transitions of important library packages, for example, which require changes to reverse dependencies due to a changed API, are another source of pain. What should be a few evenings of build fixes can drag out to months of co-ordination.

This is caused partly by technical issues and partly by social issues. The main social issue is that we have quite a strong sense of package ownership, and NMUs are to be done carefully, and only in some cases. This is both good and bad, and I won't discuss that aspect now. The technical issue is that our tools are primarily intended for maintaining individual packages, rather than all of them together, making it harder to make even simple changes in a large number of packages.

In addition to easy mass building and system branches, as outlined above, large-scale changes would require testing tools, so that you don't just build the new package versions but also test things automatically. Essentially, CI at the distribution level.

Conclusion

I've listed above a small variety of problems I see in the Debian development processes and tools. They're not the important part of this talk. The important part is that we, the Debian developers, should look at our tools and workflows critically, and improve them when we can. Even small improvements are very useful when they affect each of our twenty thousand source packages. The important change I argue for here is one in mindset, rather than a fix for any specific problem in any tool. We need a mindset of constant, incremental improvement for our tools.