[[!tag debian presentation]]
This is an essay form of a talk I gave today at the Cambridge Mini-debconf. The talk was videoed, so it will presumably show up in the Debconf video archive eventually.
Abstract
Debian has a long and illustrious history. However, some of the things we do perhaps no longer make as much sense as they used to do. It's a new millennium, and we might find better ways of doing things. Things that used to be difficult might now be easy, if we dare look at things from a fresh perspective. I have been doing that at work, for the Baserock project, and this talk is a compilation of observations based on what I've learnt, concentrating on things that affect the development workflow of package maintainers.
Introduction and background
I have been a Debian developer since August 1996 (modulo a couple of retirements), and have used it a little bit longer. I have done a variety of things for Debian, from maintaining PGP 2 packages to writing piuparts to blathering excessively on the mailing lists.
My day job is to develop the Baserock system at Codethink. Baserock is a set of tools and workflows for developing embedded and appliance Linux systems. If you squint, it looks a bit like a source-based Linux distribution. I have worked on Baserock since September, 2011.
Some of Baserock's design has been influenced by my experience with Debian. With Baserock I have the chance to fix all the things that are wrong in Debian, and this talk is me giving back to Debian by pointing out some of the things I feel should be fixed. I don't have solutions for these problems: this is a bug report, not a patch. It's also perhaps a bit of a rant.
I am specifically concentrating here on technical and tooling issues that affect the development process of Debian. I am excluding social issues.
I am also not trying to get Debian to switch to Baserock. Baserock is targeting embedded and appliance systems, and makes simplifying assumptions based on those targets, which Debian does not get to do. I am pointing out problems, and I am outlining solutions as implemented in Baserock, when I think the concept carries over well to Debian.
Build tools should be intelligent, packaging should be dumb
In Debian, the fundamental tool for compiling the upstream code and assembling a binary package is dpkg-buildpackage. It uses debian/rules, which is a Makefile whose targets have specific names and semantics. By executing the right targets in the right order, dpkg-buildpackage tells the source package to build the binary package.
On the one hand, this is a nice design, because it abstracts away the large variety of upstream build systems into one API for dpkg-buildpackage to use. On the other hand, it puts all the intelligence for how packages are built into the packages.
Putting the intelligence into the packaging, rather than into the build tool, means packagers need to do more work, and there's more work to do when, say, the Debian policy changes, or when there are other changes that affect a large number of packages. If packaging is intelligent, then every package needs changes. If the build tool is intelligent, then you change the build tool and re-build everything.
In Baserock we put as much intelligence as we can into the Morph tool, which drives the build process. It turns out that today, unlike in 1995, most upstream projects use one of a handful of well-known build systems: autotools, cpan, cmake, Python distutils, etc. With just a little extra logic in Morph we avoid having any delta against upstream in Baserock. This doesn't work for quite all upstream projects, of course, but we've spent the 2% (two, not twenty) of effort that solves 80% of the problem.
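To give a rough idea of what this means in practice, here is a sketch (not Morph's actual logic) of how a build tool can recognise a well-known upstream build system from its usual marker files and run the corresponding standard commands:

    # A rough sketch, not Morph's actual logic: detect a well-known
    # upstream build system and run the usual commands for it.
    # $DESTDIR is assumed to point at the staging install directory.
    if [ -e configure.ac ] || [ -e configure.in ]; then
        autoreconf -ivf && ./configure && make && make install DESTDIR="$DESTDIR"
    elif [ -e CMakeLists.txt ]; then
        cmake . && make && make install DESTDIR="$DESTDIR"
    elif [ -e setup.py ]; then
        python setup.py install --root="$DESTDIR"
    elif [ -e Makefile.PL ]; then
        perl Makefile.PL && make && make install DESTDIR="$DESTDIR"
    else
        echo "unknown build system, needs explicit build instructions" >&2
        exit 1
    fi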
In recent years, the dh approach to packaging has allowed a lot of packages to get by with only a minimal, 3-line debian/rules file. This is excellent. Wouldn't it be nice if even that wasn't needed? It would save thousands of files in source packages across the archive, at least. It would be easy to do: if the file is missing, have dpkg-buildpackage assume the 3-line version by default.
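For the record, the minimal dh-style debian/rules file is just this (it is a Makefile, so the last line must be indented with a tab):

    #!/usr/bin/make -f
    %:
    	dh $@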
Getting rid of a single file is, of course, not a particularly big win. The big win is the change in mindset: rather than dealing with every new issue in development by adding yet more fields to debian/control and more optional, competing tooling outside the core toolset, if you improve the tools everyone uses, then everyone's packages get better.
The goal should, in my opinion, be that for the large number of packages where upstream uses a well-known, well-behaved build system, and uses it in a reasonably sensible way, the Debian source package should not require anything added to make the package build. There will still be a need to add some stuff, such as the debian/copyright file, to make a good Debian package, but just getting the package to build should require nothing extra. (Side note: wouldn't it be nice if there was a well-known, widely used way to mark up copyright information so that debian/copyright could be constructed and updated automatically?)
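One building block that points in that direction is SPDX licence tags in upstream source files. A crude, illustrative sketch of harvesting them might look like this; it only covers licences, not copyright holders, and assumes upstream uses SPDX tags at all:

    # Illustrative sketch only: harvest SPDX-License-Identifier tags from
    # an upstream tree as raw material for a draft debian/copyright.
    grep -rIh 'SPDX-License-Identifier:' . \
        | sed 's/.*SPDX-License-Identifier:[[:space:]]*//' \
        | sort | uniq -c | sort -rn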
Configuration file handling on upgrades: add ucf to dpkg already
In the 1990s, dpkg had excellent handling of configuration files and of merging local changes with changes from the new package version, but it was excellent only because it tried to do that at all, and mostly nothing else did. It hasn't changed much since, and it's not excellent on any absolute scale.
We have the ucf tool, which can do a better job, but it has to be added to each package that wants to use it. Why don't we make dpkg smarter instead? If ucf is not good enough to be merged wholesale into dpkg, let's write something better.
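To be concrete, the per-package boilerplate looks roughly like this in a postinst; the package name and paths here are made up for illustration:

    #!/bin/sh
    set -e
    # Illustrative postinst sketch; "foo" and the paths are made up.
    # ucf merges local changes into the shipped default on upgrade, and
    # ucfr records which package owns the file.
    if [ "$1" = "configure" ]; then
        ucf /usr/share/foo/foo.conf.dist /etc/foo/foo.conf
        ucfr foo /etc/foo/foo.conf
    fi
    #DEBHELPER#
    exit 0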
Making every package maintainer use ucf manually is just wasteful. This is not the kind of thing that should be done for each package separately: the package manager should be smart so that packaging can be stupid.
The goal should be that dpkg is smart enough in its configuration file handling that having the package do it is a very rare special case.
Clean building shouldn't be hard
The basic tool for building a package is dpkg-buildpackage. It is ever so slightly cumbersome to use, so there are some wrappers, most importantly debuild. However, if you're making a build intended to be uploaded to the Debian archive, you should be doing a clean build. This means having to learn, configure, and use yet more tools.
A clean build is important: security updates, development, debugging, quality control, porting, and user support become more difficult if we don't know how a package was built, and can't reproduce the build. It gets harder to keep build dependencies correct, making it harder for everyone to build things.
Luckily, Debian has solved the clean build problem. Unluckily, it has solved it multiple times, creating the problem of having to choose the clean building approach you want to use. The default way of building is not clean, so then you have to remember to use the non-standard way. You also get to spend time maintaining the clean build environments yourself, since that doesn't seem to be fully automated. None of this is hard, as such, but it's extra friction in the development workflow.
The primary approaches for clean building in Debian seem to be pbuilder, cowbuilder, and sbuild. I happen to use pbuilder myself, because it's what's been around the longest, but I make no claim of having made an informed choice.
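As an illustration of the extra friction, a typical pbuilder routine involves commands roughly like these, plus remembering to keep the base tarball up to date yourself:

    # Roughly the pbuilder routine: create a base tarball once, keep it
    # updated by hand, and remember to build with pdebuild instead of the
    # default tools.
    sudo pbuilder create --distribution sid
    sudo pbuilder update
    pdebuild    # run from inside the unpacked source package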
That is part of the problem here: why should I have to spend the effort to become informed to make a choice well? Why is the default way of building not clean? Don't say performance: Morph sets up a clean staging area in seconds, and does not offer you a choice of not doing so. It's a chroot, with everything but the build tree hardlinked from cached, unpacked build dependencies, and protected using read-only bind mounts.
What's more, this approach avoids having to maintain the pbuilder base tarballs or sbuild chroots manually. It's all automatic, and up to date, for every build. Furthermore, the staging area contains only the specified build dependencies, and not anything else, meaning a build fails if a build dependency is missing, rather than succeeding because it happens to be in the default set of packages the build tool installs.
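For those who want to picture the underlying trick, here is a very rough sketch of the idea, not Morph's actual implementation (run as root; all the paths are made up):

    # The idea, not Morph's implementation: populate a staging directory
    # by hardlinking cached, unpacked build dependencies (cheap and fast),
    # make it read-only with a bind mount, and bind in a writable build tree.
    mkdir -p staging
    cp -al cache/deps-root/. staging/     # hardlinks, so this takes seconds
    mkdir -p staging/build
    mount --bind staging staging
    mount -o remount,ro,bind staging      # protect the hardlinked files
    mount --bind "$PWD/build-tree" staging/build   # only this is writable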
Mass building shouldn't be hard
Suppose you want to try out a large-scale change in Debian. It might be trying out a new version of GCC, or using llvm's clang as the default C compiler, or updating glibc, or doing a large library transition such as a new version of GTK+, or trying a new dpkg version, or even something more exploratory such as changing the default optimisation flags for all packages to see what breaks.
All of these will require you to at least rebuild all affected packages. Ideally you'd test the built packages as well, but let's concentrate on the building for now.
Here's what you do. You make the changes you want to try out, and build those packages. You create a dedicated APT repository, and upload your packages there. You configure your build environment to favour that APT repository. You make a list of all reverse build dependencies of the packages you changed. You write a script to build all of those, preferably so that if you change A, and rebuild B, then you also rebuild C which build depends on B. Each rebuilt package you also upload to your APT repository. You keep track of the build log, and success or failure, of each build.
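A rough sketch of that workflow in shell might look like the following; the package name "foo" and the paths are made up, build-rdeps comes from devscripts, and the filtering of its output is approximate:

    # Rough sketch of the manual mass-rebuild workflow described above.
    mkdir -p logs
    build-rdeps foo | grep -v '^\(Reverse\|Found\|--\)' | sort -u > rdeps.txt
    while read pkg; do
        apt-get source "$pkg" < /dev/null \
            || { echo "$pkg: no source" >> logs/failed; continue; }
        ( cd "$pkg"-*/ && pdebuild ) > "logs/$pkg.log" 2>&1 \
            || echo "$pkg" >> logs/failed
    done < rdeps.txt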
The people in Debian who do this kind of stuff regularly presumably have tools for doing it. It shouldn't be a rare, special thing, though. If my package has reverse build dependencies, I should at least consider test-building them when I'm making changes. Otherwise, it might take years until the reverse build dependencies are rebuilt, and the problem is only found then, making it harder to fix.
To be fair, building a lot of packages takes a lot of resources. It's not feasible to rebuild everything in Debian every time there's any change to, say, eglibc. However, it's feasible to do it, for large subsets of the archive, without huge hardware investments.
One VCS to rule them all
In 1995 there was really only one relevant version control system: CVS. It was not a great tool. In 2000, another contender existed: Subversion. It fixed some problems in CVS, but still wasn't a great tool. In 2005, there was a great upheaval and distributed version control systems started to become mainstream. There were a large handful of them. In 2010, it was becoming pretty clear that git had won. It's ugly, but it's powerful.
I'm not going to debate the relative merits of different version control systems. Until recently, I was a Bazaar boy, and all of my personal projects were kept in Bazaar. (I have recently switched everything to git.)
There are, however, strong benefits from everyone using the same system. Developers don't need to learn a dozen version control systems. Tools that operate on many repositories are easier to write and maintain. Workflows become simpler if one system can be assumed.
Debian has a strong historical tendency to choose every option. This is sometimes a good thing, and sometimes a bad thing. For keeping source packages in version control I believe it to be a bad thing. The status quo is that a Debian source package may not be in version control at all, or it might be in any version control system.
This is acceptable when everyone only ever needs to maintain their own packages. However, in a distribution the size of Debian, that is not the case. NMUs, security support, archive-wide transitions, and other situations arise when I might need to change yours, or you might need to change mine.
We try to work around this by having a complicated source package format, using quilt to maintain patches to upstream semi-manually in a debian/patches directory. This is an awkward workflow, and one that trips up those who are not used to it. (I know quilt is a patch management system, not a version control system. I think git does it much better anyway.)
It would be oh so much easier if everyone kept their source packages in the same, real version control system. I don't even care what it is, as long as it is powerful enough to handle the use cases we have.
Imagine a world where every Debian source package is kept in, for argument's sake, git, and everyone also uses the same layout and roughly the same workflow to maintain it. What would this mean?
It would mean that if you want to inspect the history of your package, you know how to do that. If you want to merge in some bugfix from upstream code, you know how to do that, without having to figure out which of the several source package formats is in use.
It would make feasible the development of more powerful, higher-level tooling. For example, it would allow Debian to have what we call system branches in Baserock. In Debian we have stable, testing, unstable, and experimental. We may get something like Ubuntu's PPAs, or perhaps an improved version of those. These are very poor versions of system branches, just like quilt is a poor way to manage patches and deltas against upstream. For example, you can upload an experimental version of gcc to experimental, but then nobody else can upload another experimental version. You can set up your own PPA for this, but you'll still be affected by all the uploads to unstable while you're working.
A Baserock system branch is a branch of the entire system, or the entire distribution in the case of Debian. It is isolated from other system branches. A branch in an individual repository is a well-known concept; a system branch is conceptually like branching every repository in the distribution at once. The actual implementation is more efficient, of course.
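In plain git terms, you can approximate the concept, very crudely, by creating the same topic branch in every repository you care about; in Baserock this bookkeeping is handled by Morph and is much cheaper than the naive version sketched here:

    # A very crude approximation of a system branch using plain git:
    # create the same topic branch in every repository of interest.
    # The branch name and the repos/ layout are made up.
    branch=experimental-gcc
    for repo in repos/*/; do
        ( cd "$repo" && git checkout -b "$branch" )
    done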
This would be possible to implement without standardising on one version control system, but it would be much harder to do, and would have to live with the lowest common denominator for features. CVS and Subversion, for example, don't really do merges, whereas Bazaar, Mercurial, and git do. Possible does not mean feasible.
Any work you do in a system branch is isolated. Your work doesn't affect others, and theirs doesn't affect yours, until a merge happens. This is a simple, but very powerful tool.
Cheap system branches and powerful merging make it possible to do experiments safely, with little fuss. Combine that with being able to build everything cleanly and quickly, and you get into a situation where there's no need to make technical decisions based on arguments on mailing lists; instead, they can be made by looking at working code.
I don't know how this could be implemented in Debian, but think about it. If Debian could have this, it might make many archive-scale changes easier.
debian/rules clean: really?
One of the silliest things we require of packages is that they have a debian/rules clean rule that cleans up after a build perfectly, so that we can do repeated builds in the same source tree.
Let's just use git clean -fdx instead.
This is a problem that is superbly well suited for automation. There is no point whatsoever in making packagers do any manual work for this.
Large scale semi-mechanical changes require too much effort
About a decade ago, we decided to follow a new version of the Filesystem Hierarchy Standard and transition from /usr/doc to /usr/share/doc. This was an almost entirely mechanical change: in many cases, a mere rebuild would fix it, and in almost every other case it was just a minor tweak to the packaging. A one-line change.
It took us seven years to do this.
Seven years. Think about it.
In a recent discussion about building R binary data files from source at package build time, it was suggested that we take 2-3 release cycles to get this done. That's four to six years. Think about it.
These are not isolated cases. Every time we need to make a change that affects more than a small handful of packages, it becomes a major undertaking. Most of the time all the people involved are agreeable to the change, and welcome it. The change takes so long because it requires co-ordinating a large number of people. Sometimes people are just busy. Sometimes they've left the project, but haven't properly orphaned their packages. Waiting for a timeout for answers about such packages drags the process out further.
Mechanical changes, or semi-mechanical ones, which are very easy and very quick to do, should not take years. They should take an evening, no more.
There's no end of changes we might want to do like this. In 2005 and 2006 I filed a few hundred bugs from failed piuparts runs. They're still being fixed, even when the fix is simple, such as adding a postrm script to remove, when the package is purged, a configuration file created by postinst, or starting the service with invoke-rc.d rather than running an init.d script directly. Mostly one-line changes.
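For illustration, the kind of one-line fix involved looks like this in a postrm; the configuration file name is made up:

    #!/bin/sh
    set -e
    # Illustrative postrm sketch; the file name is made up.
    # On purge, remove a configuration file that postinst created,
    # so that purging really leaves no files behind.
    if [ "$1" = "purge" ]; then
        rm -f /etc/foo/generated.conf
    fi
    #DEBHELPER#
    exit 0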
There are more mechanical changes that might happen. For example, changing the value of the Vcs-Browser field when the Debian version control server changes domain names.
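If every source package lived in one place, such a change could be close to a one-liner over a checkout of everything; the layout and domain names here are purely illustrative:

    # Illustrative only: with every source package checked out under
    # packages/, updating the Vcs-Browser domain would be a one-liner.
    sed -i 's|git\.old\.example\.org|git.new.example.org|' packages/*/debian/control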
It's not just about simple, mechanical changes, either. Transitions of important library packages, for example, which require changes to reverse dependencies due to a changed API are another source of pain. What should be a few evenings of build fixes can drag out to months of co-ordination.
This is caused partly by technical issues, and partly by social issues. The main social issue is that we have quite strong ownership of packages, and NMUs are to be done carefully, and only in some cases. This is both good and bad, and I won't discuss that aspect now. The technical issue is that our tools are primarily intended for maintaining individual packages, rather than all of them together, making it harder to make even simple changes to a large number of packages.
In addition to easy mass building and system branches, as outlined above, large-scale changes would require testing tools, so that you don't just build the new package versions, but also test them automatically. Essentially, CI at the distribution level.
Conclusion
I've listed above a small variety of problems I see in the Debian development processes and tools. They're not the important part of this talk. The important part is that we, the Debian developers, should look at our tools and workflows critically, and improve them when we can. Even small improvements are very useful when they affect each of our twenty thousand source packages. The important change I argue for here is one of mindset, rather than a fix for any specific problem in any tool. We need a mindset of constant, incremental improvement for our tools.