[[!tag debian ubuntu]]

This is a repost to my blog of the message I sent to ubuntu-devel; see the archive for the original.

I'm working on a spec to add apt-sync support for karmic. See https://wiki.ubuntu.com/AptSyncInKarmicSpec for details.

I've done some initial benchmarking of using zsync to download .debs for updates, to see if apt-sync can be worth it. For simplicity, I've done the benchmarking using the underlying zsync tool rather than apt-sync.

Background explanation: A .deb is, essentially, a very thin wrapper (an ar archive) around a couple of tar files, which can be compressed with gzip, bzip2, or lzma. Zsync applies the rsync algorithm over plain HTTP, using range requests, so no rsync server needs to be running, which makes it feasible to use with all mirrors. apt-sync implements the idea that when downloading an update to a package you already have installed, there is no point in re-downloading the unchanged parts of the package.
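To make the "thin wrapper" concrete, here is a minimal Python sketch that lists the members of an ar archive, which is the container format a .deb uses. The helper names (`list_ar_members`, `make_ar`) and the tiny demo archive are my own illustration, not part of dpkg or apt-sync, and the parser only handles the simple short-name case.

```python
AR_MAGIC = b"!<arch>\n"
HEADER_SIZE = 60  # fixed-size header in front of every ar member


def list_ar_members(data):
    """Return (name, size) for each member of an ar archive held in bytes."""
    if not data.startswith(AR_MAGIC):
        raise ValueError("not an ar archive")
    members = []
    offset = len(AR_MAGIC)
    while offset + HEADER_SIZE <= len(data):
        header = data[offset:offset + HEADER_SIZE]
        if header[58:60] != b"`\n":
            raise ValueError("bad member header at offset %d" % offset)
        # Name is 16 bytes, space padded; some ar variants append "/".
        name = header[0:16].decode("ascii").rstrip(" ").rstrip("/")
        size = int(header[48:58].decode("ascii").strip())
        members.append((name, size))
        # Member data is padded to an even length.
        offset += HEADER_SIZE + size + (size % 2)
    return members


def make_ar(members):
    """Build a tiny ar archive from (name, payload) pairs, for the demo only."""
    out = bytearray(AR_MAGIC)
    for name, payload in members:
        header = "%-16s%-12s%-6s%-6s%-8s%-10d`\n" % (
            name, "0", "0", "0", "100644", len(payload))
        out += header.encode("ascii") + payload
        if len(payload) % 2:
            out += b"\n"
    return bytes(out)


# Hypothetical stand-in for a real package: same member names, fake payloads.
demo = make_ar([
    ("debian-binary", b"2.0\n"),
    ("control.tar.gz", b"abc"),
    ("data.tar.gz", b"payload"),
])
print(list_ar_members(demo))
```

On a real package you would pass the contents of the .deb file instead of `demo`; the compressed tarballs listed there are what the scenarios below recompress or uncompress.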

My benchmark consists of downloading all the security updates to hardy. The goal of the benchmark is to get some figures for how good zsync (and therefore apt-sync) actually is, or can be, at downloading .debs.

Here are the summary results:

scenario        % saved     comment
-----------------------------------
plain            3.7        plain zsync, original .debs
zsyncmakeZ       3.7        use zsync gzip magic, original .debs
rsyncable       25          gzip magic, recompress with gzip --rsyncable
rsyncable2      33          gzip magic, gzip the outer ar, not the inner tarballs
rsyncable3      33          gzip magic, convert lzma, bzip2 to gzip
uncompressed    50          uncompress tarballs within .debs

The percent saved is the share of bytes zsync did NOT need to download, i.e., how much it reused from the previous package.
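As a small worked example of that figure, here is a sketch in Python. I'm assuming here that the percentage is taken over the total bytes of all packages rather than averaged per package; the function name and the sizes are made up for illustration and are not benchmark data.

```python
def percent_saved(packages):
    """packages: list of (full_size, bytes_downloaded) pairs, one per .deb.

    Returns the percentage of bytes that did not have to be downloaded,
    computed over the total size of all packages (an assumption).
    """
    total = sum(full for full, _ in packages)
    fetched = sum(got for _, got in packages)
    if total <= 0:
        raise ValueError("no data")
    return 100.0 * (total - fetched) / total


# Hypothetical sizes in bytes, not taken from the benchmark.
print(percent_saved([(1000, 750), (3000, 2250)]))
```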

Further explanation and discussion:

  • plain: This is using plain zsync, and original .debs.

  • zsyncmakeZ: This uses the -Z option to zsyncmake when creating the .zsync file, in order to use zsync's magic gzip handling. Turns out that this doesn't help at all, compared to the plain scenario. Unless we change how the archive generates .debs, plain and zsyncmakeZ are the only options available. They seem to be identical, and neither of them is likely to be worth the effort.

  • rsyncable: This makes it easier for zsync to do magic things with gzip, by recompressing the gzipped tarballs within the .deb files with gzip --rsyncable. This provides a lot of improvement. Saving a quarter of the bandwidth is already fairly significant, especially since the size impact on the .debs is less than 1%.

  • rsyncable2: This tests whether zsync's gzip magic works better if the gzip compression is the outermost layer. This is not a realistic option for the archive, but provides a data point for comparisons. Turns out, the differences are insignificant.

  • rsyncable3: Some packages use lzma or bzip2 compression of the tarballs within the .deb. This benchmark converts those to be compressed with gzip --rsyncable. This improves things a bit compared to just rsyncable (33% saved instead of 25%), but the packages grow by 17% compared to rsyncable. Because most of the packages using lzma are OpenOffice.org related, it is probably not realistic to make them use gzip --rsyncable due to CD size limits, but it might be possible to use it for updates that don't get put on CDs.

  • uncompressed: This uncompresses all tarballs within the .deb, to give a baseline for just how much zsync could save in an optimal situation.
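To illustrate why the scenarios above differ so much, here is a toy Python sketch. It is not the benchmark and not how gzip or zsync are implemented: it only estimates how many fixed-size blocks of a "new" file can be found somewhere in the "old" one, for uncompressed data, for ordinary zlib output, and for data compressed in independent chunks whose boundaries depend on content (roughly the idea behind --rsyncable). The block size, window, modulus, the CRC-based boundary test, and the generated text are all assumptions picked for the demo.

```python
import random
import zlib

BLOCK = 128     # toy block size for matching
WINDOW = 32     # bytes inspected when deciding on a chunk boundary
MODULUS = 512   # a boundary occurs, on average, about every MODULUS bytes


def reusable_fraction(old, new, block=BLOCK):
    """Fraction of aligned blocks of `new` found at any offset in `old`."""
    windows = {old[i:i + block] for i in range(len(old) - block + 1)}
    blocks = [new[i:i + block] for i in range(0, len(new) - block + 1, block)]
    if not blocks:
        return 0.0
    return sum(b in windows for b in blocks) / len(blocks)


def chunked_compress(data):
    """Compress content-defined chunks independently of each other.

    A CRC of the last WINDOW bytes stands in for gzip's rolling sum;
    this only emulates the "reset the compressor" idea.
    """
    out = []
    start = 0
    for i in range(WINDOW, len(data) + 1):
        if zlib.crc32(data[i - WINDOW:i]) % MODULUS == 0:
            out.append(zlib.compress(data[start:i]))
            start = i
    out.append(zlib.compress(data[start:]))
    return b"".join(out)


# Made-up "package contents": repetitive text with one small insertion.
rng = random.Random(0)
words = [b"apt", b"zsync", b"deb", b"gzip", b"tar", b"mirror", b"update", b"hardy"]
old = b" ".join(rng.choice(words) for _ in range(12000))
new = old[:1000] + b"SECURITY FIX " + old[1000:]

uncompressed = reusable_fraction(old, new)
plain_zlib = reusable_fraction(zlib.compress(old), zlib.compress(new))
chunked = reusable_fraction(chunked_compress(old), chunked_compress(new))

for label, value in [("uncompressed", uncompressed),
                     ("plain zlib", plain_zlib),
                     ("chunked", chunked)]:
    print("%-14s %.2f" % (label, value))
```

The point of the sketch is only the ordering: a small change early in the input tends to disturb everything after it in an ordinary compressed stream, while independently compressed chunks line up again soon after the change. The real percentages in the table come from actual packages and depend on much more than this.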

I suspect the 25% value is a bit optimistic, since it comes from a rather special case: security updates don't typically change the package all that much. Backports and updates within a development cycle are likely to change the packages much more. Upgrades from release to release are also likely to change so much that zsync won't save a whole lot.

Before I continue working on this, I'd like some feedback: is a 25% reduction in download sizes worth pursuing? It would seem to require changing dpkg to call the external gzip binary with --rsyncable, rather than using the internal zlib library.

What do other people think?

PS. I've fully automated my benchmark, and am happy to share the scripts. If anyone wants to play with them, drop me a note, and I'll set up a public bzr branch. You'll need fast access to a mirror, since they download snapshots of hardy-security and the corresponding packages from hardy, for a total of about three gigabytes.