[[!tag idea]]

During lunch the other day, I discussed the shortcomings of the tar file format with my friend and co-worker Daniel. The tar file format carries a lot of legacy by now, and it has not kept up with newer developments in file systems, such as extended attributes. This makes tar badly suited for things such as backups and other situations where precise reproduction of the input data matters.

There are several variants of the tar file format, and various more or less standard extensions to it. GNU tar, for example, added support for pathnames longer than 100 bytes many years ago, and that extension is now commonly supported.
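To see the pathname limit in practice, here is a minimal sketch using Python's standard `tarfile` module, chosen purely for illustration. The helper names `write_one_member` and `read_names` are made up for this example. It assumes a single path component of 150 bytes: the ustar variant can move a leading directory part into a separate prefix field, but a component that long cannot be split, so writing fails, while the GNU variant stores the long name in an extra header.

```python
import io
import tarfile


def write_one_member(name, fmt):
    """Write one empty file called `name` into an in-memory tar archive."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=fmt) as tar:
        tar.addfile(tarfile.TarInfo(name))
    return buf.getvalue()


def read_names(data):
    """Return the member names stored in an in-memory tar archive."""
    with tarfile.open(fileobj=io.BytesIO(data), mode="r") as tar:
        return tar.getnames()


# A single path component longer than the 100-byte name field.
long_name = "a" * 150

# The ustar variant has nowhere to put a component this long.
try:
    write_one_member(long_name, tarfile.USTAR_FORMAT)
    ustar_ok = True
except ValueError:
    ustar_ok = False

# The GNU variant writes the long name as an extension header.
gnu_archive = write_one_member(long_name, tarfile.GNU_FORMAT)
```

Reading `gnu_archive` back with `read_names` should return the full 150-character name, while `ustar_ok` ends up false.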

Other problems in the tar file format:

  • It has no native support for compression. The Unix Way is to use an external compressor, which is nice, but it makes it necessary to decompress the entire file to get a list of its contents. For large archives, this is very time-consuming.
  • Even when uncompressed, the file format works badly for some kinds of operations, such as deleting files from the archive, or updating them with new versions.
  • The file format is entirely linear. When creating a tar file, it would sometimes be possible to write data from multiple sources at the same time, perhaps compressing them separately, maybe with file type specific compressors. With a linear format, this is not possible without spooling some files into temporary files. An interleaved format, similar to multimedia files, which mix audio and video data into a single stream, would make it possible to be more efficient at writing.
  • The supported metadata for files is limited, and it's hard to extend the support without breaking the file format.
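To make the linearity concrete, here is a minimal Python sketch of what listing an archive involves. It is not how any real tar implementation is written. The function names `build_demo_archive` and `list_members` are invented for this example, and it assumes plain ustar headers only: the name in the first 100 bytes of each 512-byte header, the size as octal text at bytes 124 to 135, and no prefix field or extension headers in use.

```python
import io
import tarfile

BLOCK = 512  # tar headers and data are laid out in 512-byte blocks


def build_demo_archive():
    """Build a small uncompressed ustar archive in memory (demo data only)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tar:
        for name, size in (("a.txt", 5), ("b.txt", 1000)):
            info = tarfile.TarInfo(name)
            info.size = size
            tar.addfile(info, io.BytesIO(b"x" * size))
    return buf.getvalue()


def list_members(raw):
    """Walk the archive header by header; there is no index to consult."""
    members = []
    offset = 0
    while offset + BLOCK <= len(raw):
        header = raw[offset:offset + BLOCK]
        if header == bytes(BLOCK):
            break  # an all-zero block marks the end of the archive
        name = header[0:100].rstrip(b"\0").decode("utf-8")
        size = int(header[124:136].rstrip(b"\0 ").decode("ascii") or "0", 8)
        members.append((name, size))
        # Skip the header plus the data, rounded up to whole blocks.
        data_blocks = (size + BLOCK - 1) // BLOCK
        offset += BLOCK + data_blocks * BLOCK
    return members
```

The only way to find the second member is to read the first header and step over its data. On an uncompressed file that step can be a seek, but when the whole archive is one compressed stream, stepping over the data still means decompressing it, and there is no cheap way to delete or replace a member in the middle.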

This led us to discuss the possibility of a new file format. We had a bit of fun exploring the solution space for a while.

However, almost all use of tar these days is for distributing sets of files, where the filename and a basic set of file permissions are enough. In other words, for things such as source code, tar is just fine. The archives are small enough, and the other limitations are rarely a problem, but the pain of switching to a new format would be great. Thus, with some reluctance, we concluded that a new format would be a waste of time.

But I thought I'd write this up anyway, in case one of my readers wants to start working on this.