[[!tag benchmark]]

As part of my development of a backup application, I run benchmarks on it, and that means creating a test data set, running some backups, and then removing the test data set. One of the test data sets I use is 140 gibibytes of my real data. The benchmark first copies the data to a temporary location.

In other words, a fair bit of my current life is spent waiting for files to be copied and removed. The faster that goes, the better.

Overnight, I ran a little benchmark on those operations, to compare a couple of ways to do them. The results are below:

  elapsed  cmd
      (s)
    107.2  rm -rf tmp/data
     98.4  find tmp/data -delete
    100.1  find tmp/data -exec rm -rf {} +
    116.2  find tmp/data -depth -print0 | xargs -0 rm -rf

  elapsed  cmd
      (s)
   3567.5  cp -a tmp/data tmp/copy
   3219.5  cd tmp && mkdir copy && tar -C data -cf - . | tar -C copy -xf -

It is surprising, but find is clearly faster than rm at deleting files, by almost ten percent. Since performance is a feature, this would indicate that that particular feature in rm is buggy.

For file copying, the piping of two tars is a common trick, and it really is faster, again by almost ten percent.
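
For reference, here is that tar pipe spelled out with comments; the directory names are the ones from the benchmark above.

    # Create the destination directory first; tar will not do that for us.
    cd tmp && mkdir copy

    # The first tar changes into the source directory (-C data) and writes an
    # archive of its contents (.) to stdout (-f -); the second tar reads that
    # archive from stdin and unpacks it into the destination (-C copy). The
    # archive only ever exists as a stream in the pipe, never as a file on disk.
    tar -C data -cf - . | tar -C copy -xf -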

Obviously, there might also be a problem with the benchmark. I attach the script; it uses benchmark-cmd from extrautils, which I wrote for this kind of thing. If you find a problem with the benchmark, don't hesitate to provide a patch to fix it.
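
I do not reproduce the attached script here, but a minimal sketch of the same kind of measurement, without benchmark-cmd, might look like the following; tmp/orig is a made-up name for a pristine copy of the test data.

    #!/bin/sh
    # Sketch only, not the attached script: time each removal command against a
    # freshly made copy of the test data.
    set -e
    for cmd in 'rm -rf tmp/data' 'find tmp/data -delete'; do
        cp -a tmp/orig tmp/data   # recreate the data to be removed
        sync                      # flush the copy's writes so the removal is not billed for them
        /usr/bin/time -f "%e  $cmd" sh -c "$cmd"
    done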

There may be other ways to remove or copy files that should be compared, too. rsync? cpio? For file removal, a tool using Linux getdents directly would probably be faster than the portable code in GNU coreutils and findutils. Somebody should write that and compare.
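
For the copy comparison, the rsync and cpio variants I would try look roughly like this; the paths match the benchmark above, but these are untested suggestions, not measured results.

    # rsync in archive mode preserves permissions, times, symlinks and so on,
    # much like cp -a. The trailing slash on the source copies the contents of
    # data into copy rather than the data directory itself.
    rsync -a tmp/data/ tmp/copy/

    # cpio in pass-through mode: find emits the file list, cpio recreates it
    # under the destination, creating directories (-d) and preserving
    # modification times (-m); -0 matches find's -print0.
    cd tmp && mkdir copy && cd data && find . -depth -print0 | cpio -0 -pdm ../copy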

In all cases above, the test data set being removed or copied was 30 GiB. The copies went to the same disk (which is also what happens in my backup benchmarks), and the filesystem was ext4.