[[!tag benchmark]]
As part of my development of a backup application, I run benchmarks on it, and that means creating a test data set, running some backups, and then removing the test data set. One of the test data sets I use is 140 gibibytes of my real data. The benchmark first copies the data to a temporary location.
In other words, a fair bit of my current life is spent waiting for files to be copied and removed. The faster that goes, the better.
Overnight, I ran a little benchmark on those operations, to compare a couple of ways to do them. The results are below:
    elapsed (s)   command
    107.2         rm -rf tmp/data
     98.4         find tmp/data -delete
    100.1         find tmp/data -exec rm -rf {} +
    116.2         find tmp/data -depth -print0 | xargs -0 rm -rf
    elapsed (s)   command
    3567.5        cp -a tmp/data tmp/copy
    3219.5        cd tmp && mkdir copy && tar -C data -cf - . | tar -C copy -xf -
It is surprising, but it's clear that find is significantly faster than rm at deleting files, by almost ten percent. Since performance is a feature, this would suggest that this particular feature in rm is buggy.
For file copying, the piping of two tars is a common trick, and it really is faster, again by almost ten percent.
Obviously, there might also be a problem with the benchmark. I attach the script, which uses benchmark-cmd in extrautils, which I wrote for this kind of thing. If there is a problem with the benchmark, don't hesitate to provide a patch to fix that.
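The attached script is the authoritative version. Purely to illustrate the shape of such a measurement, here is a minimal sketch in plain shell: it does not use benchmark-cmd, and the staging path, the sync calls, and the whole-second timing are my own assumptions, not taken from the real script.

    #!/bin/sh
    # Minimal sketch: time a few removal commands over the same test data.
    # SRC is a hypothetical location holding the test data, re-staged
    # before every run; the real benchmark script may differ.
    set -e
    SRC="$HOME/benchmark-data"
    mkdir -p tmp

    for cmd in 'rm -rf tmp/data' 'find tmp/data -delete'; do
        cp -a "$SRC" tmp/data   # re-create the test data for this run
        sync                    # flush pending writes so they are not
                                # charged to the command being measured
        start=$(date +%s)
        sh -c "$cmd"
        sync
        end=$(date +%s)
        echo "$((end - start))s  $cmd"
    done

Whole-second resolution is cruder than the numbers above, but with runs lasting a hundred seconds or more the error is negligible.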
There may be other ways to remove or copy files that should be compared, too. rsync? cpio? For file removal, a tool using the Linux getdents system call directly would probably be faster than the portable code in GNU coreutils and findutils. Somebody should write that and compare.
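For what it's worth, here is roughly what those extra candidates might look like. These are common invocations I would try, not commands from the benchmark, and I have not measured them; a raw getdents-based remover would have to be written in C, so it is not sketched here.

    # rsync archive copy to the same disk
    rsync -a tmp/data/ tmp/copy/

    # cpio pass-through copy
    mkdir tmp/copy && cd tmp/data && find . -print0 | cpio -0 -pdm ../copy

    # removal via rsync from an empty directory, a trick sometimes
    # recommended for directories with very many files
    mkdir empty && rsync -a --delete empty/ tmp/data/ && rmdir empty tmp/data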
In all cases above, the test data set to be removed or copied was 30 GiB. Copies went to the same disk (which is what happens in my backup benchmarks, too). The filesystem was ext4.