[[!tag obnam]]

For my backup program, I wish to store a chunk of content only once, regardless of how many times it appears in the filesystem. This can easily be done by having a way of looking up chunks via checksums. The lookup has some time and space overhead: the smaller the chunk, the more chunks there are, and the higher the overhead. On the other hand, intuition says that the smaller the chunk, the more likely it is to have a duplicate, and the more space can be saved.
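To make that concrete, here is a minimal sketch (my illustration here, not the real backup program's code) of the idea: chunks live in a table keyed by checksum, so a chunk that has been seen before costs nothing extra to store.

    import hashlib

    class ChunkStore:
        """Store each unique chunk exactly once, keyed by its checksum."""

        def __init__(self):
            self.chunks = {}  # checksum -> chunk data

        def put(self, data):
            key = hashlib.md5(data).hexdigest()
            if key not in self.chunks:   # a duplicate chunk is not stored again
                self.chunks[key] = data
            return key                   # files refer to their chunks by key

        def get(self, key):
            return self.chunks[key]

The backup would then describe each file as a list of chunk keys; a chunk that appears in a thousand files is still stored only once.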

Is that true? Only one way to find out.

I wrote a little program to compute an MD5 checksum for each chunk of a given size, at given offsets. For example, 4096-byte chunks at 1024-byte offsets (so the chunks overlap). Then I ran this program on a snapshot of my laptop's home directory.
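The heart of that program was roughly along these lines (a reconstruction, not the original; the chunk size and offset are parameters, and chunks overlap whenever the offset is smaller than the chunk size):

    import hashlib

    def chunk_checksums(path, chunk_size=4096, offset=1024):
        """Yield the MD5 checksum of every chunk_size-byte chunk that
        starts at a multiple of offset within the file at path."""
        with open(path, 'rb') as f:
            pos = 0
            while True:
                f.seek(pos)
                data = f.read(chunk_size)
                if len(data) < chunk_size:   # ignore the short tail
                    break
                yield hashlib.md5(data).hexdigest()
                pos += offset

Collecting these checksums for every file and counting how often each one repeats gives, for each combination of chunk size and offset, how much of the data is duplicated.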

![Amount of duplicated data found, by chunk size and offset](plot.png)

In the plot above, it looks to me like the offset only matters when it is very small (up to about 4 KiB). The chunk size, however, matters quite a lot. Luckily, there seems to be a large bump at 128 KiB. That is lucky because 128 KiB is a fairly large chunk, so there are few chunks in total, and the lookup overhead is much smaller.
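As a rough back-of-the-envelope illustration (my own made-up numbers, not measurements): for a hypothetical 100 GiB home directory, 4 KiB chunks mean about 26 million checksums to keep in the lookup index, while 128 KiB chunks mean about 800,000, roughly 32 times fewer.

    GiB = 2**30
    KiB = 2**10
    data = 100 * GiB                      # hypothetical home directory size
    for chunk in (4 * KiB, 128 * KiB):
        n = data // chunk                 # number of non-overlapping chunks
        index = n * 16                    # 16 bytes per MD5 digest, nothing else counted
        print(chunk // KiB, 'KiB chunks:', n, 'chunks,', index // 2**20, 'MiB of checksums')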

Or possibly I should learn some statistics.

I have no idea yet whether this will result in anything useful for the actual backup program. It was just a small side project.