1

I am running a Linux-based rsync server for software distribution. A Windows-based source repository server, which is outside my control, pushes software packages to it via rsync, and about a hundred satellite servers worldwide pull from it, also via rsync.

The source repository contains many large duplicate files. I want to reduce disk space and bandwidth consumption on the satellite servers by replacing those duplicates with hard links. The administrator of the source repository is unwilling or unable to do so at the source, so I'm trying to do it after the fact on the distribution server. I have created a simple bash script based on the fdupes command which finds groups of duplicates and replaces them with hard links to a single file. The rsync transfers to the satellite servers preserve these hard links as desired thanks to the -H option.

The transfer from the source repository, however, produces inconsistent results. Sometimes the deduplication is preserved. Sometimes the source server retransmits all of the files of a deduplicated group and the deduplication is broken, even though the source files did not change.

Hence my question: What is the official behaviour of rsync when it is asked to sync two identical but separate files that already exist at the destination with the correct content, but as hard links to the same file? What are the exact criteria for retransmitting a file? Is there a way to ensure that the hard link at the destination is preserved in that situation even though the hard link does not exist at the source?

Tilman Schmidt

2 Answers

3

tl;dr: To preserve file level deduplication via hard links at the destination, run rsync with the --checksum option.

Full answer, according to a series of experiments I did:

If two files are not hard-linked at the source, rsync syncs each of them individually to the destination. It does not care whether the files happen to be hard-linked at the destination. If either file is retransmitted, the hard link at the destination is broken; otherwise it is left untouched. That is, even with the --hard-links option, rsync will not break a hard link at the destination just because the files are not hard-linked at the source.

The criteria for retransmitting a file depend on the --checksum (-c) and --ignore-times (-I) options.

  • If the option --checksum is given, only files that differ in size or checksum between source and destination are retransmitted. Consequently, if the file content hasn't changed then a hard link at the destination will be preserved even if it doesn't exist at the source.
  • If the option --ignore-times is given, all files are retransmitted, breaking any hard link at the destination that doesn't exist at the source.
  • If neither of these two options is given, rsync decides based on the size and modification timestamp of the source and destination files. Hard-linked names share a single inode and therefore a single timestamp, so if the timestamps of the two source files differ, a hard link at the destination will always be broken because only one of the two timestamps can match.
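
The cases above can be reproduced with a small local experiment. The following is only a sketch, assuming rsync and GNU coreutils are installed. The directory and file names and the same_inode helper are made up for illustration. It creates two identical source files with different modification times, hard-links them at the destination, and then checks after each rsync run whether the two destination names still share an inode.

```shell
#!/usr/bin/env bash
# Sketch of the experiment: two identical, separate source files with
# different mtimes, hard-linked to each other at the destination.
# All names are hypothetical; requires rsync and GNU touch.

demo=$(mktemp -d)
mkdir "$demo/src" "$demo/dst"
printf 'payload\n' > "$demo/src/a"
cp "$demo/src/a" "$demo/src/b"
touch -d '2020-01-01 00:00:00' "$demo/src/a"
touch -d '2020-06-01 00:00:00' "$demo/src/b"

# Destination as left by a deduplication pass: one inode, two names.
cp -p "$demo/src/a" "$demo/dst/a"
ln "$demo/dst/a" "$demo/dst/b"

# Report whether two paths refer to the same device and inode.
same_inode() { [ "$1" -ef "$2" ] && echo linked || echo separate; }

# With --checksum, files are compared by size and checksum.
rsync -a --checksum "$demo/src/" "$demo/dst/"
after_checksum=$(same_inode "$demo/dst/a" "$demo/dst/b")

# Without it, files are compared by size and modification time.
rsync -a "$demo/src/" "$demo/dst/"
after_default=$(same_inode "$demo/dst/a" "$demo/dst/b")

echo "after --checksum: $after_checksum"
echo "after default:    $after_default"
```

If rsync behaves as described above, the first run should report the files as still linked and the second as separate. Adding -i (--itemize-changes) to the rsync calls shows which files were actually transferred.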
Tilman Schmidt
2

rsync preserves source hard links if you use the -H or --hard-links option.

That will not create hard links -- you'll have to do that after the fact by looking for files with the same checksum, deleting one, and adding a hard link to replace it. After all, you wouldn't want rsync to turn every file with duplicate content into a hard link to the same file. Imagine if every zero-length file were a hard link -- you add content to one, you've changed the content for all.
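
A pass like the one described above can be sketched in bash as follows. This is a minimal illustration and not the fdupes-based script from the question. The function name dedupe_tree is made up. It assumes bash 4 or later, GNU find, sort, sha256sum and ln, and that everything under the given directory is on one filesystem. Empty files are skipped to avoid the zero-length problem mentioned above.

```shell
#!/usr/bin/env bash
# Sketch only: hard-link files with identical content under one directory.
# Assumes bash 4+, GNU find/sort/sha256sum/ln, and a single filesystem.

dedupe_tree() {
    local root=$1 path sum keep
    local -A first_seen=()            # checksum -> first path with that content
    while IFS= read -r -d '' path; do
        sum=$(sha256sum < "$path") || continue
        sum=${sum%% *}                # keep only the hex digest
        keep=${first_seen[$sum]:-}
        if [[ -z $keep ]]; then
            first_seen[$sum]=$path
        elif [[ ! $path -ef $keep ]] && cmp -s -- "$keep" "$path"; then
            # Different inode, same bytes: replace it with a hard link.
            ln -f -- "$keep" "$path"
        fi
    done < <(find "$root" -type f -size +0 -print0 | sort -z)
}
```

It would be called with the directory to deduplicate as its only argument, for example dedupe_tree /path/to/packages (a placeholder path). Names linked this way share one inode, so they also share one modification time, owner, and mode.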

mpez0
  • That does not answer my question. The question is under which circumstances it preserves or breaks _destination_ hard links, specifically if the source files are _not_ hardlinked but have identical content. – Tilman Schmidt Dec 11 '20 at 09:49
  • @TilmanSchmidt "you'll have to do that after the fact by looking for files with the same checksum, deleting one, and adding a hard link to replace it" You'll also have to make sure they're on the same device – mpez0 Dec 13 '20 at 16:41
  • As explained in my question, that does not square with what I see in practice. – Tilman Schmidt Dec 14 '20 at 07:51
  • @TilmanSchmidt If the source files are different inodes (i.e., not hard links), after rsync the destination files will also be different inodes. If you want to regenerate hard links, you can rerun fdupes or your script of choice. – mpez0 Dec 14 '20 at 12:54