I am running a Linux-based rsync server for software distribution. A Windows-based source repository server, which is outside my control, pushes software packages to it via rsync, and about a hundred satellite servers worldwide pull from it, also via rsync.
The source repository contains many big duplicate files. I want to reduce disk space and bandwidth consumption on the satellite servers by replacing those duplicates with hardlinks. The administrator of the source repository is unwilling or unable to do so at the source, so I'm trying to do it after the fact on the distribution server. I have created a simple bash script based on the fdupes command which finds groups of duplicates and replaces them with hardlinks to a single file. The rsync transfers to the satellite servers preserve these hardlinks as desired thanks to the -H option.
The transfer from the source repository, however, produces inconsistent results. Sometimes the deduplication is preserved. Sometimes the source server retransmits all of the files of a deduplicated group and the deduplication is broken, even though the source files did not change.
Hence my question: What is the official behaviour of rsync when it is asked to sync two identical but separate files that already exist in the destination with the correct content, but as hardlinks to the same file? What are the exact criteria for retransmitting a file? Is there a way to ensure that the hardlink in the destination is preserved in that situation even though the hardlink does not exist in the source?
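For reference, the deduplication step mentioned above can be sketched roughly as follows. This is a minimal illustration rather than the exact script: the function name `link_groups` and the `REPO` placeholder are hypothetical, and it assumes that filenames contain no newlines and that all files live on the same filesystem. It relies on the default fdupes output format, where each group of duplicates is listed one path per line and groups are separated by a blank line.

```shell
#!/usr/bin/env bash
# Sketch: replace each group of duplicates reported by fdupes with hardlinks.
# Assumptions: no newlines in filenames, all files on one filesystem.

# Reads fdupes-style output on stdin: one path per line, groups separated
# by a blank line. The first path of each group is kept; every other path
# in the group is replaced by a hardlink to it.
link_groups() {
    local keep="" path
    while IFS= read -r path; do
        if [[ -z "$path" ]]; then
            keep=""                    # blank line: the current group ends
        elif [[ -z "$keep" ]]; then
            keep="$path"               # first file of a new group
        else
            ln -f -- "$keep" "$path"   # replace the duplicate with a hardlink
        fi
    done
}

# Hypothetical usage (REPO stands for the rsync module directory):
#   fdupes -r -q "$REPO" | link_groups
```

Files that are already hardlinked to each other are not reported as duplicates by fdupes unless its `-H` option is given, so repeated runs only touch groups whose links were broken by a retransmission.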