I have a file storage server that stores files on disk using the file's sha256 hash as the filename, along with the original file extension, sharded into two levels of subdirectories taken from the leading hex digits of the hash. For example, a PDF file with sha256 hash AABB1F1C6FC86DB2DCA6FB0167DE8CF7288798271EA24B68D857CBC5CF8DC66A
would be stored at:
<root>/AA/BB/AABB1F1C6FC86DB2DCA6FB0167DE8CF7288798271EA24B68D857CBC5CF8DC66A.pdf
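For concreteness, here is a sketch of how such a path could be derived, matching the example above. The function name `hash_path` and its arguments are illustrative, not part of my actual server:

```shell
# Sketch: derive the sharded storage path for a file, matching the
# example path above. hash_path, its arguments and the root directory
# are illustrative names only.
hash_path() {
    root=$1; file=$2
    # Uppercase sha256 hash of the file contents
    hash=$(sha256sum "$file" | awk '{print toupper($1)}')
    ext=${file##*.}                          # original file extension
    d1=$(printf '%s' "$hash" | cut -c1-2)    # first directory level
    d2=$(printf '%s' "$hash" | cut -c3-4)    # second directory level
    printf '%s/%s/%s/%s.%s\n' "$root" "$d1" "$d2" "$hash" "$ext"
}
```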
Files are added to the directory structure, but never deleted or modified.
I keep a live copy of this file structure with a cron job that runs every 10 minutes and uses rsync to push the files to a remote server. Since files are never deleted or changed once added, in practice it only sends new files.
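The job looks roughly like this (the paths, hostname and exact flags here are placeholders, not my real crontab entry):

```shell
# Hypothetical crontab entry: push the store to the remote every 10 minutes.
# /srv/store and backup@remote are placeholder names.
*/10 * * * * rsync -a /srv/store/ backup@remote:/srv/store/
```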
I've found that the bandwidth used by rsync just to compare the two directories (i.e. when there are no changes at all) is about 11 MB, and it grows as the total number of files grows (148 207 at the moment). That makes sense: rsync effectively has to send a list of all the filenames to the remote server to work out which ones are missing there.
So my question is: is there a way to reduce the bandwidth used? It doesn't have to be an rsync-based solution, though one would be preferable. I was thinking of limiting the files rsync looks at to only recently modified ones, i.e. those modified after the last sync, but it seems that approach is not recommended: rsync only files created or modified after a date and time
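For what it's worth, the timestamp-filtered variant I was considering would look something like the sketch below. The marker file `/var/run/last-sync`, the paths and the hostname are all hypothetical; it relies on GNU find's `-printf '%P\n'` to emit paths relative to the source root, as `--files-from` expects:

```shell
# Sketch of the timestamp-based approach I was considering (and that the
# linked answer advises against): only offer rsync files newer than a
# marker file, then advance the marker on success. All names are hypothetical.
find /srv/store -type f -newer /var/run/last-sync -printf '%P\n' \
    | rsync -a --files-from=- /srv/store/ backup@remote:/srv/store/ \
    && touch /var/run/last-sync
```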
Any other suggestions?