
I have a question. Have you ever used a tool that can quickly compare whether a huge number of files (say, thousands of files with a total size of up to 15 GB) are identical on two different Windows 2003 servers? I want to run a test to see if our backup tools are working well.

I have found Corz Checksum and Gizmo, which can each generate a single hash value for the parent folder, but both of them take a pretty long time to run. I am hoping to find a more efficient tool that I can use on my production server.

Thanks,

Ronin


1 Answer


`rsync -nacv <source> <destination>` will output a list of files that are different. As usual with rsync, the source and destination can each be local or remote.

  • The `-n` option does a dry run and doesn't actually transfer any files.
  • The `-a` option recursively checks every file and directory below the path you specify.
  • The `-c` option checksums every file. (The default compares timestamps and sizes instead.) The checksum used is MD5 for newer versions of rsync and MD4 for older versions.
  • The `-v` option prints out the results.
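For example, on two small demo trees (temporary directories standing in for the two servers; this assumes rsync is installed), the dry run lists only the file whose contents differ:

```shell
# Two throwaway trees: same.txt is identical in both, diff.txt differs.
a=$(mktemp -d); b=$(mktemp -d)
echo "same content" > "$a/same.txt"; cp "$a/same.txt" "$b/same.txt"
echo "alpha" > "$a/diff.txt"; echo "beta" > "$b/diff.txt"

# -n: dry run, -a: recurse, -c: compare by checksum, -v: list results.
rsync -nacv "$a/" "$b/"
# diff.txt appears in the output; same.txt does not.
```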

As far as efficiency is concerned, each file needs to be read in full from the disk and its hash calculated and sent to the destination; then the destination file is read from its disk and its hash calculated; and finally the two hashes are compared... for every file. This is true of any method and any software.
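The same read-hash-compare cycle can be sketched by hand with standard Unix tools (`md5sum` from GNU coreutils; the demo directories below are hypothetical stand-ins for the two servers):

```shell
# Build two demo trees with identical contents.
src=$(mktemp -d); dst=$(mktemp -d)
echo "payload" > "$src/file.txt"
cp "$src/file.txt" "$dst/file.txt"

# Read every file once and record its hash; sort so the lists line up.
( cd "$src" && find . -type f -exec md5sum {} + | sort ) > /tmp/src.md5
( cd "$dst" && find . -type f -exec md5sum {} + | sort ) > /tmp/dst.md5

# Empty diff output means every file hashed identically.
diff /tmp/src.md5 /tmp/dst.md5 && echo "trees match"
```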

The network transfer could be improved if you expect most of the files to be the same by combining more files into a single hash. The network is unlikely to be the bottleneck anyway since it only has hashes traveling across it.
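One way to get a single digest for a whole tree (a sketch of the idea, not something rsync itself does) is to hash the sorted list of per-file hashes:

```shell
tree=$(mktemp -d)
printf 'a\n' > "$tree/one.txt"
printf 'b\n' > "$tree/two.txt"

# Hash every file, then hash the sorted list of hashes: any change to
# any file's contents changes this one top-level digest.
( cd "$tree" && find . -type f -exec md5sum {} + | sort | md5sum )
```

A matching top-level digest means only one hash crossed the network; a mismatch tells you the trees differ but not where, so you would still need a per-file pass to locate the change.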

rsync splits the work across separate cooperating processes at the two ends (a sender, a generator and a receiver), so your disks should be fully utilised the whole time unless you end up CPU-bound, in which case your CPU(s) will be fully utilised.

Ladadadada
  • `rsync` seems to be stupid, always doing a CRC comparison when `-c` is given instead of treating a size difference as sufficient evidence on its own. So, taking the 15 GiB of data into consideration, absolutely not the way to go. – poige Feb 26 '13 at 00:33
  • If I tell `rsync` to do checksums, I expect it to do checksums. Timestamp and size combined make for a good *guess* that two files are identical, but not certainty. If close enough is good enough for you, feel free to drop the `-c` option. There's a reason timestamp plus size is the default. – Ladadadada Feb 26 '13 at 00:43
  • You obviously missed the point. And the point is that there is no reason to do a CRC when the file sizes already differ. For 15 GiB of files this is too dumb an approach to use. – poige Feb 26 '13 at 08:30
  • I did indeed miss that point. I have no idea whether rsync already does what you suggest or not. It would also be pointless calculating hashes on files less than 32 bytes in size, as the hash would be larger than the file. I have no idea whether rsync does that either. It wouldn't surprise me if rsync already does both. – Ladadadada Feb 26 '13 at 09:05
  • Another issue is that an identical CRC can't be 100% proof that the contents were identical as well, due to collisions. So the accuracy depends heavily on which hash is being used, and AFAIR `rsync` used MD4 and now uses MD5, which is quite OK but is currently thought to be not too reliable. – poige Feb 26 '13 at 09:54