
So, I have a couple of system backup image files that are around 1 terabyte each, and I want to quickly calculate a hash of each one of them (preferably SHA-1).

At first I tried to calculate the MD5 hash; 2 hours passed and the hash still hadn't been computed (which, in hindsight, is to be expected for files as large as 1 TB).

So is there any program/implementation out there that can hash a 1TB file quickly?

I have heard of tree hashing, which hashes parts of a file simultaneously, but I haven't found any implementation so far.

Light Flow

3 Answers


If you have a 1 million MB file, and your system can read this file at 100 MB/s, then

  • 1 TB * 1000 (GB/TB) = 1000 GB
  • 1000 GB * 1000 (MB/GB) = 1 million MB
  • 1 million MB / 100 (MB/s) = 10 thousand seconds
  • 10,000 s / 3600 (s/hr) = 2.77... hr
  • Therefore, a 100 MB/s system has a hard floor of about 2.77 hours just to read the file, before whatever additional time may be required to compute a hash.

Your expectations are probably unrealistic - don't try to calculate a faster hash until you can perform a faster file read.
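
To find out what your storage actually delivers, a quick sequential-read check with GNU dd is one option (the path is a placeholder; dd prints the transfer rate when it finishes):

# Read the first 1 GiB of the image and discard it, just to measure throughput
dd if=/path/to/backup.img of=/dev/null bs=1M count=1024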

Anti-weakpasswords
  • You are right, but I doubt my system can read the file at 100 MB/s :( Are there alternative ways to read a file faster? – Light Flow Mar 31 '14 at 19:23
  • To read a file faster, you need faster storage, which mostly means you're about to spend a lot of money. Striping the backup across multiple disks in your backup software, or [RAID](http://www.newegg.com/Product/Product.aspx?Item=N82E16816151121), or [SSD](http://www.newegg.com/Product/Product.aspx?Item=9SIA29P1EC5324)s, or SSDs in a RAID, or PCIe SSDs, or a RAMDisk; all will work. Alternately, if you're using open source backup software, have it calculate the hash while it's writing the output (see the sketch below these comments). I cannae change the laws of physics! – Anti-weakpasswords Mar 31 '14 at 23:13
  • Hm, yeah, it seems that this is the truth. However, i will spare some time before i accept your answer in case someone else wants to add something. – Light Flow Apr 02 '14 at 21:35
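
A minimal sketch of the hash-while-writing suggestion from the comments above (the source device and destination path are placeholders):

# Read the source once: tee writes the image to disk while sha1sum hashes the same stream
dd if=/dev/sda bs=1M | tee /mnt/backup/backup.img | sha1sum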

Old and already answered, but you could try hashing only selected chunks of the file.

There is a Perl solution I found somewhere that seems effective (the code is not mine):

#!/usr/bin/perl

use strict;
use warnings;
use Time::HiRes qw[ time ];
use Digest::MD5;

sub quickMD5 {
    my $fh  = shift;
    my $md5 = Digest::MD5->new;    # fixed: the original "new Digest::MD5->new" was redundant

    # Mix the file size into the digest so a truncated copy hashes differently.
    $md5->add( -s $fh );

    # Hash the first 4 KB of every 4 MB (2048**2 bytes) block.
    my $pos = 0;
    until ( eof $fh ) {
        seek $fh, $pos, 0;
        read( $fh, my $block, 4096 ) or last;
        $md5->add( $block );
        $pos += 2048**2;
    }
    return $md5;
}

open FH, '<', $ARGV[0] or die $!;
binmode FH;    # read the image as binary data

printf "Processing %s : %u bytes\n", $ARGV[0], -s FH;

my $start = time;
my $qmd5 = quickMD5( *FH );
printf "Partial MD5 took %.6f seconds\n", time() - $start;
print "Partial MD5: ", $qmd5->hexdigest, "\n";

Basically, the script computes an MD5 over the first 4 KB of every 4 MB block in the file (the original version sampled every 1 MB).
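
Assuming the script above is saved as quickmd5.pl (the filename is my choice), it can be invoked as:

perl quickmd5.pl /path/to/backup.img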

  • Hm, interesting idea! However, there will be a problem if only a small part of the file is corrupted after the first 4 KB of a 4 MB block.. But anyway, it is something nice that I didn't know!! Thanks for sharing!! :) – Light Flow May 26 '16 at 14:12
  • I usually do MD5 of 3-4 TB files that I copy DAILY to external storage. So far I've had no surprises; btw, with standard NAS performance a full MD5, also given the copy time, is not an option for me! Also consider you're doing MD5 of more than 250,000-300,000 data blocks, which SHOULD be acceptably safe with such large files. –  May 26 '16 at 15:35
  • How do I use this script to check the complete filesystem and write the results to a file instead of checking just one file? – Sebastian Roy Jun 12 '19 at 10:27
  • You'll just need to call it externally (i.e. with a bash script). Can't code right now, but piping the output of a "find" command on the relevant folders of the filesystem to the perl script should do the trick. In your bash script you should have something like "find /folder/ -type f | xargs perl_script" and redirect to a log file (a sketch follows below). –  Jun 13 '19 at 11:20
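
A rough sketch of the wrapper described in that last comment (the folder, log name, and script name are placeholders):

#!/bin/bash
# Run the partial-MD5 script on every regular file and collect the output
find /folder -type f -print0 | xargs -0 -n1 perl quickmd5.pl >> hashes.log 2>&1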

I suggest you take a look at the non-cryptographic hashes (e.g. xxHash and MurmurHash3); they are much faster than MD5 until, of course, you reach your maximum read speed.
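
For example, the xxhsum utility that ships with xxHash computes a 64-bit XXH64 digest by default (the path is a placeholder):

# Much faster than md5sum/sha1sum on fast storage; still I/O-bound on slow disks
xxhsum /path/to/backup.img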

gmansour