3

I don't know much about hash algorithms.

I need to compute the hash of an incoming file live, in Java, before forwarding the file to a remote system (a bit like S3) which requires a file hash in MD2/MD5/SHA-X. This hash is not computed for security reasons but simply as a consistency checksum.

I am able to compute this hash live while forwarding the file, using a DigestInputStream from the Java standard library, but I would like to know which algorithm is best to use to avoid the performance problems of using a DigestInputStream.
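
For illustration, here is roughly the pattern I use (simplified: the OutputStream target and the forwardAndHash name are placeholders for our real transfer code):

    import java.io.InputStream;
    import java.io.OutputStream;
    import java.security.DigestInputStream;
    import java.security.MessageDigest;

    // Forwards the incoming stream to the target while computing the digest on the fly
    public static byte[] forwardAndHash(InputStream in, OutputStream out, String algorithm) throws Exception {
        MessageDigest md = MessageDigest.getInstance(algorithm);
        try (DigestInputStream dis = new DigestInputStream(in, md)) {
            byte[] buffer = new byte[8192];
            int n;
            while ((n = dis.read(buffer)) != -1) {
                out.write(buffer, 0, n); // each read also updates the digest
            }
        }
        return md.digest();
    }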

One of my former colleagues tested this and told us that computing the hash live can be quite expensive compared to running a Unix command line tool on the file.


Edit about premature optimization: I work at a company whose goal is to help other companies dematerialize their documents. This means we have a batch process that handles document transfers from other companies. We are targeting millions of documents per day in the future, and currently the execution time of this batch is sensitive for our business.

A hashing optimization of 10 milliseconds per document, at 1 million documents per day, reduces the daily execution time by almost 3 hours (10 ms × 1,000,000 = 10,000 seconds ≈ 2.8 hours), which is pretty huge.

Sebastien Lorber
  • 2
    You should be able to hash more than 100MB/s on a decent machine using a single core, so unless you're using gigabit internet, it shouldn't really be a bottleneck. – CodesInChaos Oct 03 '13 at 11:00
  • 3
    Premature optimization is the root of all evil. I definitely think that you should choose a hash that is technically sufficient for what you try to achieve, and if it **proves** to have performance issues, make changes accordingly... – ppeterka Oct 03 '13 at 11:01
  • If you *really* need no security, then MD5 is an okay choice. But if you can afford the performance hit, go with SHA-2 (either SHA-256 or SHA-512) – CodesInChaos Oct 03 '13 at 11:13
  • @CodesInChaos I tried using the MessageDigest on 80MB files and it seems to take ~300ms more to consume the InputStream. – Sebastien Lorber Oct 03 '13 at 12:29
  • @ppeterka66 Just because I didn't give the whole context doesn't mean you can say things like that. For your information, this question could lead to an enhancement of a batch that handles a lot of files. The file hashing step of the batch can take up to 20 minutes per file chunk, so reducing the hashing time could cut this batch's execution time by about 20%, which is sensitive for our business case – Sebastien Lorber Oct 03 '13 at 12:32
  • 1
    @SebastienLorber With that number (260MB/s) hashing should only limit you if you have a 2Gb/s network connection. If it's really a limitation, you could switch to native code. Native MD5 should be somewhere between 500 and 1000 MB/s. – CodesInChaos Oct 03 '13 at 12:56
  • @CodesInChaos this is actually what we use, but for legacy reasons I won't explain, using the native code on a whole folder gives nice performance but forces us to hash each file with the 5 different algorithms. – Sebastien Lorber Oct 03 '13 at 13:00

3 Answers

5

If you simply want to detect accidental corruption during transmission, etc., then a simple (non-crypto) checksum should be sufficient. But note that (for example) a 16-bit checksum will fail to detect random corruption one time in 2^16. And it is no guard against someone deliberately modifying the data.

The Wikipedia page on Checksums lists various options, including a number of commonly used (and cheap) ones like Adler-32 and CRCs.
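
For instance (a minimal sketch, assuming the receiving side would accept such a checksum; the file name is a placeholder), Java ships Adler-32 and CRC-32 in java.util.zip, and they can be streamed just like a DigestInputStream via CheckedInputStream:

    import java.io.BufferedInputStream;
    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.zip.Adler32;
    import java.util.zip.CheckedInputStream;

    public class ChecksumExample {
        public static void main(String[] args) throws Exception {
            try (InputStream in = new BufferedInputStream(Files.newInputStream(Paths.get("bigFile.txt")));
                 CheckedInputStream cis = new CheckedInputStream(in, new Adler32())) {
                byte[] buffer = new byte[8192];
                // the checksum is updated transparently as the stream is consumed
                while (cis.read(buffer) != -1) { }
                System.out.println("Adler-32: " + cis.getChecksum().getValue());
            }
        }
    }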

However, I agree with @ppeterka. This smells of "premature optimization".

Stephen C
1

I know that a lot of people do not believe in micro-benchmarks, but let me post the results I got.

Input:

bigFile.txt = approx. 143MB in size

hashAlgorithm = MD2, MD5, SHA-1

test code:

    while (true) {
        long l = System.currentTimeMillis();
        MessageDigest md = MessageDigest.getInstance(hashAlgorithm);
        try (InputStream is = new BufferedInputStream(Files.newInputStream(Paths.get("bigFile.txt")))) {
            DigestInputStream dis = new DigestInputStream(is, md);
            // read the file one byte at a time; the digest is updated as each byte passes through
            while (dis.read() != -1) {
            }
        }
        byte[] digest = md.digest();
        System.out.println(System.currentTimeMillis() - l);
    }

results:

MD5
------
22030
10356
9434
9310
11332
9976
9575
16076
-----

SHA-1
-----
18379
10139
10049
10071
10894
10635
11346
10342
10117
9930
-----

MD2
-----
45290
34232
34601
34319
-----

It seems that MD2 is significantly slower than MD5 or SHA-1.

nkukhar
  • 1
    Thanks, but reading byte by byte gives bad performance results. I can read that file in 200ms without hashing, and in 300ms with MD5, which seems to give the best results – Sebastien Lorber Oct 03 '13 at 13:02
  • 1
    And yet MD2, MD5, SHA-1, or any cryptographic checksum is the wrong tool for the job. You are measuring the acceleration of a dump truck for suitability in an Indy car race in your microbenchmark. – President James K. Polk Oct 03 '13 at 13:08
  • @GregS can you explain what you mean? – Sebastien Lorber Oct 03 '13 at 13:24
  • @SebastienLorber: Your question indicates that you're looking to detect accidental file corruption rather than intentional file manipulation. Checksums like Adler-32, or CRCs (see Stephen C's answer), are immensely faster and more appropriate than MD-x or SHA-x. – President James K. Polk Oct 03 '13 at 14:16
  • actually the remote host to which we send the files does the hash check (I think this is a legal requirement in French dematerialization norms) and does not support checksum algorithms – Sebastien Lorber Oct 03 '13 at 14:42
1

Like NKukhar I've tried to do a micro-benchmark, but with different code and better results:

import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.util.Set;

import com.google.common.collect.ImmutableSet;
import org.apache.commons.codec.digest.MessageDigestAlgorithms;
import org.apache.commons.io.IOUtils;

public class HashBenchmark {

  public static void main(String[] args) throws Exception {
    String bigFile = "100mbfile";

    // We put the file bytes in memory: we don't want to measure the time it takes to read from the disk
    byte[] bigArray = IOUtils.toByteArray(Files.newInputStream(Paths.get(bigFile)));
    byte[] buffer = new byte[50_000]; // the byte buffer we will use to consume the stream

    // we prepare the algos to test
    Set<String> algos = ImmutableSet.of(
            "no_hash", // no hashing, as a baseline
            MessageDigestAlgorithms.MD5,
            MessageDigestAlgorithms.SHA_1,
            MessageDigestAlgorithms.SHA_256,
            MessageDigestAlgorithms.SHA_384,
            MessageDigestAlgorithms.SHA_512
    );

    int executionNumber = 20;

    for (String algo : algos) {
      long totalExecutionDuration = 0;
      for (int i = 0; i < executionNumber; i++) {
        long beforeTime = System.currentTimeMillis();
        InputStream is = new ByteArrayInputStream(bigArray);
        if (!"no_hash".equals(algo)) {
          is = new DigestInputStream(is, MessageDigest.getInstance(algo));
        }
        while (is.read(buffer) != -1) { }
        long executionDuration = System.currentTimeMillis() - beforeTime;
        totalExecutionDuration += executionDuration;
      }
      System.out.println(algo + " -> average of " + totalExecutionDuration / executionNumber + " millis per execution");
    }
  }
}
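
(For reference: IOUtils comes from Apache Commons IO, ImmutableSet from Google Guava, and the MessageDigestAlgorithms constants from Apache Commons Codec.)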

This produces the following output for a 100MB file on a good i7 developer machine:

no_hash -> average of 6 millis per execution
MD5 -> average of 201 millis per execution
SHA-1 -> average of 335 millis per execution
SHA-256 -> average of 576 millis per execution
SHA-384 -> average of 481 millis per execution
SHA-512 -> average of 464 millis per execution
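
Back-of-the-envelope, those averages work out to roughly 500MB/s for MD5 and about 300MB/s for SHA-1 on this machine.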
Sebastien Lorber