
I have a cached hasher instance:

import hashlib

m1 = hashlib.md5()
m1.update(b'very-very-long-data')
cached_sum = m1

and I would like to update an external hasher with the sum cached earlier:

def append_cached_hash(external_hasher):
    # something like this
    external_hasher.update(cached_sum)

Unfortunately, this does not work, as update() expects bytes. I could pass the same b'very-very-long-data' bytes again, but that defeats the whole idea of pre-computing the md5 sum for the common long-data object.

I could do something like the following:

external_hasher.update(cached_sum.hexdigest())

However, it does not produce the same result as:

external_hasher.update(b'very-very-long-data')

How could I implement the function above?


The same problem can be formulated differently. There are 3 big data sets, and it is necessary to calculate md5 sums in Python for all possible combinations of them. It is allowed to calculate md5 only once for each data source.

m1 = hashlib.md5(b'very-big-data-1')
m2 = hashlib.md5(b'very-big-data-2')
m3 = hashlib.md5(b'very-big-data-3')

What should I write in the second parameter of the following print functions to achieve the goal?

print("sum for data 1 and data 2 is:", m1.update(m2))
print("sum for data 1 and data 3 is:", m1.update(m3))
print("sum for data 2 and data 3 is:", m2.update(m3))
print("sum for data 1, data 2 and data 3 is:", m1.update(m2.update(m3)))

Thanks in advance for your help!

Andrew

1 Answer


A hash function is a one-way function that consumes a variable-length sequence of bytes and produces a fixed-length sequence, the hash. The hashlib implementation follows this and doesn't provide a way to pull the input sequence back out, at least not an obvious one.

IMO it also makes sense from an OOP perspective: such a hash object represents a hash, so it can be used in its place and passed around without unauthorized code being able to read the original input. I'm not sure whether hashlib objects are really that secure, though.

So to calculate all the combinations, you need to keep the data sets available and use them directly. You can, however, use the hash.copy() method to reuse partial hashing results, as advised in the docs:

hash.copy()

Return a copy (“clone”) of the hash object. This can be used to efficiently compute the digests of strings that share a common initial substring.

import hashlib

d1 = b'data-1'
d2 = b'data-2'
d3 = b'data-3'

h1 = hashlib.md5(d1)
# instead of hashlib.md5(d1).update(d2), or hashlib.md5(d1 + d2)
h12 = h1.copy()
h12.update(d2)
# instead of hashlib.md5(d1).update(d3), or hashlib.md5(d1 + d3)
h13 = h1.copy()
h13.update(d3)

h2 = hashlib.md5(d2)
# instead of hashlib.md5(d2).update(d1), or hashlib.md5(d2 + d1)
h21 = h2.copy()
h21.update(d1)

# ...
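
For completeness, a short usage sketch continuing the snippet above; since md5 hashes the byte stream incrementally, the copied-and-updated hasher yields the same digest as hashing the concatenated data:

# continuing the example above
print("md5 of d1 + d2:", h12.hexdigest())
print("md5 of d1 + d3:", h13.hexdigest())
print("md5 of d2 + d1:", h21.hexdigest())
# the copied hasher matches hashing the concatenated bytes directly
assert h12.hexdigest() == hashlib.md5(d1 + d2).hexdigest()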

What about hashing a sum of the partial hashes? Would that be of use to you?
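
A minimal sketch of that idea, assuming a "hash of the cached partial hashes" is acceptable for your use case; note that the combined value is not the same as the md5 of the concatenated raw data, and the data literals below are just hypothetical stand-ins for the big inputs:

import hashlib

d1 = b'very-big-data-1'
d2 = b'very-big-data-2'
d3 = b'very-big-data-3'

# hash each data set exactly once and cache the digests
digest1 = hashlib.md5(d1).digest()
digest2 = hashlib.md5(d2).digest()
digest3 = hashlib.md5(d3).digest()

def combined(*digests):
    # hash the concatenation of the cached per-dataset digests
    m = hashlib.md5()
    for d in digests:
        m.update(d)
    return m.hexdigest()

print("sum for data 1 and data 2 is:", combined(digest1, digest2))
print("sum for data 1 and data 3 is:", combined(digest1, digest3))
print("sum for data 2 and data 3 is:", combined(digest2, digest3))
print("sum for data 1, data 2 and data 3 is:", combined(digest1, digest2, digest3))

Each big data set is read only once; only the small fixed-size digests are re-hashed for each combination.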

famousgarkin
  • Thanks, copy() allows minimizing the number of update() calls over the big data sets. However, it still requires a second pass over d2 and d1 in your answer :( and this becomes even less efficient with more than 3 data sets. I am looking into zlib.crc32, which allows calculating checksums and summing them as numbers, but I would like to use something better (more bits at least) to avoid collisions... Could you recommend an alternative? – Andrew Jul 11 '14 at 09:13
  • @Andrew I see. I think there's no way around that. Hashing big files, for example, is also done by reading chunks and hashing incrementally. What's the purpose of these hashes? CRC cannot be used in place of cryptographic hashes just like that: it won't detect intentional data changes and it is reversible. What about hashing a sum of the partial hashes? Would that be of use to you? – famousgarkin Jul 11 '14 at 18:02
  • Yes, hashing a sum of the partial hashes works for me. Thank you! – Andrew Jul 11 '14 at 19:49
  • @Andrew Added the hash of hashes suggestion to the answer, as it solved the problem. I'll be glad if you accept it now, thank you! – famousgarkin Jul 19 '14 at 10:35