3

I have three MD5 sums that I need to combine into a single hash. The new hash should be 32 characters long; it is case-sensitive and may contain any letter or digit. What's the best way to do this in Python?

ensnare
  • Are the three hashes an ordered set or an unordered set? In other words, may `H(a,b,c) == H(b,c,a)`? Maybe it even *should* produce the same hash value? – wildplasser Jan 25 '12 at 20:37
  • Forget about "unique" - you're trying to squeeze 96 characters into 32. The best you can hope for is "astronomically unlikely to collide". – Mark Ransom Jan 25 '12 at 20:37
  • @Mark Ransom: well, it's a little better than that. He's trying to squeeze 384 bits into ~190.5, because he's moving from 16 letters to 62. – DSM Jan 25 '12 at 20:40
  • @wildplasser: This is an ordered set – ensnare Jan 25 '12 at 20:41
  • In that case, concatenate the strings and calculate a hash of the concatenated string (as Mark Ransom proposed below). In the unordered case, you could first *order* the three hashes (e.g. alphabetically; anything goes, as long as it leads to a canonical form). – wildplasser Jan 25 '12 at 20:45
  • Does it need to be reversible, or just one-way? If the former, can the characters be unichars, i.e., ord(c) > 255? That would be the only way in that case. – jcomeau_ictx Jan 25 '12 at 20:50
  • What's the best way to concatenate this new string so that it's 32-char alphanumeric case sensitive? – ensnare Jan 25 '12 at 20:51
  • @jcomeau_ictx: one way is fine! – ensnare Jan 25 '12 at 20:51
  • ok then, you already have some good answers. – jcomeau_ictx Jan 25 '12 at 20:52
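
The two points raised in the comments above (how many bits fit in the output, and how to handle the unordered case) can be checked with a short Python 3 sketch. The helper name `canonical_join` is hypothetical, introduced only for illustration:

```python
import math

# Bits carried by three MD5 hex digests: 96 hex characters, 4 bits each.
input_bits = 96 * math.log2(16)

# Bits that fit in 32 characters drawn from a 62-symbol alphabet (a-z, A-Z, 0-9).
output_bits = 32 * math.log2(62)

def canonical_join(hex_hashes, ordered=True):
    """Concatenate the hashes; sort them first when order must not matter."""
    parts = list(hex_hashes) if ordered else sorted(hex_hashes)
    return "".join(parts)
```

Since `output_bits` is smaller than `input_bits`, the mapping cannot be unique, which is why the answers below settle for a one-way hash.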

4 Answers

5

I would start by combining the MD5 hashes into a single hash. You can use SHA-256, since its digest is longer (32 bytes versus MD5's 16):

>>> import hashlib
>>> combined = hashlib.sha256()
>>> combined.update(hashlib.md5('test1').digest())
>>> combined.update(hashlib.md5('test2').digest())
>>> combined.update(hashlib.md5('test3').digest())

Then you can use base64 to encode it using letters, numbers, and a few extra symbols:

>>> import base64
>>> base64.b64encode(combined.digest())
'PeFC3irNFx8fuzwjAz+fE/up9cz6xujs2Z06IH2GdUM='

If you want exactly 32 characters, slice off the rest:

>>> base64.b64encode(combined.digest())[:32]
'PeFC3irNFx8fuzwjAz+fE/up9cz6xujs'

The result can contain + and / in addition to the letters and digits your question asks for. If you want to replace them, you can use the second parameter of b64encode, altchars:

>>> base64.b64encode(combined.digest(), altchars="AA")[:32]
'PeFC3irNFx8fuzwjAzAfEAup9cz6xujs'
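
The session above is written for Python 2, where `md5` accepts a plain `str`. A minimal Python 3 sketch of the same steps might look like the following; the function name `combine_md5_hex` is hypothetical, and it assumes the inputs are MD5 hex digest strings. As in the session, both `+` and `/` are mapped to `A`, which is fine for a one-way value but means the encoding cannot be reversed.

```python
import base64
import hashlib

def combine_md5_hex(hex_hashes):
    """Hypothetical helper: fold MD5 hex digests into 32 alphanumeric characters."""
    combined = hashlib.sha256()
    for h in hex_hashes:
        # Feed the raw 16 bytes of each MD5 sum, in the order given.
        combined.update(bytes.fromhex(h))
    # 32 digest bytes become 44 base64 characters; '+' and '/' both map to 'A'.
    encoded = base64.b64encode(combined.digest(), altchars=b"AA")
    # The single '=' padding character sits at the very end, so the slice avoids it.
    return encoded[:32].decode("ascii")

sums = [hashlib.md5(s).hexdigest() for s in (b"test1", b"test2", b"test3")]
token = combine_md5_hex(sums)
```

Because the hashes are fed in sequence, passing them in a different order gives a different token, which matches the ordered-set requirement from the comments.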
jterrace
2

The easiest way would be to combine the 3 sums into a single 96-character string and run an MD5 hash on that.
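
As a rough Python 3 sketch of that suggestion (the function name `md5_of_concatenation` is hypothetical, and the inputs are assumed to be hex digest strings):

```python
import hashlib

def md5_of_concatenation(hex_hashes):
    """Join the hex digests into one string and hash that string with MD5."""
    joined = "".join(hex_hashes)  # 96 characters for three MD5 sums
    return hashlib.md5(joined.encode("ascii")).hexdigest()

sums = [hashlib.md5(s).hexdigest() for s in (b"test1", b"test2", b"test3")]
result = md5_of_concatenation(sums)
```

The result is 32 characters long, but it only uses the 16 lowercase hex symbols rather than the full set of letters and digits.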

Mark Ransom
  • OP says result should be *any* letter and number, case sensitive. The result would be 32-chars of hex. – jterrace Jan 25 '12 at 20:44
1
>>> from hashlib import md5
>>> import base64
>>> hashes = [md5(str(i)).hexdigest() for i in range(3)]
>>> hashes
['cfcd208495d565ef66e7dff9f98764da', 'c4ca4238a0b923820dcc509a6f75849b', 'c81e728d9d4c2f636f067f89cc14862c']
>>> base64.b64encode(md5(''.join(hashes)).hexdigest())[:32]
'YTg2N2M3N2U0Mzg2YjY1YWY4NzYzOWZh'
tMC
  • As jterrace pointed out, base64 output can contain + and / characters. You can change the two extra characters it uses: http://docs.python.org/library/base64.html – tMC Jan 25 '12 at 20:53
1

Here's yet another way, taking "characters" to mean any Unicode code point. This is what I came up with, including my fumbling around:

>>> hashes = ['96a77af1cce6dc64ed5d4c381bb7f143',
...  '11b13de4792e0407aae4a40fd6e4e2d4',
...  'eec7e31c5e2890adaf0d999835c976fc',
... ]
>>> int(''.join(hashes), 16)
23187806638669244987192443940605368881272088351426889142645412473142674081465702767335075936780031545889279263209212L
>>> n=_
>>> (48 * 8) / 32  # calculating bits per character
12
>>> 1 << 12
4096
>>> chars = []
>>> for i in range(32):
...  chars.append(unichr(n % 4096))
...  n /= 4096
... 
>>> chars
[u'\u06fc', u'\u0c97', u'\u0835', u'\u0999', u'\u0f0d', u'\u0ada', u'\u0890', u'\u05e2', u'\u031c', u'\u0c7e', u'\u04ee', u'\u0e2d', u'\u06e4', u'\xfd', u'\u04a4', u'\u0aae', u'\u0407', u'\u02e0', u'\u0479', u'\u03de', u'\u01b1', u'\u0431', u'\u07f1', u'\u01bb', u'\u0c38', u'\u05d4', u'\u04ed', u'\u0dc6', u'\u0ce6', u'\u0f1c', u'\u077a', u'\u096a']
>>> ''.join(chars)
u'\u06fc\u0c97\u0835\u0999\u0f0d\u0ada\u0890\u05e2\u031c\u0c7e\u04ee\u0e2d\u06e4\xfd\u04a4\u0aae\u0407\u02e0\u0479\u03de\u01b1\u0431\u07f1\u01bb\u0c38\u05d4\u04ed\u0dc6\u0ce6\u0f1c\u077a\u096a'
>>> print _
ۼಗ࠵ঙ།૚࢐ע̜౾ӮอۤýҤમЇˠѹϞƱб߱ƻసהӭෆ೦༜ݺ४

I would probably have had to use 13 bits per character to avoid any punctuation, but I didn't want to invest the time since you didn't care about reversibility anyway.

[later] nope, didn't have to:

>>> hashes = ['96a77af1cce6dc64ed5d4c381bb7f143',
...  '11b13de4792e0407aae4a40fd6e4e2d4',
...  'eec7e31c5e2890adaf0d999835c976fc',
... ]
>>> charlist = filter(lambda c: c.isalnum(), map(unichr, range(8000)))
>>> len(charlist)
5032
>>> n = int(''.join(hashes), 16)
>>> n
23187806638669244987192443940605368881272088351426889142645412473142674081465702767335075936780031545889279263209212L
>>> chars = []
>>> for i in range(32):
...  chars.append(charlist[n % 4096])
...  n /= 4096
... 
>>> chars
[u'\u0b67', u'\u1448', u'\u0dc5', u'\u10f4', u'\u16cf', u'\u124a', u'\u0ea7', u'\u0931', u'\u0442', u'\u142f', u'\u06c7', u'\u15de', u'\u0b26', u'\u0178', u'\u067d', u'\u121d', u'\u0542', u'\u0406', u'\u0638', u'\u050c', u'\u022c', u'\u0575', u'\u0d6b', u'\u0236', u'\u13dd', u'\u0923', u'\u06c6', u'\u1577', u'\u1497', u'\u16de', u'\u0c87', u'\u10bb']
>>> print ''.join(chars)
୧ᑈළჴᛏቊວऱтᐯۇᗞଦŸٽምՂІظԌȬյ൫ȶᏝणۆᕷᒗᛞಇႻ
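
The sessions above are Python 2 (`unichr`, integer `/`, `print` statement). A possible Python 3 version of the second attempt is sketched below; the names `ALPHABET`, `encode` and `decode` are hypothetical. It assumes the filter still yields at least 4096 alphanumeric code points below 8000; the exact count depends on the Unicode tables of the interpreter, so the characters produced may differ from the output shown above. Since 32 characters at 12 bits each hold exactly 384 bits, this variant is reversible.

```python
BASE = 4096  # 12 bits per character; 32 characters cover all 384 bits

# Alphanumeric code points below 8000, mirroring the filter in the session above.
ALPHABET = [chr(i) for i in range(8000) if chr(i).isalnum()]

def encode(hex_hashes):
    """Write the combined 384-bit integer as 32 base-4096 digits, lowest digit first."""
    n = int("".join(hex_hashes), 16)
    chars = []
    for _ in range(32):
        n, digit = divmod(n, BASE)
        chars.append(ALPHABET[digit])
    return "".join(chars)

def decode(text):
    """Rebuild the 96-character lowercase hex string from the encoded text."""
    n = 0
    for ch in reversed(text):  # highest digit is last, so walk backwards
        n = n * BASE + ALPHABET.index(ch)
    return format(n, "096x")

hashes = ['96a77af1cce6dc64ed5d4c381bb7f143',
          '11b13de4792e0407aae4a40fd6e4e2d4',
          'eec7e31c5e2890adaf0d999835c976fc']
encoded = encode(hashes)
```

Calling `decode(encoded)` should return the three hex digests joined back together.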
jcomeau_ictx