4

Suppose you have an MD5 hash encoded in base64. Each character of the resulting 22-character string (excluding the trailing '==') carries only 6 bits of information. Thus each base64 MD5 hash could be packed into 6*22 = 132 bits, which is 25% less memory than the 8*22 = 176 bits of the original string.
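
To make the arithmetic concrete, here is a small sketch in Python 3 (the original question predates it, so the exact calls are an assumption about the reader's environment). The input b"hello" is only an illustrative value.

```python
import base64
import hashlib

digest = hashlib.md5(b"hello").digest()   # raw hash: 16 bytes
encoded = base64.b64encode(digest)        # 24 ASCII characters, ending in '=='
stripped = encoded.rstrip(b"=")           # 22 characters once the padding is dropped

print(len(digest) * 8)     # bits of actual hash data
print(len(stripped) * 6)   # bits if each base64 character were packed into 6 bits
print(len(stripped) * 8)   # bits when each character occupies a full byte
```

Note that the 132 packed bits are still 4 more than the 128 bits of the hash itself, because the last base64 character carries filler bits.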

Is there any Python module or function that lets you store base64 data in the way described above?

OTZ
  • 3
    Since base64 is an ASCII encoding of binary, why not just store it directly as binary? Wouldn't that be the most efficient way? (Have a look at http://docs.python.org/release/2.3/lib/module-base64.html - specifically the `decodestring` function.) – David Aug 07 '10 at 10:16
  • I am fully aware that I can just generate a md5 digest specifically for the example in the question, which is just 16 bytes long. But I'm not limiting this question to md5. It applies to all base64-encoded data. – OTZ Aug 07 '10 at 10:18
  • By the way, that documentation page is probably for the wrong version of Python...it was just one of the first results I got on Google, and I didn't check it properly. – David Aug 07 '10 at 10:18
  • @otz: Why does it matter whether it's an MD5 digest or not? **All** base64-encoded data should be convertible to a string of bytes (granted, not all of the bytes will be printable, and I don't know how Python handles NUL characters in strings - or whether it has a special byte buffer type that can handle them properly). – David Aug 07 '10 at 10:20
  • @David Thanks. I didn't realize there was a function that decodes base64 data into binary. You'd still have to add the padding at the end (if you deleted it) like base64.decodestring(md5_encode('hello')+'=='). But it does what I wanted to do. – OTZ Aug 07 '10 at 10:26
  • @David Can you post an "answer" saying "use decodestring", so that I can accept it and mark this question as 'solved'? – OTZ Aug 07 '10 at 10:29

4 Answers

8

The most efficient way to store base64 encoded data is to decode it and store it as binary. base64 is a transport encoding - there's no sense in storing data in it, especially in memory, unless you have a compelling reason otherwise.

Also, nitpick: The output of a hash function is not a hex string - that's just a common representation. The output of a hash function is some number of bytes of binary data. If you're using the md5, sha, or hashlib modules, for example, you don't need to encode it as anything in the first place - just call .digest() instead of .hexdigest() on the hash object.
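
As a rough illustration of the difference, assuming Python 3 and the hashlib module (the input b"hello" is just a placeholder):

```python
import hashlib

h = hashlib.md5(b"hello")
raw = h.digest()       # bytes object holding the 16 raw bytes of the MD5 hash
text = h.hexdigest()   # str of 32 hexadecimal characters, twice as long

# The hex string is only a printable view of the same bytes.
same = bytes.fromhex(text) == raw
print(len(raw), len(text), same)
```

Storing raw avoids any encoding step and takes half the space of the hex form.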

Nick Johnson
5

Simply decode the base64 data to binary:

>>> b64 = "COIC09jwcwjiciOEIWIUNIUNE9832iun"
>>> len(b64)
32
>>> b = b64.decode("base64")
>>> b
'\x08\xe2\x02\xd3\xd8\xf0s\x08\xe2r#\x84!b\x144\x85\r\x13\xdf7\xda+\xa7'
>>> len(b)
24
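
The session above relies on the Python 2 str.decode("base64") codec, which is not available on Python 3 strings. A roughly equivalent sketch for Python 3, assuming the standard base64 module, might look like this:

```python
import base64

b64 = "COIC09jwcwjiciOEIWIUNIUNE9832iun"
b = base64.b64decode(b64)   # returns a bytes object

print(len(b64), len(b))     # 32 characters in, 24 bytes out
```

Every 4 base64 characters decode to 3 bytes, which is where the 32 to 24 reduction comes from.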
Ned Batchelder
4

"store base64 data"

Don't.

Do. Not. Store. Base64. Data.

Base64 is built by encoding something to make it bigger.

Store the original something. Never store the base64 encoding of something.

S.Lott
1

David gave an answer that works on all base64 strings.

Just use base64.decodestring in the base64 module. That is,

import base64
binary = base64.decodestring(base64_string)

is a more memory-efficient representation of the original base64 string. If you are truncating the trailing '==' in your base64 md5, use it like

base64.decodestring(md5+'==')
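
base64.decodestring was removed in Python 3.9, so on current Python the same idea would probably be written with base64.b64decode. The sketch below also restores the padding for any input length rather than hard-coding '=='. The helper name decode_unpadded is hypothetical, and the sample string is the unpadded base64 MD5 of b"hello".

```python
import base64

def decode_unpadded(b64_text):
    """Hypothetical helper: re-append stripped '=' padding, then decode."""
    # base64 text must be a multiple of 4 characters long; add what is missing.
    padding = "=" * (-len(b64_text) % 4)
    return base64.b64decode(b64_text + padding)

binary = decode_unpadded("XUFAKrxLKna5cZ2REBfFkg")   # 22 characters in
print(len(binary))                                   # number of raw bytes out
```

Computing the padding from the length keeps the helper usable for data other than MD5 hashes, which was the point of the question.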
OTZ