
My goal is to feed an object that supports the buffer protocol into one of hashlib's SHA-2 functions, such that hashes generated from the same underlying data in different execution environments are identical and can therefore be used for equality tests.

I would like this to work for arbitrary data types without having to write a bunch of boilerplate wrappers around bytes() or bytearray(), i.e., one function to which I can pass strings (with an encoding), numerics, and bools. Extra points if I can get the memory layout for a complex type like a dict or list.

I am looking at struct, and also at something like loading the data into a pandas DataFrame and then using Apache Arrow to access the memory layout directly.

I am looking for guidance on the most "pythonic" way to accomplish this.

Alex Flanagan
  • "I would like this to work for arbitrary data types" - not going to work. How would you hash a pipe, or a database connection, or other types that don't really represent *data*? Also, even for stuff that does represent data, there's no general API for this. – user2357112 Dec 01 '20 at 18:49
  • @user2357112supportsMonica I'm fine restricting to types that hold data and are combinations of primitives. Point taken that there's no general API for this -- that's what I'm validating to make sure the answer isn't something like, "oh just call `mem_repr(x)`" – Alex Flanagan Dec 01 '20 at 18:53
  • (Before someone suggests it, no, serializing objects with `pickle` is not the answer. Equal objects may produce unequal pickles. For example, `{0, 16}` and `{16, 0}` will produce different pickles on current CPython.) – user2357112 Dec 01 '20 at 18:58
  • https://docs.ray.io/en/master/serialization.html has given me some ideas, particularly this part: "Ray has decided to use a customized Pickle protocol version 5 backport to replace the original PyArrow serializer. This gets rid of several previous limitations (e.g. cannot serialize recursive objects). Ray is currently compatible with Pickle protocol version 5, while Ray supports serialization of a wider range of objects (e.g. lambda & nested functions, dynamic classes) with the support of cloudpickle." – Alex Flanagan Dec 01 '20 at 19:23
  • Oh hey, it's one of the projects actually using the out-of-band data feature! Note that that actually takes it even further away from your goal, though. Creating something usable as a hash input is not one of the goals of Ray's serialization. – user2357112 Dec 01 '20 at 19:32
  • Ahh, I was just going to say that Plasma looks like it does what I want :) – Alex Flanagan Dec 01 '20 at 19:33

1 Answer


import hashlib, struct
hashlib.sha256(struct.pack('!f', 12.3)).hexdigest()  # '!f': 32-bit float, network byte order

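Repeat for each native type with the matching struct format character; keeping the `!` prefix is what makes the packed bytes identical across platforms.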
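For the "one function" part of the question, here is a minimal sketch of how the per-type packing could be wrapped up. The names `canonical_bytes` and `stable_sha256`, the one-byte type tags, and the length prefixes are all my own assumptions for illustration, not an existing API. Sets and dicts are sorted by their encoded bytes so that iteration order does not leak into the digest, which is the problem raised in the comments about `pickle`.

```python
import hashlib
import struct


def canonical_bytes(obj, encoding="utf-8"):
    """Hypothetical helper: encode obj as platform-independent bytes."""
    if obj is None:
        return b"N"
    # bool must be tested before int, because bool is a subclass of int.
    if isinstance(obj, bool):
        return b"B" + struct.pack("!?", obj)
    if isinstance(obj, int):
        # Python ints are unbounded, so use length-prefixed big-endian bytes.
        raw = obj.to_bytes((obj.bit_length() + 8) // 8, "big", signed=True)
        return b"I" + struct.pack("!I", len(raw)) + raw
    if isinstance(obj, float):
        # '!d' is an 8-byte IEEE 754 double in network byte order.
        return b"F" + struct.pack("!d", obj)
    if isinstance(obj, str):
        raw = obj.encode(encoding)
        return b"S" + struct.pack("!I", len(raw)) + raw
    if isinstance(obj, (bytes, bytearray)):
        raw = bytes(obj)
        return b"Y" + struct.pack("!I", len(raw)) + raw
    if isinstance(obj, (list, tuple)):
        # Ordered containers keep their element order.
        parts = [canonical_bytes(x, encoding) for x in obj]
        return b"L" + struct.pack("!I", len(parts)) + b"".join(parts)
    if isinstance(obj, (set, frozenset)):
        # Unordered: sort the encoded elements to get a stable order.
        parts = sorted(canonical_bytes(x, encoding) for x in obj)
        return b"T" + struct.pack("!I", len(parts)) + b"".join(parts)
    if isinstance(obj, dict):
        # Sort (key, value) pairs by their encoded key bytes.
        items = sorted(
            (canonical_bytes(k, encoding), canonical_bytes(v, encoding))
            for k, v in obj.items()
        )
        body = b"".join(k + v for k, v in items)
        return b"D" + struct.pack("!I", len(items)) + body
    raise TypeError(f"unsupported type: {type(obj).__name__}")


def stable_sha256(obj, encoding="utf-8"):
    """Hypothetical wrapper: hex SHA-256 digest of the canonical encoding."""
    return hashlib.sha256(canonical_bytes(obj, encoding)).hexdigest()
```

With this sketch, `stable_sha256({0, 16})` and `stable_sha256({16, 0})` give the same digest. Some caveats: it hashes a canonical encoding rather than the in-memory layout, a list and a tuple with the same elements hash the same, and values that compare equal across types (`1`, `1.0`, `True`) hash differently, as do `0.0` and `-0.0`. Whether that is what you want depends on your definition of equality.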

Alex Flanagan