I'm seeking advice from people deeply familiar with the binary layout of Apache Parquet:
Having a data transformation F(a) = b
where F
is fully deterministic, and same exact versions of the entire software stack (framework, arrow & parquet libraries) are used - how likely am I to get an identical binary representation of dataframe b
on different hosts every time b
is saved into Parquet?
In other words how reproducible Parquet is on binary level? When data is logically the same what can cause binary differences?
- Can there be some uninit memory in between values due to alignment?
- Assuming all serialization settings (compression, chunking, use of dictionaries etc.) are the same, can result still drift?
Context
I'm working on a system for fully reproducible and deterministic data processing and computing dataset hashes to assert these guarantees.
My key goal has been to ensure that dataset b
contains an idendital set of records as dataset b'
- this is of course very different from hashing a binary representation of Arrow/Parquet. Not wanting to deal with the reproducibility of storage formats I've been computing logical data hashes in memory. This is slow but flexible, e.g. my hash stays the same even if records are re-ordered (which I consider an equivalent dataset).
But when thinking about integrating with IPFS
and other content-addressable storages that rely on hashes of files - it would simplify the design a lot to have just one hash (physical) instead of two (logical + physical), but this means I have to guarantee that Parquet files are reproducible.
Update
I decided to continue using logical hashing for now.
I've created a new Rust crate arrow-digest that implements the stable hashing for Arrow arrays and record batches and tries hard to hide the encoding-related differences. The crate's README describes the hashing algorithm if someone finds it useful and wants to implement it in another language.
I'll continue to expand the set of supported types as I'm integrating it into the decentralized data processing tool I'm working on.
In the long term, I'm not sure logical hashing is the best way forward - a subset of Parquet that makes some efficiency sacrifices just to make file layout deterministic might be a better choice for content-addressability.