I ran a few experiments where I saved a DataFrame of random integers to parquet with brotli compression. One of my tests was to find the size ratio between storing as 32-bit integers vs 64-bit:
import numpy as np
import pandas as pd
from pathlib import Path

df = pd.DataFrame(
    np.random.randint(0, 10_000_000, size=(1_000_000, 4)), columns=["a", "b", "c", "d"]
)
df.astype("Int32").to_parquet("/tmp/i32.parquet", compression="brotli")
i32_size = Path("/tmp/i32.parquet").stat().st_size
df.astype("Int64").to_parquet("/tmp/i64.parquet", compression="brotli")
i64_size = Path("/tmp/i64.parquet").stat().st_size
print(i64_size / i32_size)
I expected this to print a number greater than 1, since an INT64 column should take more space than an INT32 one, but I actually get ~0.96. Why is that?
I've checked with parquet-tools, and the files are definitely saved as INT32 and INT64, respectively. If I use gzip compression instead, I do get a ratio > 1.