
I ran a few experiments where I saved a DataFrame of random integers to parquet with brotli compression. One of my tests was to find the size ratio between storing as 32-bit integers vs 64-bit:

import numpy as np
import pandas as pd
from pathlib import Path

df = pd.DataFrame(
    np.random.randint(0, 10000000, size=(1000000, 4)), columns=["a", "b", "c", "d"]
)

df.astype("Int32").to_parquet("/tmp/i32.parquet", compression="brotli")
i32_size = Path("/tmp/i32.parquet").stat().st_size

df.astype("Int64").to_parquet("/tmp/i64.parquet", compression="brotli")
i64_size = Path("/tmp/i64.parquet").stat().st_size

print(i64_size / i32_size)

I expected this to output some number > 1, since I expect INT64 to be larger than INT32, but in fact I get ~0.96. Why is that?

I've checked with parquet-tools and the files are definitely saved as INT32 and INT64, respectively. If I try with gzip compression instead, I do get a ratio > 1.
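One way to narrow this down is to take parquet out of the picture entirely and compress the raw column bytes directly. Brotli isn't in the Python standard library, so the sketch below uses zlib (the DEFLATE codec behind gzip) as a stand-in; the data mirrors the question's but everything else is illustrative, not the original experiment:

```python
import zlib

import numpy as np

rng = np.random.default_rng(0)
vals = rng.integers(0, 10_000_000, size=1_000_000)

# Same values serialized at both widths. All values fit in 24 bits,
# so each int32 carries one zero byte and each int64 carries five.
raw32 = vals.astype(np.int32).tobytes()
raw64 = vals.astype(np.int64).tobytes()

c32 = len(zlib.compress(raw32))
c64 = len(zlib.compress(raw64))
print(len(raw64) / len(raw32))  # 2.0 uncompressed
print(c64 / c32)                # the interesting number
```

If the ratio here also drops below 1, the effect lives in the codec itself; if it stays above 1, that points back at parquet's own encoding layer (or its brotli bindings) rather than the compression algorithm.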

A. Rocke
  • What are the sizes, and what do you get with `compression=None`? – Kelly Bundy Sep 21 '20 at 19:20
  • @HeapOverflow 15917181 and 15360289 for 32 and 64-bit, respectively. With no compression, I get 18478908 and 33167196, so that 64-bit is around 1.8 times larger (as expected). – A. Rocke Sep 21 '20 at 19:32
  • So without compression, 32-bit has 18478908-16000000=2,478,908 bytes overhead while 64-bit has only 33167196-32000000=1,167,196 bytes overhead. Is that expected? – Kelly Bundy Sep 21 '20 at 19:42
  • I'm not sure, I suppose it's part of my question. Based on their docs, parquet does attempt to store integers in an efficient representation (and all of mine are small enough to be 32-bit), but I'm not sure of the details. When I say 'as expected', I just mean that the ratio is > 1. Since it's also > 1 under gzip, I had assumed it was more about brotli than parquet's own representation. Or possibly some synergistic combination of the two. – A. Rocke Sep 21 '20 at 20:03
  • I don't think this example is relevant since the generated data isn't using the full range of int64 and parquet can take advantage of that when compressing. When generating the data for int64 you should use `np.random.randint(0, 2**63-1, size=...)` and for int32 `np.random.randint(0, 2**31-1, size=...)` – 0x26res Sep 23 '20 at 09:29
  • @Arthur What would that tell us about *this*? – Kelly Bundy Sep 23 '20 at 12:14
  • that parquet or the compression algo is smart enough to know that the int64 can be encoded as int32 – 0x26res Sep 23 '20 at 12:28
  • @Arthur How would that explain why the 64-bit result is *smaller* than the 32-bit result? – Kelly Bundy Sep 23 '20 at 12:42
  • how about this: with int64 there are a lot of consecutive bits set to zero. So if they use run length encoding they might achieve better compression. – 0x26res Sep 24 '20 at 08:46
  • @Arthur That doesn't explain it. The same could then be done with int32 as well, and the run lengths would be smaller. (Btw please don't omit notifications.) – Kelly Bundy Sep 27 '20 at 20:01
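0x26res's point about value range can also be checked without parquet: when the values genuinely use the full width of each type, there are no zero bytes left for any codec to exploit. A minimal sketch, again substituting zlib for brotli:

```python
import zlib

import numpy as np

rng = np.random.default_rng(0)

# Values that actually span the full range of each integer type,
# so both byte buffers are essentially incompressible random data.
full32 = rng.integers(0, 2**31 - 1, size=1_000_000, dtype=np.int32)
full64 = rng.integers(0, 2**63 - 1, size=1_000_000, dtype=np.int64)

c32 = len(zlib.compress(full32.tobytes()))
c64 = len(zlib.compress(full64.tobytes()))
print(c64 / c32)
```

Since neither buffer compresses meaningfully, the ratio should land near 2, i.e. the width difference shows through. That isolates the question to how the codec (or parquet's encoding) handles the long zero runs in narrow-range int64 data.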

0 Answers