
I'm going from a dataframe to a Parquet file using either pyarrow or the pandas DataFrame method 'to_parquet', and both of them accept a field that specifies which compression codec to use. The issue is that when I generate the Parquet files with these libraries, the file size is roughly twice that of the output of an AWS DataBrew job with all of the same settings and the same input data.

For pyarrow:

import pyarrow as pa
import pyarrow.parquet as pq

df = df.convert_dtypes()
stream = pa.BufferOutputStream()  # in-memory sink for the Parquet bytes
table = pa.Table.from_pandas(df)
pq.write_table(table, stream, compression='SNAPPY')

For pandas dataframe:

import io

df = df.convert_dtypes()
stream = io.BytesIO()  # in-memory sink for the Parquet bytes
df.to_parquet(stream, compression='snappy', engine='pyarrow', index=False)
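
Here's a minimal sketch of how I've been comparing the sizes of the two outputs; the sample dataframe is just a stand-in for my real data:

import io
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Stand-in for the real dataframe
df = pd.DataFrame({"id": range(1000), "name": ["x"] * 1000}).convert_dtypes()

# pyarrow path
pa_stream = pa.BufferOutputStream()
pq.write_table(pa.Table.from_pandas(df, preserve_index=False), pa_stream,
               compression='SNAPPY')
pa_bytes = pa_stream.getvalue()

# pandas path (pyarrow engine under the hood)
pd_stream = io.BytesIO()
df.to_parquet(pd_stream, compression='snappy', engine='pyarrow', index=False)

print("pyarrow size:", pa_bytes.size)
print("pandas size: ", pd_stream.getbuffer().nbytes)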

I've checked that the dataframe holds exactly the data that goes through AWS and that all of the data types match, but I'm not getting anywhere near the same file size.
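
This is roughly how I compared the types: I print the Arrow schema that pyarrow infers from the dataframe, since the physical Parquet types are derived from it, and compare it with the schema of the DataBrew file (the sample dataframe and the file name are placeholders):

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Stand-in for the real dataframe
df = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]}).convert_dtypes()

# The physical Parquet types come from this Arrow schema
print(pa.Table.from_pandas(df, preserve_index=False).schema)

# 'databrew_output.parquet' is a placeholder for the file DataBrew wrote
print(pq.read_schema("databrew_output.parquet"))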

I've also tried:

pa.compress(df, codec='snappy', memory_pool=None)

with the compression argument in the 'to_parquet' functions set to None, but this gives me data that AWS can't read and that is somehow smaller than the file size I'm expecting.
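
For reference, this is what that attempt boils down to as far as I can tell: pa.compress just Snappy-compresses a raw buffer, with none of the Parquet file framing, which would explain why AWS can't read it:

import pyarrow as pa

# pa.compress works on raw buffers/bytes, not dataframes, and the result is
# a bare Snappy block -- no Parquet magic bytes, footer, or metadata.
raw = b"some bytes that stand in for serialized data"
compressed = pa.compress(raw, codec='snappy', asbytes=True)
print(len(raw), len(compressed))
print(compressed[:4])  # no 'PAR1' magic, so Parquet readers reject it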

Am I missing something? Do the 'to_parquet' functions actually compress the data? What kind of voodoo is AWS DataBrew doing to get its magical file size? I can't find good answers on Google or in the documentation, and I feel like I'm going in circles, so any help is very appreciated. From what I've seen, the AWS libraries use pyarrow for this kind of thing, which makes it even more confusing that I can't match the file sizes.
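
To convince myself that compression is actually happening, I've been writing the same table twice and comparing the sizes, roughly like this (the sample dataframe is a stand-in):

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Stand-in for the real dataframe
df = pd.DataFrame({"id": list(range(10000)), "name": ["repeated value"] * 10000})
table = pa.Table.from_pandas(df.convert_dtypes(), preserve_index=False)

sizes = {}
for codec in ("NONE", "SNAPPY"):
    sink = pa.BufferOutputStream()
    pq.write_table(table, sink, compression=codec)
    sizes[codec] = sink.getvalue().size

print(sizes)  # with compressible data, the SNAPPY size should come out smaller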

  • It's not clear to me what you are trying to compare against. Can you give more context on how you generate the DataBrew files and what they look like (e.g. using `aws s3 ls`)? It may also be useful to compare the metadata of the Parquet files using `pq.ParquetFile("foo.parquet").metadata.to_dict()` – 0x26res Sep 21 '22 at 09:10
  • The DataBrew files come from AppFlow. Some transformations are applied to the data, and then it is output to S3 (around a 2.2 KB file). The transformations are small, like renaming a column or changing the column order, and I've tested that they don't impact the final file size. If I load both files into a dataframe they have exactly the same number of bytes, but the output sizes still differ. Stripping all of the metadata from the Parquet file got it a lot closer to the right size, but I still end up at about 3.1 KB instead of the roughly 2.1 KB I'm expecting – user20035230 Sep 21 '22 at 21:20
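
For reference, a minimal sketch of the metadata comparison suggested in the comments, plus one way to drop the pandas schema metadata before writing; the dataframe and the file names are placeholders:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Stand-in dataframe
df = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]}).convert_dtypes()
table = pa.Table.from_pandas(df, preserve_index=False)

# Drop the pandas schema metadata so only the bare Arrow schema lands in the footer
stripped = table.replace_schema_metadata(None)
pq.write_table(stripped, "local_output.parquet", compression='SNAPPY')

# Compare footers: row groups, per-column encodings, compressed sizes, footer size
# ('databrew_output.parquet' is a placeholder for the DataBrew file)
for path in ("local_output.parquet", "databrew_output.parquet"):
    md = pq.ParquetFile(path).metadata
    print(path, "row groups:", md.num_row_groups, "footer bytes:", md.serialized_size)
    print(md.to_dict())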

0 Answers