
Wisdom of such a use case aside... how come a machine with 512 GB of RAM (and nothing else running) runs out of memory while trying to save a pandas DataFrame (df.to_parquet(...)) whose object size (sys.getsizeof) is "only" ~25 GB? The DataFrame is ~73 million rows by 2 columns: one column holds an English natural-language sentence, the other a string id.

With engine='fastparquet', it quits with an overflow error; with the default pyarrow engine it runs out of memory. In both cases the compression setting (default, gzip, or None) does not change the respective behavior. Same result 3 times.
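
For reference, a minimal sketch of the calls described above. The column names ("id", "sentence") and the dummy data are my own assumptions, and the row count is scaled down here; the real DataFrame had ~73 million rows of object dtype.

```python
import pandas as pd

# Hypothetical reproduction of the setup: one string-id column and one
# English-sentence column, both object dtype.
n_rows = 1_000_000  # scaled down; the original was ~73_000_000
df = pd.DataFrame({
    "id": [f"id_{i}" for i in range(n_rows)],
    "sentence": ["This is an English natural-language sentence."] * n_rows,
})

# The variants that reportedly fail at full scale:
df.to_parquet("out_pyarrow.parquet")                            # default engine (pyarrow): runs out of memory
df.to_parquet("out_fastparquet.parquet", engine="fastparquet")  # fastparquet: overflow error
df.to_parquet("out_none.parquet", compression=None)             # compression setting does not change the outcome
```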

Tim
    `sys.getsizeof` may not give you the full picture. You should try `df.memory_usage(deep=True)`. Compression will not help the memory usage (it will only reduce the disk space of the parquet file). – 0x26res Jun 17 '22 at 07:48
  • What happens if you assign the DF to a PA table? ie `table=pa.Table.from_pandas(DF)` Does that work? – Dean MacGregor Jul 25 '22 at 15:15
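
A sketch of the diagnostics the two comments suggest: comparing `sys.getsizeof` with `df.memory_usage(deep=True)`, and running the pandas-to-Arrow conversion on its own to see where the memory goes. It reuses the hypothetical `df` from the sketch above.

```python
import sys

import pyarrow as pa

# 1) Compare the shallow size with the per-column "deep" accounting,
#    which includes the actual Python string payloads.
print("sys.getsizeof:", sys.getsizeof(df))
print("memory_usage(deep=True):")
print(df.memory_usage(deep=True))

# 2) Convert to an Arrow table by itself. pyarrow copies the string
#    data into its own buffers, so while both the DataFrame and the
#    table are alive the string data exists twice in memory.
table = pa.Table.from_pandas(df)
print("Arrow table size:", table.nbytes)
```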

0 Answers