
Wisdom of such a use case aside... how come a machine with 512 GB of RAM (and nothing else running) runs out of memory while trying to save a pandas DataFrame (df.to_parquet(...)) whose object size (sys.getsizeof) is "only" ~25 GB? The DataFrame is ~73 million rows by 2 columns: one column holds an English natural-language sentence, the other a string id.

With engine='fastparquet', it quits with an overflow error; with the default pyarrow engine it runs out of memory. In both cases the compression setting (default, gzip, or None) does not change the respective behavior. Same result 3 times.
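
For reference, a minimal sketch of the calls described above. The column names ("id", "sentence") and the dummy data are my own assumptions, and the row count is scaled down here; the real DataFrame had ~73 million rows of object dtype.

```python
import pandas as pd

# Hypothetical reproduction of the setup: one string-id column and one
# English-sentence column, both object dtype.
n_rows = 1_000_000  # scaled down; the original was ~73_000_000
df = pd.DataFrame({
    "id": [f"id_{i}" for i in range(n_rows)],
    "sentence": ["This is an English natural-language sentence."] * n_rows,
})

# The variants that reportedly fail at full scale:
df.to_parquet("out_pyarrow.parquet")                            # default engine (pyarrow): runs out of memory
df.to_parquet("out_fastparquet.parquet", engine="fastparquet")  # fastparquet: overflow error
df.to_parquet("out_none.parquet", compression=None)             # compression setting does not change the outcome
```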

Tim
    `sys.getsizeof` may not give you the full picture. You should try `df.memory_usage(deep=True)`. Compression will not help the memory usage (it will only reduce the disk space of the parquet file). – 0x26res Jun 17 '22 at 07:48
  • What happens if you assign the DF to a PA table? ie `table=pa.Table.from_pandas(DF)` Does that work? – Dean MacGregor Jul 25 '22 at 15:15
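
A sketch of the diagnostics the two comments suggest: comparing `sys.getsizeof` with `df.memory_usage(deep=True)`, and running the pandas-to-Arrow conversion on its own to see where the memory goes. It reuses the hypothetical `df` from the sketch above.

```python
import sys

import pyarrow as pa

# 1) Compare the shallow size with the per-column "deep" accounting,
#    which includes the actual Python string payloads.
print("sys.getsizeof:", sys.getsizeof(df))
print("memory_usage(deep=True):")
print(df.memory_usage(deep=True))

# 2) Convert to an Arrow table by itself. pyarrow copies the string
#    data into its own buffers, so while both the DataFrame and the
#    table are alive the string data exists twice in memory.
table = pa.Table.from_pandas(df)
print("Arrow table size:", table.nbytes)
```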

0 Answers