Wisdom of such a use case aside... how come a machine with 512 GB of RAM (and nothing else running) runs out of memory while trying to save a pandas DataFrame (`df.to_parquet(...)`) whose object size (`sys.getsizeof`) is "only" ~25 GB? The df is ~73 million rows by 2 columns: one holds an English natural-language sentence, the other a string id.

With `engine='fastparquet'` it just quits with an overflow error; with the default `pyarrow` engine it runs out of memory. In both cases the compression setting (default, `gzip`, or `None`) does not change that respective behavior. Same result 3 times.