Over-high memory usage during reading parquet in Python

Question

I have a parquet file of a bit over 10GB whose columns are mainly strings. While it is being loaded into memory, usage peaks at about 110G; once loading has finished, usage drops back to around 40G.

I'm working on a high-performance computer with allocated memory, so I do have access to a lot of memory. However, it seems wasteful to request 128G just to load the data when 64G is sufficient afterwards. Also, 128G allocations are more often unavailable.

My naive conjecture is that the Python interpreter mistakes the 512G of physical memory on the HPC node for the total available memory, so it does not garbage-collect as often as it actually needs to. For example, when I load the data with 64G of memory, I never get a MemoryError; instead the kernel is killed outright and restarted.

I was wondering whether this high memory usage during loading is normal behavior for pyarrow, or whether it is due to the particular setup of my environment. If the latter, is it possible to somehow limit the available memory during loading?

ps. I'm loading data using pd.read_parquet with pyarrow engine. — SymbolRanger, Aug 09 '19 at 19:37
What happens if you try to trigger the garbage collection (`gc.collect()`) while the data set is loading? — Ente, Aug 09 '19 at 20:02
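
On the question of limiting the available memory during loading: one possible approach on Linux is to cap the process's virtual address space with the standard-library `resource` module, so that an oversized allocation is more likely to fail inside Python than to get the kernel killed. This is only a sketch. The helper name `address_space_limit` is made up for illustration, and `RLIMIT_AS` counts virtual address space rather than resident memory, so the limit may be hit earlier than the numbers in a memory monitor suggest.

```python
import contextlib
import resource  # Unix-only standard-library module


@contextlib.contextmanager
def address_space_limit(max_bytes):
    """Temporarily cap this process's virtual address space (RLIMIT_AS)."""
    old_soft, hard = resource.getrlimit(resource.RLIMIT_AS)
    # An unprivileged process cannot raise the soft limit above the hard limit.
    if hard != resource.RLIM_INFINITY:
        max_bytes = min(max_bytes, hard)
    resource.setrlimit(resource.RLIMIT_AS, (max_bytes, hard))
    try:
        yield max_bytes
    finally:
        # Restore the previous soft limit once loading is done.
        resource.setrlimit(resource.RLIMIT_AS, (old_soft, hard))


# Hypothetical usage (the file name and the 64 GiB figure are placeholders):
# with address_space_limit(64 * 1024**3):
#     df = pd.read_parquet("data.parquet", engine="pyarrow")
```

This does not reduce the peak itself; it only changes how running out of memory shows up, which can help tell an interpreter-level problem from an allocation limit enforced by the scheduler.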

score 2 · Answer 1 · answered Aug 09 '19 at 21:15

We fixed a memory use bug that's present in 0.14.0/0.14.1 (which is probably what you're using right now).

https://issues.apache.org/jira/browse/ARROW-6060

We are also introducing an option to read string columns as categorical (aka DictionaryArray in Arrow parlance), which will also reduce memory usage. See https://issues.apache.org/jira/browse/ARROW-3325 and the discussion in

https://ursalabs.org/blog/2019-06-07-monthly-report/
