Is there a way to reduce resource usage when reading and writing large dataframes with polars?

Question

For my specific problem I have been converting ".csv" files to ".parquet" files. The CSV files on disk are about 10-20 GB each.

A while back I was converting ".SAS7BDAT" files of similar size to ".parquet" files. I now receive the data as CSVs, so this might not be a good control, but back then I used the pyreadstat library to read the files (with its multi-threading parameter turned on, which didn't make a difference for some reason) and pandas to write them. It was also a tiny bit faster, but I feel the code ran on a single thread, and it took a week to convert all my data.

This time, I tried the polars library and it was blazing fast. CPU usage was near 100% and memory usage was also quite high. I tested this on a single file that would otherwise have taken hours, and it completed in minutes. The problem is that it uses too much of my computer's resources and my PC stalls. VSCode has crashed on some occasions. I have tried passing in the low memory parameter, but it still uses a lot of resources. My suspicion is the "reader.next_batches(500)" call, but I don't know for sure.

Regardless, is there a way to limit the CPU and memory usage while running this operation so I can at least browse the internet or listen to music while it runs in the background? With pandas the process is too slow; with polars the process is fast, but my PC becomes unusable at times. See image for the code I used.

Thanks.

I tried the low memory parameter with polars but memory usage was still quite high. I was expecting to at least use my PC while this worked in the background. My hope is to use 50-80% of my PC's resources such that enough resources are free for other work while the files are being converted.

Have you tried setting the `POLARS_MAX_THREADS` environment variable before importing `polars`? Perhaps leaving a CPU core for other tasks might be sufficient. — rickhg12hs, Feb 16 '23 at 02:07
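A minimal sketch of what this comment suggests, assuming the goal is to leave one core free. The helper name `limit_polars_threads` and the `reserve` parameter are hypothetical, chosen for illustration; only the `POLARS_MAX_THREADS` variable comes from the comment. The variable has to be set before `polars` is imported for the first time in the process.

```python
import os


def limit_polars_threads(reserve: int = 1) -> int:
    """Set POLARS_MAX_THREADS so that `reserve` cores stay free (hypothetical helper)."""
    total = os.cpu_count() or 1          # cpu_count() may return None
    n_threads = max(1, total - reserve)  # always allow at least one thread
    os.environ["POLARS_MAX_THREADS"] = str(n_threads)
    return n_threads


n_threads = limit_polars_threads(reserve=1)

# Import polars only after the variable is set, e.g.:
# import polars as pl
```

In a notebook this means restarting the kernel and running the cell above before any cell that imports `polars`. Note that this caps CPU threads only; it does not by itself limit memory usage.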

score 0 · Answer 1 · answered Feb 17 '23 at 18:59

I see you're on Windows, so convert your notebook into a .py script, then run the following from the command line:

start /low python yourscript.py

And/or use task manager to lower the priority of your python process once it's running.
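The same idea can also be sketched from Python itself, for cases where the conversion is launched by another script. This is an illustration under assumptions, not part of the answer: the helper name `run_python_low_priority` is hypothetical, the Windows branch uses `subprocess.IDLE_PRIORITY_CLASS` (the priority class that `start /low` selects), and the non-Windows branch raises the child's niceness with `os.nice` as a rough equivalent.

```python
import os
import subprocess
import sys


def run_python_low_priority(args):
    """Run the current Python interpreter with `args` at low CPU priority.

    Hypothetical helper: `args` are the interpreter arguments,
    e.g. ["yourscript.py"].
    """
    cmd = [sys.executable, *args]
    if os.name == "nt":
        # Windows: idle priority class, as with `start /low`
        return subprocess.run(cmd, creationflags=subprocess.IDLE_PRIORITY_CLASS)
    # POSIX: raise niceness in the child just before it starts the program
    return subprocess.run(cmd, preexec_fn=lambda: os.nice(10))


# Example call (assumes yourscript.py exists in the working directory):
# run_python_low_priority(["yourscript.py"])
```

Lower priority only affects how the scheduler shares the CPU with other programs; memory usage of the conversion is unchanged.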

Dean MacGregor