I have some sample CSV files and two programs that read, filter, and concatenate them.
Here is the LazyFrame version of the code:

```python
import os
os.environ["POLARS_MAX_THREADS"] = "1"

import polars as pl

df = pl.concat(
    [
        pl.scan_csv("test.csv").filter(pl.col("x3") > 0),
        pl.scan_csv("test1.csv").filter(pl.col("x3") > 0),
        pl.scan_csv("test2.csv").filter(pl.col("x3") > 0),
    ]
).collect()
```
The eager version replaces `scan_csv` with `read_csv`.
Now I would expect the LazyFrame version to perform at least as well, but instead it uses more memory (and more memory still if I increase `POLARS_MAX_THREADS`). I generated the following graph with mprof:
Is this a reliable representation of the memory usage? Will this ever be improved, or is it necessary for lazy evaluation to work this way?