
I have some sample CSV files and two programs that read, filter, and concatenate them.
Here is the LazyFrame version of the code:

import os

# Limit Polars to a single thread; this must be set before polars is imported.
os.environ["POLARS_MAX_THREADS"] = "1"

import polars as pl

df = pl.concat(
    [
        pl.scan_csv("test.csv").filter(pl.col("x3") > 0),
        pl.scan_csv("test1.csv").filter(pl.col("x3") > 0),
        pl.scan_csv("test2.csv").filter(pl.col("x3") > 0),
    ]
).collect()

The eager version replaces scan_csv with read_csv.
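
For reference, a minimal sketch of that eager version (same file names; note that read_csv returns a DataFrame, so there is no final collect()):

import polars as pl

# Eager version: each file is read fully into memory, then filtered,
# and the three DataFrames are concatenated.
df = pl.concat(
    [
        pl.read_csv("test.csv").filter(pl.col("x3") > 0),
        pl.read_csv("test1.csv").filter(pl.col("x3") > 0),
        pl.read_csv("test2.csv").filter(pl.col("x3") > 0),
    ]
)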

Now I would expect the LazyFrame version to perform just as well, but instead it uses more memory. (And more memory still if we increase the number of cores.) I generated the following graph with mprof:

[Figure: mprof RAM usage vs. time]

Is this a reliable representation of the memory usage? Will this ever be improved, or is it necessary for lazy evaluation to work this way?

1 Answer


A lazy concat will parallelize the work over its inputs. This might give a bit more memory usage than the sequential reads in eager. That's why you see it increase when you use more cores.
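
If that extra parallelism is the concern, here is a minimal sketch that asks the lazy concat to run its inputs sequentially, trading some speed for a lower peak memory footprint (assuming the parallel flag of pl.concat in your Polars version):

import polars as pl

# Lazy concat, but with parallel execution of the inputs disabled,
# so the three scans run one after another as in the eager version.
df = pl.concat(
    [
        pl.scan_csv("test.csv").filter(pl.col("x3") > 0),
        pl.scan_csv("test1.csv").filter(pl.col("x3") > 0),
        pl.scan_csv("test2.csv").filter(pl.col("x3") > 0),
    ],
    parallel=False,
).collect()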

The predicates are pushed down to the scan level. If your memory usage does not drop because of that, you probably don't filter out many rows. Because we want memory to stay low if we DO filter out many rows, a lazy reader works on smaller chunks and probably has more heap fragmentation.
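
You can verify the pushdown yourself by printing the optimized plan; the filter should appear as a selection inside the CSV scan rather than as a separate step (a sketch; the method name and plan text depend on your Polars version):

import polars as pl

lf = pl.scan_csv("test.csv").filter(pl.col("x3") > 0)

# Show the optimized query plan. With predicate pushdown, the x3 > 0 filter
# is listed as part of the CSV scan node instead of a standalone FILTER.
print(lf.explain())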

Lazy does not optimize for these micro-benchmarks, but looks at the longer query as a whole. When you start selecting subsets of columns/rows (either directly or by grouping), slicing, or filtering, lazy will do a lot less work and use much less memory than eager.
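
For example, in a longer query such as the sketch below (the column "x1" is hypothetical and used only for illustration; group_by is spelled groupby on older Polars versions), the optimizer can push both the predicate and the column selection down into each scan, so only the referenced columns and the matching rows are ever read:

import polars as pl

# Lazy query over all three files: the filter and the column selection
# are pushed down into the scans, so far less data is read and materialized
# than in an eager read-everything-then-filter pipeline.
result = (
    pl.concat(
        [
            pl.scan_csv("test.csv"),
            pl.scan_csv("test1.csv"),
            pl.scan_csv("test2.csv"),
        ]
    )
    .filter(pl.col("x3") > 0)
    .group_by("x1")
    .agg(pl.col("x3").sum())
    .collect()
)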

ritchie46