Starting with the example below:
import time

import numpy as np
import polars as pl

n_index = 1000
n_a = 10
n_b = 500
n_obs = 5_000_000

df = pl.DataFrame(
    {
        "id": np.random.randint(0, n_index, size=n_obs),
        "a": np.random.randint(0, n_a, size=n_obs),
        "b": np.random.randint(0, n_b, size=n_obs),
        "x": np.random.normal(0, 1, n_obs),
    }
).lazy()

# 50 frames to join against, each with its own b_{i} key column
dfs = [
    pl.DataFrame(
        {
            "id": np.random.randint(0, n_index, size=n_obs),
            "a": np.random.randint(0, n_a, size=n_obs),
            f"b_{i}": np.random.randint(0, n_b, size=n_obs),
            "x": np.random.normal(0, 1, n_obs),
        }
    ).lazy()
    for i in range(50)
]

# one lazy query per frame: inner join on the keys,
# then a grouped sum of products
res = [
    df.join(
        dfs[i], left_on=["id", "a", "b"], right_on=["id", "a", f"b_{i}"], how="inner"
    )
    .groupby(["a", "b"])
    .agg((pl.col("x") * pl.col("x_right")).sum().alias(f"x_{i}"))
    for i in range(50)
]
The task is essentially to process a number of different dataframes, do some computation on each, and then join all the results back together. The code above constructs res, which holds all of the per-frame results as a list of LazyFrames.
To join the results back together, I tried the two options below.
Option 1:
start = time.perf_counter()
res2 = pl.collect_all(res)  # run all 50 queries in parallel, yielding eager DataFrames
res3 = res2[0]
for i in range(1, 50):
    res3 = res3.join(res2[i], on=["a", "b"])
print(time.perf_counter() - start)
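(As an aside, the join loop is just a fold, so it can equivalently be written with functools.reduce; this is purely a stylistic variant of the loop above:)

from functools import reduce

# Fold the 50 eager results into one frame via pairwise inner joins,
# exactly like the loop above.
res3 = reduce(lambda left, right: left.join(right, on=["a", "b"]), res2)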
Option 2:
start = time.perf_counter()
res4 = res[0]
for i in range(1, 50):
    res4 = res4.join(res[i], on=["a", "b"])  # still lazy: builds one big query
res4 = res4.collect()  # single collect at the very end
print(time.perf_counter() - start)
Option 1 calls collect_all first and then joins the resulting eager dataframes one by one. Option 2 does everything in an entirely lazy way and performs a single collect at the very end. As far as I know, collect applies query optimizations under the hood, so I expected options 1 and 2 to have similar performance. However, my benchmarks show that option 2 takes about twice as long as option 1 (21 s vs. 10 s on my 32-core machine).
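For what it's worth, the plan that option 2 actually executes can be printed before collecting (a minimal sketch; I believe explain() is the method on recent Polars versions, while older releases exposed describe_optimized_plan() instead):

# Rebuild the fully lazy option-2 chain and print the optimized plan,
# without triggering any computation.
plan = res[0]
for i in range(1, 50):
    plan = plan.join(res[i], on=["a", "b"])
print(plan.explain())  # assumes a Polars version that provides LazyFrame.explain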
So, is this behavior more or less expected? Or is there some inefficiency in the approach I took?
One good thing about option 2 is that it is entirely lazy, which is the preferable shape for an API that should return a LazyFrame and let users decide what to do next. But in my experiment that costs a lot of performance. So I am wondering: is there a way to do something like option 2 without sacrificing performance, i.e. with performance comparable to option 1?
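The only hybrid I can think of is to collect the independent pieces in parallel and then re-wrap the eager results as lazy frames, so the function can still hand back a LazyFrame; a minimal sketch (with the obvious caveat that the heavy computation is no longer deferred):

# Collect the 50 independent queries in parallel, as in option 1 ...
collected = pl.collect_all(res)
# ... then go back to the lazy API so callers still receive a LazyFrame.
out = collected[0].lazy()
for i in range(1, 50):
    out = out.join(collected[i].lazy(), on=["a", "b"])
# `out` is lazy again, but the expensive joins/aggregations already ran
# eagerly, so this only fakes laziness for the final join chain.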