Example setup (warning: creating this DataFrame takes ~5 GB of memory):
```python
import time
import numpy as np
import polars as pl

rng = np.random.default_rng(1)
nrows = 50_000_000
df = pl.DataFrame(
    dict(
        id=rng.integers(1, 50, nrows),
        id2=rng.integers(1, 500, nrows),
        v=rng.normal(0, 1, nrows),
        v1=rng.normal(0, 1, nrows),
        v2=rng.normal(0, 1, nrows),
        v3=rng.normal(0, 1, nrows),
        v4=rng.normal(0, 1, nrows),
        v5=rng.normal(0, 1, nrows),
        v6=rng.normal(0, 1, nrows),
        v7=rng.normal(0, 1, nrows),
        v8=rng.normal(0, 1, nrows),
        v9=rng.normal(0, 1, nrows),
        v10=rng.normal(0, 1, nrows),
    )
)
```
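As a quick sanity check on that memory warning, `DataFrame.estimated_size` reports the approximate in-memory size (the exact figure may vary a bit by Polars version):

```python
# 13 columns x 50M rows of 8-byte values is roughly 5 GB in memory.
print(df.estimated_size("gb"))
```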
I have a simple task at hand, as follows.
```python
start = time.perf_counter()
res = (
    df.lazy()
    .with_columns(
        [
            pl.col(f"v{i}") - pl.col(f"v{i}").mean().over(["id", "id2"])
            for i in range(1, 11)
        ]
    )
    .groupby(["id", "id2"])
    .agg([(pl.col(f"v{i}") * pl.col("v")).sum() for i in range(1, 11)])
    .collect()
)
time.perf_counter() - start
# 9.85
```
The task above completes in ~10s on a 16-core machine.
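To confirm how many threads Polars actually has available on this machine, here is a quick check (note: the function is called `thread_pool_size` in recent Polars releases; older versions spell it `threadpool_size`):

```python
import os

import polars as pl

print(os.cpu_count())         # logical CPUs reported by the OS
print(pl.thread_pool_size())  # Polars thread pool size; defaults to the CPU
                              # count unless POLARS_MAX_THREADS is set
```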
However, if I first split/partition the `df` by `id`, then perform the same calculation as above and call `collect_all` and `concat` at the end, I can get a nearly 2x speedup.
```python
start = time.perf_counter()
res2 = pl.concat(
    pl.collect_all(
        [
            dfi.lazy()
            .with_columns(
                [
                    pl.col(f"v{i}") - pl.col(f"v{i}").mean().over(["id", "id2"])
                    for i in range(1, 11)
                ]
            )
            .groupby(["id", "id2"])
            .agg([(pl.col(f"v{i}") * pl.col("v")).sum() for i in range(1, 11)])
            for dfi in df.partition_by("id", maintain_order=False)
        ]
    )
)
time.perf_counter() - start
# 5.60
```
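The two results match up to row order; here is a minimal check, assuming `polars.testing.assert_frame_equal` is available in your version (it applies a small floating-point tolerance by default):

```python
from polars.testing import assert_frame_equal

# Row order differs between the two results, so sort on the group keys first.
# The per-group float sums can differ in the last bits, which the default
# tolerance of assert_frame_equal absorbs.
assert_frame_equal(res.sort(["id", "id2"]), res2.sort(["id", "id2"]))
```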
In addition, if I partition by `id2` instead of `id`, it is even faster, at ~4s; a sketch of that variant follows.
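For completeness, the `id2` variant is the same pipeline with only the partition key changed (stored in `res3` here):

```python
start = time.perf_counter()
res3 = pl.concat(
    pl.collect_all(
        [
            dfi.lazy()
            .with_columns(
                [
                    pl.col(f"v{i}") - pl.col(f"v{i}").mean().over(["id", "id2"])
                    for i in range(1, 11)
                ]
            )
            .groupby(["id", "id2"])
            .agg([(pl.col(f"v{i}") * pl.col("v")).sum() for i in range(1, 11)])
            for dfi in df.partition_by("id2", maintain_order=False)
        ]
    )
)
time.perf_counter() - start
# ~4
```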
I also noticed that the second approach (partitioning by either `id` or `id2`) has better CPU utilization than the first one. Maybe this is why the second approach is faster; a rough way to measure the utilization is sketched below.
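This is a sketch only: `psutil` is an extra dependency not used in the setup above, and `run_query` is a hypothetical placeholder for whichever pipeline you want to profile (e.g. `lambda: pl.collect_all([...])`).

```python
import threading

import psutil

def measure_cpu(run_query):
    """Sample system-wide CPU utilization while `run_query()` executes."""
    samples = []
    stop = threading.Event()

    def sampler():
        while not stop.is_set():
            # Average utilization across all cores over a 0.5 s window
            samples.append(psutil.cpu_percent(interval=0.5))

    t = threading.Thread(target=sampler)
    t.start()
    result = run_query()  # run_query is a placeholder for either query above
    stop.set()
    t.join()
    print(f"mean CPU utilization: {sum(samples) / max(len(samples), 1):.1f}%")
    return result
```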
My questions are:
- Why is the second approach faster, and why does it have better CPU utilization?
- Shouldn't the two perform the same, given that, as I understand it, window/groupby operations are always executed in parallel across windows/groups and use as much of the available resources as possible?