
Starting with the example below:

import time
import numpy as np
import polars as pl

n_index = 1000
n_a = 10
n_b = 500
n_obs = 5000000

df = pl.DataFrame(
    {
        "id": np.random.randint(0, n_index, size=n_obs),
        "a": np.random.randint(0, n_a, size=n_obs),
        "b": np.random.randint(0, n_b, size=n_obs),
        "x": np.random.normal(0, 1, n_obs),
    }
).lazy()

dfs = [
    pl.DataFrame(
        {
            "id": np.random.randint(0, n_index, size=n_obs),
            "a": np.random.randint(0, n_a, size=n_obs),
            f"b_{i}": np.random.randint(0, n_b, size=n_obs),
            "x": np.random.normal(0, 1, n_obs),
        }
    ).lazy()
    for i in range(50)
]

res = [
    # one inner join + aggregation per right-hand frame; everything stays lazy
    df.join(
        dfs[i], left_on=["id", "a", "b"], right_on=["id", "a", f"b_{i}"], how="inner"
    )
    .group_by(["a", "b"])
    .agg((pl.col("x") * pl.col("x_right")).sum().alias(f"x_{i}"))
    for i in range(50)
]

The task is essentially to process a number of different dataframes, do some computation on each, and then join all the results back together. The code above constructs res, which holds all of the individual results as a list of lazy frames.

To join the results back together, I tried the following two options.

Option 1:

start = time.perf_counter()
res2 = pl.collect_all(res)  # evaluate all 50 lazy queries in parallel
res3 = res2[0]
for i in range(1, 50):
    res3 = res3.join(res2[i], on=["a", "b"])
print(time.perf_counter() - start)

Option 2:

start = time.perf_counter()
res4 = res[0]
for i in range(1, 50):
    res4 = res4.join(res[i], on=["a", "b"])
res4 = res4.collect()  # single collect at the very end
print(time.perf_counter() - start)

Option 1 calls collect_all first and then joins the collected dataframes one by one. Option 2 stays entirely lazy and performs a single collect at the very end.

As far as I know, collect applies query optimizations under the hood, so I expected options 1 and 2 to have similar performance. However, my benchmark shows option 2 taking twice as long as option 1 (21 s vs. 10 s on my 32-core machine).
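
For what it is worth, one way to check whether the optimizer recognizes that the same df is scanned in all 50 joins is to print the optimized plan of the combined lazy query. This is just a sketch assuming a recent Polars where LazyFrame.explain() is available (older releases exposed describe_optimized_plan() instead):

from functools import reduce

# Build the same lazy query as option 2 and inspect the optimized plan.
# When common-subplan elimination kicks in, the reused branches show up
# as CACHE nodes in the printed plan (in recent Polars versions).
lazy_all = reduce(lambda left, right: left.join(right, on=["a", "b"]), res)
print(lazy_all.explain())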

So, is this behavior expected? Or is there some inefficiency in the approach I took?

One good thing about option 2 is that it is entirely lazy, which is preferable when building an API that should return a lazy dataframe and let users decide what to do next. But, in my experiment, that costs a lot of performance. So I am wondering: is there a way to do something like option 2 without sacrificing performance (i.e., with performance comparable to option 1)?
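
One workaround I can think of is a hybrid that materializes the branches with collect_all internally but still returns a LazyFrame by re-entering lazy mode afterwards. This is only a sketch, and it assumes eager evaluation inside the helper is acceptable; it is no longer truly lazy end to end:

from functools import reduce

def join_all_lazy(frames):
    # Evaluate all branches in parallel, as in option 1 ...
    collected = pl.collect_all(frames)
    # ... then re-enter lazy mode so callers still receive a LazyFrame.
    return reduce(
        lambda left, right: left.join(right, on=["a", "b"]),
        (f.lazy() for f in collected),
    )

res5 = join_all_lazy(res)  # LazyFrame; callers decide when to collect

This keeps option 1's parallel collection while preserving a lazy return type, but the heavy work happens eagerly inside the helper rather than at the user's collect call.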
