
In the following toy example, which involves both window calculations and group-by aggregations, DuckDB runs nearly 3x slower than Polars in Python. Both give exactly the same results.

Is this benchmarking result expected because DuckDB is designed for, and should mostly be used for, cross-dataframe/table operations?

Or does the inefficiency just come from the way my SQL query is written?

import time

import duckdb
import numpy as np
import polars as pl

## example dataframe
rng = np.random.default_rng(1)

nrows = 10_000_000
df = pl.DataFrame(
    dict(
        id=rng.integers(1, 100, nrows),
        id2=rng.integers(1, 1_000, nrows),
        v1=rng.normal(0, 1, nrows),
        v2=rng.normal(0, 1, nrows),
        v3=rng.normal(0, 1, nrows),
        v4=rng.normal(0, 1, nrows),
    )
)

## polars
start = time.perf_counter()
res = (
    df.select(
        [
            "id",
            "id2",
            pl.col("v1") - pl.col("v1").mean().over(["id", "id2"]),
            pl.col("v2") - pl.col("v2").mean().over(["id", "id2"]),
            pl.col("v3") - pl.col("v3").mean().over(["id", "id2"]),
            pl.col("v4") - pl.col("v4").mean().over(["id", "id2"]),
        ]
    )
    .group_by(["id", "id2"])
    .agg(
        [
            (pl.col("v1") * pl.col("v2")).sum().alias("ans1"),
            (pl.col("v3") * pl.col("v4")).sum().alias("ans2"),
        ]
    )
)
print(time.perf_counter() - start)
# 1.0977217499166727

## duckdb
start = time.perf_counter()
res2 = (
    duckdb.sql(
        """
        SELECT id, id2,
        v1 - mean(v1) OVER (PARTITION BY id, id2) as v1,
        v2 - mean(v2) OVER (PARTITION BY id, id2) as v2,
        v3 - mean(v3) OVER (PARTITION BY id, id2) as v3,
        v4 - mean(v4) OVER (PARTITION BY id, id2) as v4,
        FROM df
        """
    )
    .aggregate(
        "id, id2, sum(v1 * v2) as ans1, sum(v3 * v4) as ans2",
        "id, id2",
    )
    .pl()
)
print(time.perf_counter() - start)
# 3.549897135235369
lebesgue

2 Answers


I'm not an SQL expert, but it looks like your query is equivalent to:

duckdb.sql("""
   with mean as (
      from 
         df 
      select 
         id,
         id2,
         mean(v1) v1, 
         mean(v2) v2, 
         mean(v3) v3, 
         mean(v4) v4,
      group by id, id2
   )
   from 
      df left join mean using (id, id2)
   select
      id,
      id2,
      sum((df.v1 - mean.v1) * (df.v2 - mean.v2)) ans1,
      sum((df.v3 - mean.v3) * (df.v4 - mean.v4)) ans2,
   group by id, id2
""")

The polars equivalent would be similar to:

(df.group_by('id', 'id2')
   .agg(
      ans1 = (pl.col('v1') - pl.col('v1').mean()) * (pl.col('v2') - pl.col('v2').mean()),
      ans2 = (pl.col('v3') - pl.col('v3').mean()) * (pl.col('v4') - pl.col('v4').mean()),
   )
   .with_columns(
      pl.col('ans1', 'ans2').list.sum()
   )
)

The timings are much closer in those cases:

Method      Time (s)
polars v1   1.978311300976202
duckdb v1   3.188773022033274
polars v2   1.434564991039224
duckdb v2   1.2010211820015684
duckdb v3   3.0945987859740853
jqurious
  • Indeed, your query is equivalent to mine, but it is faster. I am not sure why that is the case. But it still takes ~2s, which is still 2x slower than the Polars one. – lebesgue Jun 17 '23 at 18:24
  • By the way, I am only measuring the first-run speed, because subsequent runs are much faster due to caching. – lebesgue Jun 17 '23 at 18:26
  • Ah, I had all the tests in a single file - silly me. I've split them all into separate files and updated the timings. The duckdb v2 still runs faster for me. I added in @hawkfish's query also as v3. – jqurious Jun 17 '23 at 18:35
  • For your duckdb v2, from my benchmarking results, the first run takes 1.98s and subsequent runs take only 0.33s. So it seems your timing is the average time taken? – lebesgue Jun 17 '23 at 18:41
  • They're all first time run timings. – jqurious Jun 17 '23 at 18:48

What DuckDB version are you running against? The window computations you are using were optimised for 0.8.x because they are constant across each partition.

I made a benchmark for a similar query (just different random numbers) and it ran in about the same time as the rewritten v2 query by @jqurious:

load
SELECT SETSEED(0.8675309);
CREATE TABLE df AS
    SELECT 
        (random() * 100)::INTEGER + 1 AS id,
        (random() * 1000)::INTEGER + 1 AS id2,
        random() AS v1,
        random() AS v2,
        random() AS v3,
        random() AS v4,
    FROM range(10000000);

run
SELECT id, id2, sum(v1 * v2) as ans1, sum(v3 * v4) as ans2
FROM (
    SELECT id, id2,
    v1 - mean(v1) OVER (PARTITION BY id, id2) as v1,
    v2 - mean(v2) OVER (PARTITION BY id, id2) as v2,
    v3 - mean(v3) OVER (PARTITION BY id, id2) as v3,
    v4 - mean(v4) OVER (PARTITION BY id, id2) as v4,
    FROM df
)
GROUP BY ALL

Results:

name    run timing
benchmark/micro/window/temp.benchmark   1   0.930582
benchmark/micro/window/temp.benchmark   2   0.914845
benchmark/micro/window/temp.benchmark   3   0.965103
benchmark/micro/window/temp.benchmark   4   0.951137
benchmark/micro/window/temp.benchmark   5   0.945187

With some pending vectorisation improvements, it drops another 10%:

name    run timing
benchmark/micro/window/temp.benchmark   1   0.823040
benchmark/micro/window/temp.benchmark   2   0.826832
benchmark/micro/window/temp.benchmark   3   0.851842
benchmark/micro/window/temp.benchmark   4   0.797956
benchmark/micro/window/temp.benchmark   5   0.861512
hawkfish
  • I am using the latest version - 0.8.1. – lebesgue Jun 17 '23 at 16:37
  • Is it because there is some overhead in my calculations from the conversions to/from the Polars DataFrame? What you are testing is pure SQL. – lebesgue Jun 17 '23 at 16:40
  • Also, the only difference between my query and yours is that mine is separated into two parts (one for prep and one for the final aggregation, for clarity). Does this matter? I assume not, because there should be query optimization done under the hood anyway? – lebesgue Jun 17 '23 at 16:43
  • DuckDB uses the same internal in-memory table API to scan both internal tables and data frames, so I doubt there is any conversion difference. Also, I'm not sure what you mean by "my query is separate into two parts" - I believe the stacked Python calls will be collapsed into a single query. Incidentally, I used the benchmarking framework because it does a warm-up run precisely to avoid the 1.98/0.33 difference you observed, before performing 5 runs in quick succession using pre-built data. Benchmarking is hard. – hawkfish Jun 18 '23 at 13:33