Comparing the two code sections below, whose only difference is that the second one computes ranks before the correlation, the second section runs much slower than the first (~5x). Although the second section does a few extra computations (the ranks), I don't expect, and don't understand, such a big difference. Am I implementing the query very inefficiently? (I don't think so, because DuckDB should have query optimizations in place.)
Note: a similar computation using Polars does not show such a big slowdown (code omitted here to keep the question from being overwhelmed with too much code).
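For context, the rank correlation computed by the second query is Spearman-style: Pearson correlation applied to ranks instead of raw values. A tiny self-contained check of that equivalence (pure NumPy, no ties in the sample):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = x + rng.normal(size=100)

# rank-transform both variables (argsort of argsort gives 0-based ranks;
# fine here because continuous samples have no ties)
rx = x.argsort().argsort().astype(float)
ry = y.argsort().argsort().astype(float)

# Spearman's rho is just the Pearson correlation of the ranks
rho = np.corrcoef(rx, ry)[0, 1]
print(rho)
```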
import time
import duckdb
import numpy as np
import polars as pl
## example dataframe
rng = np.random.default_rng(1)
nrows = 5_000_000
df = pl.DataFrame(
    dict(
        id=rng.integers(1, 1_000, nrows),
        id2=rng.integers(1, 10, nrows),
        id3=rng.integers(1, 500, nrows),
        v1=rng.normal(0, 1, nrows),
        v2=rng.normal(0, 1, nrows),
    )
)
## pearson correlation
start = time.perf_counter()
res = duckdb.sql(
    """
    WITH cte AS (
        SELECT df.id2, df.id3, df2.id3 AS id3_right,
               df.v1, df2.v1 AS v1_right,
               df.v2, df2.v2 AS v2_right
        FROM df
        JOIN df AS df2
            ON (
                df.id = df2.id
                AND df.id2 = df2.id2
                AND df.id3 > df2.id3
                AND df.id3 < df2.id3 + 30
            )
    )
    SELECT id2, id3, id3_right, corr(v1, v1_right) AS v1, corr(v2, v2_right) AS v2
    FROM cte
    GROUP BY id2, id3, id3_right
    """
).pl()
print(time.perf_counter() - start)
# 19.462523670867085
## rank correlation
start = time.perf_counter()
res2 = duckdb.sql(
    """
    WITH cte AS (
        SELECT df.id2, df.id3, df2.id3 AS id3_right,
               RANK() OVER (g ORDER BY df.v1) AS v1,
               RANK() OVER (g ORDER BY df2.v1) AS v1_right,
               RANK() OVER (g ORDER BY df.v2) AS v2,
               RANK() OVER (g ORDER BY df2.v2) AS v2_right
        FROM df
        JOIN df AS df2
            ON (
                df.id = df2.id
                AND df.id2 = df2.id2
                AND df.id3 > df2.id3
                AND df.id3 < df2.id3 + 30
            )
        WINDOW g AS (PARTITION BY df.id2, df.id3, df2.id3)
    )
    SELECT id2, id3, id3_right, corr(v1, v1_right) AS v1, corr(v2, v2_right) AS v2
    FROM cte
    GROUP BY id2, id3, id3_right
    """
).pl()
print(time.perf_counter() - start)
# 104.54312287131324