
Comparing the following two code sections, whose only difference is that the second one first computes ranks, the second section runs much slower than the first (~5x).

Although the second section involves some extra computation (the ranks), I didn't expect such a big difference and don't understand it. Am I implementing the query very inefficiently? (I don't think so, because I would expect DuckDB to have query optimizations in place.)

Note: a similar computation using Polars does not show such a big slowdown (my original code is not shown here to keep the question short; a hypothetical sketch of what it might look like follows the DuckDB code below).

import time

import duckdb
import numpy as np
import polars as pl

## example dataframe
rng = np.random.default_rng(1)

nrows = 5_000_000
df = pl.DataFrame(
    dict(
        id=rng.integers(1, 1_000, nrows),
        id2=rng.integers(1, 10, nrows),
        id3=rng.integers(1, 500, nrows),
        v1=rng.normal(0, 1, nrows),
        v2=rng.normal(0, 1, nrows),
    )
)

## pearson correlation
start = time.perf_counter()
res = duckdb.sql(
    """
    WITH cte AS (
        SELECT df.id2, df.id3, df2.id3 AS id3_right,
            df.v1, df2.v1 AS v1_right,
            df.v2, df2.v2 AS v2_right
        FROM df
        JOIN df AS df2
        ON (
            df.id = df2.id
            AND df.id2 = df2.id2
            AND df.id3 > df2.id3
            AND df.id3 < df2.id3 + 30
        )
    )
    SELECT id2, id3, id3_right, corr(v1, v1_right) AS v1, corr(v2, v2_right) AS v2
    FROM cte
    GROUP BY id2, id3, id3_right
    """
).pl()
print(time.perf_counter() - start)
# 19.462523670867085

## rank correlation
start = time.perf_counter()
res2 = duckdb.sql(
    """
    WITH cte AS (
        SELECT df.id2, df.id3, df2.id3 AS id3_right,
            RANK() OVER (g ORDER BY df.v1) AS v1,
            RANK() OVER (g ORDER BY df2.v1) AS v1_right,
            RANK() OVER (g ORDER BY df.v2) AS v2,
            RANK() OVER (g ORDER BY df2.v2) AS v2_right
        FROM df
        JOIN df AS df2
        ON (
            df.id = df2.id
            AND df.id2 = df2.id2
            AND df.id3 > df2.id3
            AND df.id3 < df2.id3 + 30
        )
        WINDOW g AS (PARTITION BY df.id2, df.id3, df2.id3)
    )
    SELECT id2, id3, id3_right, corr(v1, v1_right) AS v1, corr(v2, v2_right) AS v2
    FROM cte
    GROUP BY id2, id3, id3_right
    """
).pl()
print(time.perf_counter() - start)
# 104.54312287131324
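
For reference, since the original Polars code was omitted above, here is a minimal sketch of what an equivalent Polars version might look like. This is my reconstruction under the assumption that the same self-join and grouping are used; res_pl and the exact expressions are mine, not the benchmarked code.

## polars equivalent (hypothetical reconstruction, not the benchmarked code):
## same self-join on (id, id2), the same id3 band filter, then the rank
## correlation computed as the Pearson correlation of within-group ranks
res_pl = (
    df.join(df, on=["id", "id2"], suffix="_right")
    .filter(
        (pl.col("id3") > pl.col("id3_right"))
        & (pl.col("id3") < pl.col("id3_right") + 30)
    )
    .group_by("id2", "id3", "id3_right")
    .agg(
        pl.corr(pl.col("v1").rank(), pl.col("v1_right").rank()).alias("v1"),
        pl.corr(pl.col("v2").rank(), pl.col("v2_right").rank()).alias("v2"),
    )
)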
lebesgue
  • Rank is slow, as is almost every window function, but doing it manually was even slower, and you have 4 of them – nbk Jun 17 '23 at 20:29

1 Answer


The rank correlation query computes RANK() in four different windows, each of which requires partitioning and sorting the data set. There is an optimisation that could be done here to share the partitioning and only re-sort, but that has not been implemented. It may not make much difference, however, as the partitions are re-read on separate threads and the data would likely just end up being copied. Also, partitioning is linear, so the sorting would likely dominate anyway. RANK() is also peer-sensitive, and I suspect that the current implementation of that is suboptimal.
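
To illustrate what peer-sensitive means (a small demonstration added here, not part of the original answer): with RANK(), rows that tie in the ORDER BY are peers, they share a rank, and the next rank skips ahead, so the operator has to track peer groups rather than keep a simple running count the way ROW_NUMBER() does.

import duckdb

## the two tied rows share rank 2 and the next rank jumps to 4,
## whereas ROW_NUMBER() just counts rows
print(duckdb.sql("""
    SELECT v,
        RANK() OVER (ORDER BY v) AS rnk,
        ROW_NUMBER() OVER (ORDER BY v) AS rn
    FROM (VALUES (1), (2), (2), (3)) t(v)
"""))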

It might be a better approach to file these questions/issues with the DuckDB issue tracker on GitHub instead of asking the entire world highly technical questions about the internals that only about three people can answer. If you did that, we could have a more detailed discussion.
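
As a practical step before filing (my suggestion, not part of the original answer): DuckDB's EXPLAIN ANALYZE executes a query and reports per-operator timings, so prefixing the rank-correlation query with it should show how much of the runtime the WINDOW operator accounts for. The toy query below only demonstrates the invocation.

import duckdb

## EXPLAIN ANALYZE runs the query and prints per-operator timings;
## prefix the slow rank-correlation query the same way to profile it
print(duckdb.sql("""
    EXPLAIN ANALYZE
    SELECT RANK() OVER (ORDER BY range) AS r FROM range(1000)
"""))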

hawkfish
  • Thanks. I am not an experienced user of either SQL or DuckDB, so I decided to ask here first to see whether the slowness comes from some naive mistake or misunderstanding of DuckDB/SQL. – lebesgue Jun 18 '23 at 17:21
  • No worries! We do appreciate the questions and can bring up such issues here: https://github.com/duckdb/duckdb/issues – hawkfish Jun 19 '23 at 18:36