2

I want to calculate spearman (rank) correlation in a groupby context using DuckDB/SQL syntax. I tried the following, but failed.

import duckdb
import pandas as pd

df = pd.DataFrame(
    {
        "a": [1, 1, 2, 2, 6, 1, 3, 6, 3],
        "b": [4, 2, 6, 4, 3, 1, 6, 4, 8],
        "c": [1, 1, 1, 1, 2, 2, 2, 2, 2],
    }
)

duckdb.sql(
    """
    SELECT c, corr(rank() OVER (ORDER BY a), rank() OVER (ORDER BY b)) as corr
    FROM df
    GROUP BY c
    """
)

Error message:

Parse Error: aggregate function calls cannot contain window function calls

Expected result:

c  corr
-  ----
1  0.70711
2  0.10541
Keptain
  • 147
  • 7

1 Answers1

1

The error is saying that rank() cannot be called inside corr() like that.

You can use a common table expression and call corr() on the result.

duckdb.sql(
    """
    with rank as (
       from df
       select 
          rank() over (partition by c order by a) a,
          rank() over (partition by c order by b) b,
          c,
    )
    from rank
    select
       c, 
       corr(a, b)
       group by c
    """
)

However the result is:

┌───────┬────────────────────┐
│   c   │     corr(a, b)     │
│ int64 │       double       │
├───────┼────────────────────┤
│     1 │ 0.6882472016116852 │
│     2 │                0.0 │
└───────┴────────────────────┘

The ranks:

duckdb.sql(
    """
    with rank as (
       from df
       select 
          rank() over (partition by c order by a) a,
          rank() over (partition by c order by b) b,
          c,
    )
    from rank
    """
)
┌───────┬───────┬───────┐
│   a   │   b   │   c   │
│ int64 │ int64 │ int64 │
├───────┼───────┼───────┤
│     1 │     1 │     2 │
│     4 │     2 │     2 │
│     4 │     3 │     2 │
│     2 │     4 │     2 │
│     2 │     5 │     2 │
│     1 │     1 │     1 │
│     1 │     2 │     1 │
│     3 │     2 │     1 │
│     3 │     4 │     1 │
└───────┴───────┴───────┘
jqurious
  • 9,953
  • 1
  • 4
  • 14
  • That is my question 2. Is it possible to calculate spearman correlation in SQL/DuckDB which could match exactly with results from python/scipy? – Keptain Jun 17 '23 at 22:57
  • I don't understand the math part myself, it's probably better to ask on the [duckdb issues/discussions](https://github.com/duckdb/duckdb/issues) if the results are not what you expect. – jqurious Jun 18 '23 at 16:53