The following example involves a self conditional join followed by a groupby/aggregate operation. It turns out that in this case DuckDB gives much better performance than Polars (~10x on a 32-core machine).
My questions are:
- What could be the potential reason(s) for the slowness of Polars relative to DuckDB?
- Am I missing some other, faster way of doing the same thing in Polars?
import time
import duckdb
import numpy as np
import polars as pl
## example dataframe
rng = np.random.default_rng(1)
nrows = 5_000_000
df = pl.DataFrame(
    dict(
        id=rng.integers(1, 1_000, nrows),
        id2=rng.integers(1, 10, nrows),
        id3=rng.integers(1, 500, nrows),
        value=rng.normal(0, 1, nrows),
    )
)
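
As a back-of-the-envelope check on how heavy this query is (my own sketch, not part of the benchmark): with id drawn uniformly from 1..999 and id2 from 1..9, there are ~8,991 distinct (id, id2) pairs, so the equi-join alone produces roughly 5_000_000**2 / 8_991 ≈ 2.8 billion candidate rows before the inequality filter runs.

## back-of-the-envelope: size of the intermediate equi-join,
## i.e. the sum of squared group sizes over (id, id2)
join_rows = (
    df.groupby(["id", "id2"])
    .agg(pl.count().alias("n"))
    .select((pl.col("n") ** 2).sum())
    .item()
)
print(join_rows)  # roughly 2.8 billion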
## polars: left equi-join on (id, id2), then filter down to the band 0 < id3 - id3_right < 30
start = time.perf_counter()
res = (
    df.lazy()
    .join(df.lazy(), on=["id", "id2"], how="left")
    .filter(
        (pl.col("id3") > pl.col("id3_right"))
        & (pl.col("id3") - pl.col("id3_right") < 30)
    )
    .groupby(["id2", "id3", "id3_right"])
    .agg(pl.corr("value", "value_right"))
    .collect(streaming=True)
)
time.perf_counter() - start
# 120.93155245436355
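
Regarding question 1, one way to investigate is to print the optimized query plan; I would expect it to show whether the inequality filter is pushed into the join or only applied after the full equi-join is materialized (sketch below, output omitted):

## sketch: inspect the optimized plan instead of collecting
plan = (
    df.lazy()
    .join(df.lazy(), on=["id", "id2"], how="left")
    .filter(
        (pl.col("id3") > pl.col("id3_right"))
        & (pl.col("id3") - pl.col("id3_right") < 30)
    )
    .explain()
)
print(plan)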
## duckdb: the same conditional join, with the inequality conditions in the ON clause
start = time.perf_counter()
res2 = (
    duckdb.sql(
        """
        SELECT df.*, df2.id3 AS id3_right, df2.value AS value_right
        FROM df JOIN df AS df2
        ON (df.id = df2.id
            AND df.id2 = df2.id2
            AND df.id3 > df2.id3
            AND df.id3 - df2.id3 < 30)
        """
    )
    .aggregate(
        "id2, id3, id3_right, corr(value, value_right) AS value",
        "id2, id3, id3_right",
    )
    .pl()
)
time.perf_counter() - start
# 18.472263277042657
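
For question 2, one rewrite I considered (an untested sketch, not benchmarked; the helper column id3_key is my own naming): because the filter pins id3 - id3_right to the range 1..29, the band join can be unrolled into 29 plain equi-joins, one per offset k, which keeps everything on Polars' hash-join path.

## sketch: unroll the band join into 29 exact equi-joins and concatenate
lf = df.lazy()
parts = [
    lf.join(
        lf.with_columns((pl.col("id3") + k).alias("id3_key")),
        left_on=["id", "id2", "id3"],
        right_on=["id", "id2", "id3_key"],
        how="inner",
    )
    for k in range(1, 30)
]
res3 = (
    pl.concat(parts)
    .groupby(["id2", "id3", "id3_right"])
    .agg(pl.corr("value", "value_right"))
    .collect(streaming=True)
)

(Newer Polars releases also ship a dedicated join_where for inequality joins, if I read the changelog correctly, but I have not tried it here.)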