I noticed something in Python Polars: pl.when().then().otherwise() seems to be slow, though I'm not sure where the time goes. For instance, take this dataframe:

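from random import randint

import polars as pl

# Two columns of 30 million random unsigned 64-bit integers each
df = pl.DataFrame({
    'A': [randint(1, 10**15) for _ in range(30_000_000)],
    'B': [randint(1, 10**15) for _ in range(30_000_000)],
}, schema={
    'A': pl.UInt64,
    'B': pl.UInt64,
})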

Horizontal min with pl.min_horizontal:

df.with_columns(
    pl.min_horizontal(['A', 'B']).alias('min_column')
)
92.4 ms ± 16.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

And the same with when().then().otherwise():

df.with_columns(
    pl.when(
        pl.col('A') < pl.col('B')
    ).then(pl.col('A')).otherwise(pl.col('B')).alias('min_column'),
)
458 ms ± 75.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

I timed the comparison inside when() separately, and it does not seem to be the bottleneck:

df.with_columns((pl.col('A') < pl.col('B')).alias('column_comparison'))
49.2 ms ± 6.23 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

If I remove otherwise(), it gets even slower:

df.with_columns(
    pl.when(
        pl.col('A') < pl.col('B')
    ).then(pl.col('A')).alias('min_column')
)
664 ms ± 19.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

I also tried other methods for horizontal reduction, such as pl.reduce and pl.fold, and they all seem to be much faster than when().then().

So my questions are:

  1. Is this expected behavior?
  2. Why is pl.when().then() so much slower than other expressions?
  3. In which cases should we avoid when().then().otherwise()?
s-b90
  • Your question kind of reads as: "Is it a bug that an optimized, task-specific function is faster than `pl.when().then()`?" - in which case it seems like the answer is somewhat obvious? Perhaps another question to ask is "Can the Polars optimizer detect this case and call the underlying function for me?" – jqurious Aug 11 '23 at 13:02
  • For now, I just want to understand why `pl.when().then()` is so slow: is it a bug, or is everything working as expected? If everything is fine in general, then we can definitely ask the devs to optimize `pl.when().then()` for such tasks, if that is possible. – s-b90 Aug 11 '23 at 13:41
  • To expand on your question: if I time `df.with_columns(pl.when(pl.lit(1)==pl.lit(1)).then(pl.col('A')).otherwise(pl.col('B')))` I get 730 µs, so there's more nuance to it than just `when.then` being slow. In other words, when/then can be fast and calculating the bool can be fast, but doing them together is *relatively* slow. – Dean MacGregor Aug 11 '23 at 14:50
  • I assume the slowness is because when/then/otherwise actually has to do more work: it has to do the comparison, just like `min_horizontal`, but then afterwards it has to go back and grab an element in the same row. Meanwhile `min_horizontal` only has to visit each element in a row once. – BallpointBen Aug 11 '23 at 15:37

1 Answer

I got some comments from the Polars developers on Discord:

I don't see anything out of the ordinary here. Removing otherwise just means .otherwise(pl.lit(None)) is called in the background. It will have to create that column rather than using the existing one. So it will be slower. If you can write your expression as a fold it might be faster, as you have noticed with min_horizontal.

So my conclusion is: when the task is to reduce several columns into one, it is better to use fold or reduce (or a dedicated function such as min_horizontal) where possible, instead of when().then().

s-b90