I noticed something in Python Polars. I'm not certain, but it seems that pl.when().then().otherwise() is slow somewhere. For instance, take this DataFrame:
import polars as pl
from random import randint

df = pl.DataFrame({
    'A': [randint(1, 10**15) for _ in range(30_000_000)],
    'B': [randint(1, 10**15) for _ in range(30_000_000)],
}, schema={
    'A': pl.UInt64,
    'B': pl.UInt64,
})
Horizontal min with pl.min_horizontal:
df.with_columns(
    pl.min_horizontal(['A', 'B']).alias('min_column')
)
92.4 ms ± 16.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
And the same with when().then().otherwise():
df.with_columns(
    pl.when(
        pl.col('A') < pl.col('B')
    ).then(pl.col('A')).otherwise(pl.col('B')).alias('min_column'),
)
458 ms ± 75.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
I measured the when() condition on its own, and it does not seem to be the bottleneck:
df.with_columns((pl.col('A') < pl.col('B')).alias('column_comparison'))
49.2 ms ± 6.23 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Removing otherwise() makes it even slower:
df.with_columns(
    pl.when(
        pl.col('A') < pl.col('B')
    ).then(pl.col('A')).alias('min_column')
)
664 ms ± 19.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
I also tried some other methods for horizontal reduction, such as pl.reduce and pl.fold, and they all seem to be much faster than when().then().
So the questions here:
- Is this expected behavior?
- Why is pl.when().then() so much slower than other expressions?
- In which cases should we avoid when().then().otherwise()?