
I have a polars dataframe illustrated as follows.

import polars as pl

df = pl.DataFrame(
    {
        "a": [1, 4, 3, 2, 8, 4, 5, 6],
        "b": [2, 3, 1, 3, 9, 7, 6, 8],
        "c": [1, 1, 1, 1, 2, 2, 2, 2],
    }
)

The task I have is

  1. groupby column "c"
  2. for each group, check whether all values in column "a" are less than the corresponding values in column "b".
    • If so, just return column "a" as-is in the groupby context.
    • Otherwise, apply a third-party function called "convert", which takes two numpy arrays and returns a single numpy array of the same size. In my case, I can convert columns "a" and "b" to numpy arrays and pass them to "convert", then return the resulting array (probably transformed into a polars Series first) in the groupby context.

So, for the example above, the output I want is as follows (exploded after groupby for better illustration).

shape: (8, 2)
┌─────┬─────┐
│ c   ┆ a   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 1   │
│ 1   ┆ 3   │
│ 1   ┆ 1   │
│ 1   ┆ 2   │
│ 2   ┆ 8   │
│ 2   ┆ 4   │
│ 2   ┆ 5   │
│ 2   ┆ 6   │
└─────┴─────┘

With the assumption that "convert" behaves as follows,

>>> import numpy as np
>>> convert(np.array([1, 4, 3, 2]), np.array([2, 3, 1, 3]))
array([1, 3, 1, 2])

# [1, 4, 3, 2] is from column a of df when column c is 1, and [2, 3, 1, 3] comes from column b of df when column c is 1.
# I have to apply my custom python function 'convert' for the c == 1 group, because not all values in a are smaller than those in b according to the task description above.
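
For the rest of this post to be runnable, here is a hypothetical stand-in for "convert" that reproduces the example output above (the real "convert" is a black box; np.minimum just happens to match these numbers):

def convert(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    # Placeholder for the real third-party function: it only needs to
    # take two numpy arrays and return one numpy array of the same size.
    return np.minimum(a, b)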

My question is: how should I implement this logic in a performant, polars-idiomatic way, without sacrificing too much of the speed gained from running Rust code and parallelization?

The reason I ask is that, from my understanding, using apply with a custom Python function slows the program down, but in certain scenarios I will not need to resort to the third-party function at all. So, is there any way to get the best of both worlds: the full benefits of polars when no third-party function is required, and applying the third-party function only when necessary?
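
For reference, the straightforward approach I want to improve on is applying a single Python wrapper over every group unconditionally. A minimal sketch, using the hypothetical "convert" stub above:

out = (
    df
    .groupby("c", maintain_order=True)
    .agg(
        pl.apply(
            ["a", "b"],
            # the wrapper receives the group's "a" and "b" as a list of Series
            lambda s: pl.Series(
                s[0].to_numpy()
                if (s[0] < s[1]).all()
                else convert(s[0].to_numpy(), s[1].to_numpy())
            ),
        ).alias("a")
    )
    .explode("a")
)

This always pays the Python-function cost, even for groups (like c == 2 here) where "a" could have been returned untouched.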

lebesgue
  • What does `convert` really do? Your example is the same as `df.with_columns(pl.min(["a", "b"]).over("c"))`? – jqurious Feb 09 '23 at 23:44
  • Think of it as a complex black box function in general which uses functionalities from numpy and scipy and highly likely it cannot be done by using polars expressions, at least for now. The numbers there are just for illustrative purposes. Don't take those actual numbers too seriously. The point is that in certain cases I have to use convert. – lebesgue Feb 10 '23 at 01:21
  • Okay, maybe something can be done with expressions - there were some interesting replies about numpy/scipy here: https://stackoverflow.com/questions/75303038/how-to-write-poisson-cdf-as-python-polars-expression/75311287#75311287 – jqurious Feb 10 '23 at 05:46

1 Answer


It sounds like you want to find matching groups:

(
   df
   .with_row_count()
   .filter(
      (pl.col("a") >= pl.col("b"))
      .any()
      .over("c"))
)
shape: (4, 4)
┌────────┬─────┬─────┬─────┐
│ row_nr ┆ a   ┆ b   ┆ c   │
│ ---    ┆ --- ┆ --- ┆ --- │
│ u32    ┆ i64 ┆ i64 ┆ i64 │
╞════════╪═════╪═════╪═════╡
│ 0      ┆ 1   ┆ 2   ┆ 1   │
│ 1      ┆ 4   ┆ 3   ┆ 1   │
│ 2      ┆ 3   ┆ 1   ┆ 1   │
│ 3      ┆ 2   ┆ 3   ┆ 1   │
└────────┴─────┴─────┴─────┘

And apply your custom function over each group.

(      
   df
   .with_row_count()
   .filter(
      (pl.col("a") >= pl.col("b"))
      .any()
      .over("c"))
   .select(
      pl.col("row_nr"),
      pl.apply(
         ["a", "b"], # np.minimum is just for example purposes
         lambda s: np.minimum(s[0], s[1]))
      .over("c"))
)
shape: (4, 2)
┌────────┬─────┐
│ row_nr ┆ a   │
│ ---    ┆ --- │
│ u32    ┆ i64 │
╞════════╪═════╡
│ 0      ┆ 1   │
│ 1      ┆ 3   │
│ 2      ┆ 1   │
│ 3      ┆ 2   │
└────────┴─────┘

(Note: there may be some useful information in How to Write Poisson CDF as Python Polars Expression with regards to scipy/numpy ufuncs and potentially avoiding .apply())
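
As an illustration of that idea: polars expressions implement __array_ufunc__, so NumPy ufuncs can be dispatched on them directly. If "convert" happened to be an elementwise ufunc like np.minimum, you could skip .apply() entirely. A sketch under that assumption - worth verifying on your polars version:

(
   df
   .with_row_count()
   .filter(
      (pl.col("a") >= pl.col("b"))
      .any()
      .over("c"))
   .select(
      pl.col("row_nr"),
      # np.minimum receives the expressions and returns an expression,
      # so this stays vectorized with no per-group Python callback
      np.minimum(pl.col("a"), pl.col("b")).alias("a"))
)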

You can then .join() the result back into the original data.

(
   df
   .with_row_count()
   .join(
      df
      .with_row_count()
      .filter(
         (pl.col("a") >= pl.col("b"))
         .any()
         .over("c"))
      .select(
         pl.col("row_nr"),
         pl.apply(
            ["a", "b"],
            lambda s: np.minimum(s[0], s[1]))
         .over("c")),
      on="row_nr",
      how="left")
)
shape: (8, 5)
┌────────┬─────┬─────┬─────┬─────────┐
│ row_nr ┆ a   ┆ b   ┆ c   ┆ a_right │
│ ---    ┆ --- ┆ --- ┆ --- ┆ ---     │
│ u32    ┆ i64 ┆ i64 ┆ i64 ┆ i64     │
╞════════╪═════╪═════╪═════╪═════════╡
│ 0      ┆ 1   ┆ 2   ┆ 1   ┆ 1       │
│ 1      ┆ 4   ┆ 3   ┆ 1   ┆ 3       │
│ 2      ┆ 3   ┆ 1   ┆ 1   ┆ 1       │
│ 3      ┆ 2   ┆ 3   ┆ 1   ┆ 2       │
│ 4      ┆ 8   ┆ 9   ┆ 2   ┆ null    │
│ 5      ┆ 4   ┆ 7   ┆ 2   ┆ null    │
│ 6      ┆ 5   ┆ 6   ┆ 2   ┆ null    │
│ 7      ┆ 6   ┆ 8   ┆ 2   ┆ null    │
└────────┴─────┴─────┴─────┴─────────┘

You can then fill in the nulls.

.with_columns(
   pl.col("a_right").fill_null(pl.col("a")))
jqurious
  • Thanks! How does this compare with using apply on all groups directly (with the custom function being a wrapper over 'convert')? Will this method result in faster performance? – lebesgue Feb 10 '23 at 14:27
  • You would have to create a test case with your real data and real "convert" function and benchmark the two approaches. One would think the filter/join to skip non-matching groups would be "faster" - but benchmarking your specific use-case is how to find out. And you could let us know which was faster. – jqurious Feb 10 '23 at 15:06
  • Hi @jqurious, I was reading an SO answer a few days back recommending that apply should be used within a groupby, and that when using it within a select, it's better to use it with a pl.struct for performance reasons. Is that correct? Or does it not matter whether pl.apply is used directly or with pl.struct? – Luca Feb 11 '23 at 22:11
  • @Luca Can you link to the SO answer? – jqurious Feb 12 '23 at 00:32
  • @jqurious: here is the link to the answer from Ritchie: https://stackoverflow.com/questions/71658991/how-to-write-polars-custom-apply-function-that-does-the-processing-row-by-row I am not sure if I am interpreting it the right way – Luca Feb 13 '23 at 15:58
  • @Luca I guess it can be somewhat confusing. Even though it's inside a `.select()` what is being used here is `pl.apply().over()` - the `.over()` [creates a groupby context.](https://pola-rs.github.io/polars-book/user-guide/dsl/window_functions.html#groupby-aggregations-in-selection) – jqurious Feb 13 '23 at 18:43
  • @jqurious got it. do you know why apply performs worse outside of a groupby? Or if there is any SO question explaining this – Luca Feb 13 '23 at 19:17
  • I'm not sure if there is an SO question - but perhaps this helps: https://bpa.st/raw/EY752 - essentially, in a "select context" when you `.apply()` a function, it is called once per row. – jqurious Feb 13 '23 at 21:21
  • Following up on the thread: how does pl.apply().over() compare with pl.struct().over() in terms of performance? – lebesgue Feb 22 '23 at 21:09
  • @lebesgue I'm not sure. Perhaps you can compare the approaches on your data and tell us? – jqurious Feb 22 '23 at 21:25