I need to create a new column in my dataframe that stores processed values. I used the Polars apply function to process DICOM files and return a value for each row, but by default the function receives each entire column as a Polars Series instead of being called row by row.

df = df.with_columns(
    [
        pl.apply(
            exprs=["Filename", "Dicom_Tag", "Dicom_Tag_Corrected", "Name"],
            f=apply_corrections_polars,
        ).alias("dicom_tag_value_corrected"),
    ]
)
Pradeepgb

1 Answer

As the documentation of pl.apply states, it should not be used in the select context. It is meant only for groupby operations, where it applies a function over each group.

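For custom functions that operate on the items of each row, you can utilize the Struct data type.

Since polars>=0.13.16 you can apply a function over a Struct column. A Struct can be composed of any set of columns, and the function receives each row as a dict keyed by field name.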

df = pl.DataFrame({"ham": [2, 2, 3], 
              "spam": [11, 22, 33], 
              "foo": [3, 2, 1]})

def my_complicated_function(struct: dict) -> int:
    """
    A function that cannot utilize polars expressions.
    This should be avoided.
    """

    # do work
    return struct["ham"] + struct["spam"] + struct["foo"]

df.select([
    pl.struct(["ham", "spam", "foo"]).apply(my_complicated_function)
])

shape: (3, 1)
┌─────┐
│ ham │
│ --- │
│ i64 │
╞═════╡
│ 16  │
├╌╌╌╌╌┤
│ 26  │
├╌╌╌╌╌┤
│ 37  │
└─────┘
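
To make the calling convention concrete, here is a minimal plain-Python sketch of what the struct approach does: the function is called once per row and receives that row as a dict. The helper `apply_rowwise` is a hypothetical stand-in written for illustration only; it is not part of polars and says nothing about how polars implements this internally.

```python
from typing import Any, Callable, Dict, List


def apply_rowwise(
    columns: Dict[str, List[Any]],
    func: Callable[[Dict[str, Any]], Any],
) -> List[Any]:
    """Hypothetical helper: call func once per row, passing the row as a dict.

    This only models the calling convention of applying over a Struct;
    it is not how polars is implemented.
    """
    names = list(columns)
    n_rows = len(columns[names[0]]) if names else 0
    # Build one dict per row, keyed by column name, and hand it to func.
    return [
        func({name: columns[name][i] for name in names})
        for i in range(n_rows)
    ]


def my_complicated_function(struct: dict) -> int:
    # Same body as the function in the answer above.
    return struct["ham"] + struct["spam"] + struct["foo"]


data = {"ham": [2, 2, 3], "spam": [11, 22, 33], "foo": [3, 2, 1]}
result = apply_rowwise(data, my_complicated_function)
print(result)  # [16, 26, 37]
```

For the columns in the question, the same pattern would presumably be pl.struct(["Filename", "Dicom_Tag", "Dicom_Tag_Corrected", "Name"]) followed by an apply of apply_corrections_polars, assuming that function is rewritten to accept a single dict instead of a list of Series. Note that in newer Polars releases Expr.apply has been renamed to map_elements.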

ritchie46
  • Wow. This new approach using struct performs *much* better! I benchmarked this approach using the struct expression versus the solution using map and a list of expressions. I created a dataframe of 4 columns of 10 million integers each, and a trivial row-wise sum function (lol - not that one should *ever* use either approach for calculating row-sums). The old approach took 49 seconds. The new approach (using struct) used only 12 seconds! Accordingly, I'm removing my answer. Great job! –  Mar 30 '22 at 19:02
  • Nice! I am also thinking of removing the dataframe apply in favor of this. That should make it clear that there is a single way of doing this. – ritchie46 Mar 30 '22 at 19:06
  • NotFoundError: ham in the newest version – lemmingxuan May 09 '22 at 07:50
  • I think you should open an issue on GitHub. In any case, I can confirm that this snippet runs successfully on the latest release on PyPI. – ritchie46 May 09 '22 at 09:24