How do I use a ufunc that reduces to a scalar in the context of aggregation? For example, summarizing a table using numpy.trapz:

import polars as pl
import numpy as np

df = pl.DataFrame(dict(id=[0, 0, 0, 1, 1, 1], t=[2, 4, 5, 10, 11, 14], y=[0, 1, 1, 2, 3, 4]))
df.groupby('id').agg(pl.map(['t', 'y'], np.trapz))
# Segmentation fault (core dumped)
drhagen
  • I have made sure that we now throw an error hinting the solution instead of the core dump: https://github.com/pola-rs/polars/pull/3052 – ritchie46 Apr 03 '22 at 09:15

1 Answer

Edit: as of Polars 0.13.18, the apply method converts Numpy datatypes to Polars datatypes without requiring the Numpy item method.

Use apply in a groupby context (rather than map).

In this case, numpy's trapz function has only one required parameter (y); x is optional and comes second.

numpy.trapz(y, x=None, dx=1.0, axis=-1)

So we'll pass x explicitly by keyword in our call, which keeps the mapping unambiguous. (I also assumed that you meant for your y column to be mapped as the y parameter, and your t column to be mapped as the x parameter in the call to numpy.)

The Series 'y' and 't' will be passed as a list of Series to the lambda function, so we'll use indices to indicate which column maps to which numpy parameter.
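The index-to-keyword mapping can be sketched without Polars or numpy. In this minimal sketch, trapezoid_area is a hypothetical pure-Python stand-in with the same (y, x) parameter order as numpy.trapz, and a plain list of lists plays the role of the list of Series that the lambda receives.

```python
def trapezoid_area(y, x):
    """Hypothetical stand-in for numpy.trapz(y, x=...): the trapezoidal rule."""
    return sum(
        (x[i + 1] - x[i]) * (y[i] + y[i + 1]) / 2
        for i in range(len(y) - 1)
    )

# Same order as the column list ["y", "t"]: index 0 is y, index 1 is t.
# These are the id == 0 rows from the question.
cols = [[0, 1, 1], [2, 4, 5]]

# The lambda picks each column by index and names the parameter it feeds.
integrate = lambda lst: trapezoid_area(y=lst[0], x=lst[1])
area = integrate(cols)

# Swapping the indices integrates t over y instead, a different quantity.
swapped = trapezoid_area(y=cols[1], x=cols[0])

print(area, swapped)
```

Naming the parameters means the result depends only on which index feeds which keyword, not on the order in which the arguments are written.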

One additional wrinkle: numpy returns a value of type numpy.float64, rather than a Python float.

type(np.trapz([0, 1, 1], x=[2, 4, 5]))
<class 'numpy.float64'>

In Polars versions before 0.13.18, the apply function does not automatically convert a numpy.float64 to polars.Float64. To remedy this, we'll use the numpy item method to have numpy return a Python float, rather than a numpy.float64.

type(np.trapz([0, 1, 1], x=[2, 4, 5]).item())
<class 'float'>

With this in hand, we can now write our apply statement.

df.groupby("id").agg(
    pl.apply(
        ["y", "t"],
        lambda lst: np.trapz(y=lst[0], x=lst[1]).item()
    )
)
shape: (2, 2)
┌─────┬──────┐
│ id  ┆ y    │
│ --- ┆ ---  │
│ i64 ┆ f64  │
╞═════╪══════╡
│ 1   ┆ 13.0 │
├╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 0   ┆ 2.0  │
└─────┴──────┘
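As a sanity check on the values in the table, the trapezoidal rule can be worked by hand for each group. The sketch below uses only the standard library and the data from the question; the variable names are illustrative and not part of either library.

```python
from collections import defaultdict

ids = [0, 0, 0, 1, 1, 1]
t = [2, 4, 5, 10, 11, 14]
y = [0, 1, 1, 2, 3, 4]

# Collect the (t, y) samples belonging to each id, preserving row order.
groups = defaultdict(list)
for i, ti, yi in zip(ids, t, y):
    groups[i].append((ti, yi))

# Trapezoidal rule: sum of (width * average height) over adjacent samples.
areas = {}
for key, pts in groups.items():
    areas[key] = sum(
        (t1 - t0) * (y0 + y1) / 2
        for (t0, y0), (t1, y1) in zip(pts, pts[1:])
    )

print(areas)  # {0: 2.0, 1: 13.0}
```

For id 0 the two segments contribute 1.0 + 1.0, and for id 1 they contribute 2.5 + 10.5, matching the 2.0 and 13.0 shown above.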
  • Yep, `apply` is the solution. I would prefer to get a numeric column rather than an object, but `return_dtype=pl.datatypes.Float64` does not appear to do anything. Am I misunderstanding that argument? – drhagen Apr 03 '22 at 10:48
  • Sorry about that. Numpy is returning a value of type numpy.float64, which polars is not automatically converting to type polars.Float64. There's an easy fix. I'll revise my answer so that apply returns a column of type polars.Float64. –  Apr 03 '22 at 16:01