
This is a new question/issue as a follow-up to "How to return multiple stats as multiple columns in Polars groupby context?" and "How to flatten/split a tuple of arrays and calculate column means in Polars dataframe?".

The problem can be illustrated by the example below:

from functools import partial

import polars as pl
import statsmodels.api as sm


def ols_stats(s, yvar, xvars):
    df = s.struct.unnest()
    reg = sm.OLS(df[yvar].to_numpy(), df[xvars].to_numpy(), missing="drop").fit()
    return pl.Series(values=(reg.params, reg.tvalues), nan_to_null=True)


df = pl.DataFrame(
    {
        "day": [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
        "y": [1, 6, 3, 2, 8, 4, 5, 2, 7, 3],
        "x1": [1, 8, 2, 3, 5, 2, 1, 2, 7, 3],
        "x2": [8, 5, 3, 6, 3, 7, 3, 2, 9, 1],
    }
).lazy()

res = df.groupby("day").agg(
    pl.struct(["y", "x1", "x2"])
    .apply(partial(ols_stats, yvar="y", xvars=["x1", "x2"]))
    .alias("params")
)

res.with_columns(
    pl.col("params").arr.eval(pl.element().arr.explode()).arr.to_struct()
).unnest("params").collect()

Running the code above raises the following error:

pyo3_runtime.PanicException: not implemented for dtype Unknown

However, when `.lazy()` and `.collect()` are removed from the code above, it works exactly as intended. Below is the expected output when running in eager mode:

shape: (2, 5)
┌─────┬──────────┬──────────┬──────────┬───────────┐
│ day ┆ field_0  ┆ field_1  ┆ field_2  ┆ field_3   │
│ --- ┆ ---      ┆ ---      ┆ ---      ┆ ---       │
│ i64 ┆ f64      ┆ f64      ┆ f64      ┆ f64       │
╞═════╪══════════╪══════════╪══════════╪═══════════╡
│ 2   ┆ 0.466089 ┆ 0.503127 ┆ 0.916982 ┆ 1.451151  │
│ 1   ┆ 1.008659 ┆ -0.03324 ┆ 3.204266 ┆ -0.124422 │
└─────┴──────────┴──────────┴──────────┴───────────┘

So, what is the problem and how am I supposed to resolve it?

lebesgue
  • If there is an exception with `.lazy()` and it works without `.lazy()` - it probably needs to be reported as a bug: https://github.com/pola-rs/polars/issues/ – jqurious Mar 03 '23 at 23:26

2 Answers


Don't return a Series from `ols_stats()`; return a dict instead, and it should work. This is also semantically better: in the struct you show at the end, the first two fields are params and the last two are tvalues, which is a mess. Try this instead:

def ols_stats(s, yvar, xvars):
    df = s.struct.unnest()
    reg = sm.OLS(df[yvar].to_numpy(), df[xvars].to_numpy(), missing="drop").fit()
    return {"params": reg.params.tolist(), "tvalues": reg.tvalues.tolist()}

Polars automatically turns the dict[list[f64]] into a struct[2]. I had to play around a bit to figure this out but it seems to work.

This way you end up with semantically meaningful results:

shape: (3, 3)
┌─────┬─────────────────────────────────┬────────────────────────────────┐
│ day ┆ params                          ┆ tvalues                        │
│ --- ┆ ---                             ┆ ---                            │
│ i64 ┆ list[f64]                       ┆ list[f64]                      │
╞═════╪═════════════════════════════════╪════════════════════════════════╡
│ 1   ┆ [4.866232, 0.640294, -0.659869] ┆ [1.547251, 1.81586, -1.430613] │
│ 3   ┆ [0.5, 0.5]                      ┆ [0.0, 0.0]                     │
│ 2   ┆ [2.0462, 0.223971, 0.336793]    ┆ [1.524834, 0.495378, 1.091109] │
└─────┴─────────────────────────────────┴────────────────────────────────┘

Now it works lazily:

res = df.lazy().groupby("day").agg(
    pl.struct(["y", "x1", "x2"])
    .apply(partial(ols_stats, yvar="y", xvars=["x1", "x2"]))
    .alias("params")
).unnest("params").collect()

If you want things unnested, why not return them unnested immediately as:

def ols_stats(s, yvar, xvars):
    df = s.struct.unnest()
    reg = sm.OLS(df[yvar].to_numpy(), df[xvars].to_numpy(), missing="drop").fit()
    param_dict = {f"param_{i}": v for i, v in enumerate(reg.params.tolist())}
    tvalues_dict = {f"tvalue_{i}": v for i, v in enumerate(reg.tvalues.tolist())}
    # Merge the two dicts with the union operator (Python 3.9+)
    return param_dict | tvalues_dict

df = pl.DataFrame(
    {
        "day": [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
        "y": [1, 6, 3, 2, 8, 4, 5, 2, 7, 3],
        "x1": [1, 8, 2, 3, 5, 2, 1, 2, 7, 3],
        "x2": [8, 5, 3, 6, 3, 7, 3, 2, 9, 1],
    }
).lazy()

res = df.groupby("day").agg(
    pl.struct(["y", "x1", "x2"])
    .apply(partial(ols_stats, yvar="y", xvars=["x1", "x2"]))
    .alias("results")
).unnest("results").collect()
print(res)

Returns:

shape: (2, 5)
┌─────┬──────────┬──────────┬──────────┬───────────┐
│ day ┆ param_0  ┆ param_1  ┆ tvalue_0 ┆ tvalue_1  │
│ --- ┆ ---      ┆ ---      ┆ ---      ┆ ---       │
│ i64 ┆ f64      ┆ f64      ┆ f64      ┆ f64       │
╞═════╪══════════╪══════════╪══════════╪═══════════╡
│ 1   ┆ 1.008659 ┆ -0.03324 ┆ 3.204266 ┆ -0.124422 │
│ 2   ┆ 0.466089 ┆ 0.503127 ┆ 0.916982 ┆ 1.451151  │
└─────┴──────────┴──────────┴──────────┴───────────┘
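The key step in this version is the field naming and the dict merge, which is plain Python and easy to sanity-check on its own. The arrays below are hypothetical stand-ins for `reg.params` and `reg.tvalues`:

```python
# Hypothetical stand-ins for reg.params / reg.tvalues.
params = [1.008659, -0.03324]
tvalues = [3.204266, -0.124422]

# Build one scalar field per coefficient...
param_dict = {f"param_{i}": v for i, v in enumerate(params)}
tvalues_dict = {f"tvalue_{i}": v for i, v in enumerate(tvalues)}

# ...and merge with the dict-union operator (Python 3.9+).
# The field names adapt automatically to the number of xvars.
flat = param_dict | tvalues_dict
```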
Cornelius Roemer
  • It looks like after returning a dict, only the first level can be unpacked successfully. To get each scalar stat into a separate column at the end, I have to append this to your code (removing `.collect()` first): `.with_columns(pl.col("params").arr.to_struct()).unnest("params").collect()`. And it still raises an error after that. – lebesgue Mar 04 '23 at 05:55
  • By the way, my example df removed the corner/problematic day 3. – lebesgue Mar 04 '23 at 06:00
  • Why do you want to unpack the parameters? Just leave them as a `list[f64]` - what's the point of having some fields? As discussed above, `pl.col("params").arr.eval(pl.element().arr.explode()).arr.to_struct()` is the reason you get the error. If you know that the regression always returns lists of length 3, you can return them in the dict immediately. No need to explode things of unknown length. – Cornelius Roemer Mar 04 '23 at 06:22
  • Unpack to separate columns so as to do more downstream processes conveniently, like compute average, min, max or even do more regressions over resulting columns. In my use case, the length will always be the same, but the exact length is not static - dependent on the length of xvars. – lebesgue Mar 04 '23 at 06:52
  • See new implementation (towards bottom) - it does exactly what you want and gives meaningful names. – Cornelius Roemer Mar 04 '23 at 07:46

As @jqurious points out in the comments, this may well be a bug if there's a difference in behaviour between lazy and eager.

However, the documentation of `arr.to_struct()` hints at what may be going on: in particular, the `upper_bound` parameter points to a special requirement of a `LazyFrame`, namely needing to know the schema at all times.

If we look at the intermediate output of your query (running in eager mode) after this point:

res = df.groupby("day").agg(
    pl.struct(["y", "x1", "x2"])
    .apply(partial(ols_stats, yvar="y", xvars=["x1", "x2"]))
    .alias("params")
)

res looks like this:

shape: (3, 2)
┌─────┬───────────────────────────────────────────────────────────────────┐
│ day ┆ params                                                            │
│ --- ┆ ---                                                               │
│ i64 ┆ list[list[f64]]                                                   │
╞═════╪═══════════════════════════════════════════════════════════════════╡
│ 1   ┆ [[4.866232, 0.640294, -0.659869], [1.547251, 1.81586, -1.430613]] │
│ 3   ┆ [[0.5, 0.5], [0.0, 0.0]]                                          │
│ 2   ┆ [[2.0462, 0.223971, 0.336793], [1.524834, 0.495378, 1.091109]]    │
└─────┴───────────────────────────────────────────────────────────────────┘

Note that the sublists for day 3 have length 2, in contrast to length 3 for days 1 and 2.

This is almost definitely something you don't want to happen, so Polars throwing an error is maybe not such a bad thing.

In the last step, you now want to turn these nested lists into a struct. But depending on which day comes first, you will get different results. In fact, if you run the first part multiple times, sometimes day 3 will come first; `to_struct` will then produce 4 fields with the default settings.

Try it yourself. If day 3 comes first, like this:

shape: (3, 2)
┌─────┬─────────────────────────────────────┐
│ day ┆ params                              │
│ --- ┆ ---                                 │
│ i64 ┆ list[list[f64]]                     │
╞═════╪═════════════════════════════════════╡
│ 3   ┆ [[0.5, 0.5], [0.0, 0.0]]            │
│ 1   ┆ [[4.866232, 0.640294, -0.659869]... │
│ 2   ┆ [[2.0462, 0.223971, 0.336793], [... │
└─────┴─────────────────────────────────────┘

the last step will result in the following struct:

shape: (3, 5)
┌─────┬──────────┬──────────┬───────────┬──────────┐
│ day ┆ field_0  ┆ field_1  ┆ field_2   ┆ field_3  │
│ --- ┆ ---      ┆ ---      ┆ ---       ┆ ---      │
│ i64 ┆ f64      ┆ f64      ┆ f64       ┆ f64      │
╞═════╪══════════╪══════════╪═══════════╪══════════╡
│ 3   ┆ 0.5      ┆ 0.5      ┆ 0.0       ┆ 0.0      │
│ 2   ┆ 2.0462   ┆ 0.223971 ┆ 0.336793  ┆ 1.524834 │
│ 1   ┆ 4.866232 ┆ 0.640294 ┆ -0.659869 ┆ 1.547251 │
└─────┴──────────┴──────────┴───────────┴──────────┘

By chance, you can get a different order after aggregation:

shape: (3, 2)
┌─────┬─────────────────────────────────────┐
│ day ┆ params                              │
│ --- ┆ ---                                 │
│ i64 ┆ list[list[f64]]                     │
╞═════╪═════════════════════════════════════╡
│ 2   ┆ [[2.0462, 0.223971, 0.336793], [... │
│ 1   ┆ [[4.866232, 0.640294, -0.659869]... │
│ 3   ┆ [[0.5, 0.5], [0.0, 0.0]]            │
└─────┴─────────────────────────────────────┘

This will lead to the following struct:

shape: (3, 7)
┌─────┬──────────┬──────────┬───────────┬──────────┬──────────┬───────────┐
│ day ┆ field_0  ┆ field_1  ┆ field_2   ┆ field_3  ┆ field_4  ┆ field_5   │
│ --- ┆ ---      ┆ ---      ┆ ---       ┆ ---      ┆ ---      ┆ ---       │
│ i64 ┆ f64      ┆ f64      ┆ f64       ┆ f64      ┆ f64      ┆ f64       │
╞═════╪══════════╪══════════╪═══════════╪══════════╪══════════╪═══════════╡
│ 2   ┆ 2.0462   ┆ 0.223971 ┆ 0.336793  ┆ 1.524834 ┆ 0.495378 ┆ 1.091109  │
│ 1   ┆ 4.866232 ┆ 0.640294 ┆ -0.659869 ┆ 1.547251 ┆ 1.81586  ┆ -1.430613 │
│ 3   ┆ 0.5      ┆ 0.5      ┆ 0.0       ┆ 0.0      ┆ null     ┆ null      │
└─────┴──────────┴──────────┴───────────┴──────────┴──────────┴───────────┘

So I think this non-deterministic nature of your query may be at the root of why it does not work in lazy mode. But do open an issue.

Cornelius Roemer
  • I just edited my post. I have the same concern about the day 3 being a problematic case. But when I delete day 3 completely, running the code gives me the same error. Post is updated. – lebesgue Mar 04 '23 at 04:48