
In Python Polars, I was wondering if it is possible to use .eval() to perform an operation between a list element and another column. For example, given the following dataframe:

import polars as pl

df = pl.DataFrame({"list": [[2, 2, 2], [3, 3, 3]], "scalar": [1, 2]})

Is it possible to subtract the value of the scalar column from each element of the list column? I.e., to go from this

shape: (2, 2)
┌───────────┬────────┐
│ list      ┆ scalar │
│ ---       ┆ ---    │
│ list[i64] ┆ i64    │
╞═══════════╪════════╡
│ [2, 2, 2] ┆ 1      │
│ [3, 3, 3] ┆ 2      │
└───────────┴────────┘

to this

shape: (2, 3)
┌───────────┬────────┬───────────┐
│ list      ┆ scalar ┆ diff      │
│ ---       ┆ ---    ┆ ---       │
│ list[i64] ┆ i64    ┆ list[i64] │
╞═══════════╪════════╪═══════════╡
│ [2, 2, 2] ┆ 1      ┆ [1, 1, 1] │
│ [3, 3, 3] ┆ 2      ┆ [1, 1, 1] │
└───────────┴────────┴───────────┘

3 Answers


I think that native functionality for this is on the roadmap (see this GitHub issue: https://github.com/pola-rs/polars/issues/8006), but in the meantime you can do it as follows:

df = df.with_row_count().pipe(
    lambda df: df.join(
        df.explode("list")
        .with_columns(sub=pl.col("list") - pl.col("scalar"))
        .groupby("row_nr")
        .agg(pl.col("sub")),
        on="row_nr",
    )
)

Basically, I add a row_nr column to have a unique ID for each row, then pipe so I can use this row_nr column in further operations. I do a join to add the arithmetic column: inside the join I explode the list column to get its elements as rows, do the arithmetic, then do a groupby to gather everything back into a list for each row, and join this new column back to the df.
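
To make that concrete, here is roughly what the inner frame looks like right after the explode, before the arithmetic and the groupby (a sketch on the example data; with_row_count adds a u32 row_nr column starting at 0):

df.with_row_count().explode("list")

shape: (6, 3)
┌────────┬──────┬────────┐
│ row_nr ┆ list ┆ scalar │
│ ---    ┆ ---  ┆ ---    │
│ u32    ┆ i64  ┆ i64    │
╞════════╪══════╪════════╡
│ 0      ┆ 2    ┆ 1      │
│ 0      ┆ 2    ┆ 1      │
│ 0      ┆ 2    ┆ 1      │
│ 1      ┆ 3    ┆ 2      │
│ 1      ┆ 3    ┆ 2      │
│ 1      ┆ 3    ┆ 2      │
└────────┴──────┴────────┘

The subtraction then happens row by row, and the final groupby("row_nr").agg(pl.col("sub")) gathers the results back into one list per original row.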

I'm sure there are other ways to do it, but this should get you going.
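
One caveat if you are on a newer Polars version: several methods used in these answers have since been renamed (groupby to group_by, with_row_count to with_row_index with a default column name of index, and cumsum to cum_sum). Assuming a recent release, the same approach would look like:

df = df.with_row_index().pipe(
    lambda df: df.join(
        df.explode("list")
        .with_columns(sub=pl.col("list") - pl.col("scalar"))
        .group_by("index")
        .agg(pl.col("sub")),
        on="index",
    )
)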

– braaannigan

I'm pretty sure there is a better way to do this, but this works:

(
    df.with_columns(
        pl.lit(True).alias("idx"),
    )
    .with_columns(
        # cumulative sum of True (i.e. 1) values builds a 1-based row index
        pl.cumsum("idx")
    )
    .explode("list")
    .with_columns(
        (pl.col("list") - pl.col("scalar")).alias("diff")
    )
    .groupby("idx")
    .agg(
        "list",
        "diff",
        # scalar is constant within each group, so keep a single value
        pl.col("scalar").first()
    )
    .select(
        "list",
        "scalar",
        "diff",
    )
)

output:

shape: (2, 3)
┌───────────┬────────┬───────────┐
│ list      ┆ scalar ┆ diff      │
│ ---       ┆ ---    ┆ ---       │
│ list[i64] ┆ i64    ┆ list[i64] │
╞═══════════╪════════╪═══════════╡
│ [3, 3, 3] ┆ 2      ┆ [1, 1, 1] │
│ [2, 2, 2] ┆ 1      ┆ [1, 1, 1] │
└───────────┴────────┴───────────┘
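
A couple of notes on this approach: the `lit(True)` plus `cumsum` pair is just a hand-rolled unique row index, so `with_row_count` from the other answers is an equivalent one-liner. Also, the rows come back in a different order than the input (as the output above shows) because `groupby` does not maintain order by default; if the order matters, you can pass `maintain_order=True` to the `groupby` call.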
– Matt

Here are two options for you. The first is a mix of the two existing solutions: it saves the lambda and self-join of the first, and is a bit more concise than the second.

(
    df
    .with_row_count()
    .explode("list")
    .groupby("row_nr")
    .agg(
        "list",
        # first, min, max, etc. will all work here 
        # we just need a single value rather than to build them into a list
        pl.col("scalar").first(),
        # do the subtraction directly in the aggregation 
        # (I love that polars lets you alias inside an agg)
        (pl.col("list") - pl.col("scalar")).alias("diff"),
    )
    .drop("row_nr")
)

And another solution that is perhaps less clear in intent (due to the struct conversion) and relies on each list having the same number of elements (otherwise it produces unexpected results without failing), but it is ever so slightly shorter. It really just replaces the explode and groupby above with unnest and concat_list.

(
    df
    .with_columns(pl.col("list").list.to_struct().alias("struct"))
    # this works if your lists are consistent lengths 
    # you will get a column for each element called field_n
    .unnest("struct")
    .select(
        "list", 
        "scalar",
        # regex to match all of the `field_n` columns
        # also assumes there are no other columns in the df 
        # that match this pattern and did not come from the `unnest`
        pl.concat_list(pl.col(r"^field_\d+$") - pl.col("scalar")).alias("diff")
    )
)
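
For intuition, here is roughly what the frame looks like right after the unnest on the example data, before the select (a sketch; the struct has been expanded into one field_n column per list element):

shape: (2, 5)
┌───────────┬────────┬─────────┬─────────┬─────────┐
│ list      ┆ scalar ┆ field_0 ┆ field_1 ┆ field_2 │
│ ---       ┆ ---    ┆ ---     ┆ ---     ┆ ---     │
│ list[i64] ┆ i64    ┆ i64     ┆ i64     ┆ i64     │
╞═══════════╪════════╪═════════╪═════════╪═════════╡
│ [2, 2, 2] ┆ 1      ┆ 2       ┆ 2       ┆ 2       │
│ [3, 3, 3] ┆ 2      ┆ 3       ┆ 3       ┆ 3       │
└───────────┴────────┴─────────┴─────────┴─────────┘

The concat_list then subtracts scalar from each field_n column and glues the results back into a list.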
  • In case it's useful: you can `.to_struct("max_width")` and `.concat_list(...).list.eval(pl.element().drop_nulls())` to make it work for different lengths (see the sketch after these comments). By default `.to_struct()` takes the width of the first non-null row, which will truncate any longer results. – jqurious Aug 02 '23 at 22:45
  • Thanks @jqurious, I wasn't aware of that. That is useful indeed and an improvement to the solution! – Henry Harbeck Aug 03 '23 at 13:31
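
A sketch of jqurious's suggestion above, on a made-up frame with lists of different lengths (the "max_width" strategy pads shorter rows with null fields, which the drop_nulls then removes):

import polars as pl

df = pl.DataFrame({"list": [[2, 2], [3, 3, 3]], "scalar": [1, 2]})

(
    df
    .with_columns(
        # size the struct to the longest list, padding shorter lists with nulls
        pl.col("list").list.to_struct("max_width").alias("struct")
    )
    .unnest("struct")
    .select(
        "list",
        "scalar",
        pl.concat_list(pl.col(r"^field_\d+$") - pl.col("scalar"))
        # drop the nulls that the padding introduced
        .list.eval(pl.element().drop_nulls())
        .alias("diff"),
    )
)

This should give diff values of [1, 1] and [1, 1, 1] respectively.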