In (Py)Polars, is there a way to subset the list elements in a column of lists according to a list of indexes in another column? As far as I can tell, arr.get() accepts only an integer and does not accept expressions (like pl.col('prices').arr.get(pl.col('idxs').arr.first())).

Can I get something like this?

import polars as pl

df = pl.DataFrame(
    {'idxs': [[0], [1], [0, 2]],
     'prices': [[0.0, 3.5], [4.6, 0.0], [0.0, 7.8, 0.0]]}
)

(df
  .with_column(
       pl.col('prices').arr.get(pl.col('idxs')).alias('zero_prices')
  )
)

It can be solved by applying a Python UDF to pl.struct(pl.all()), like this:

def get_zero_prices(cols):
    # cols is one row as a dict; keep the prices whose position appears in idxs
    return [float(el) for i, el in enumerate(cols['prices']) if i in cols['idxs']]

(df
  .with_column(
       pl.struct(pl.all()).apply(get_zero_prices).alias('zero_prices')
  )
)

But this does not look very idiomatic.
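
For reference, the row-wise result being asked for can be sketched in plain Python, with no Polars involved. The helper name take_by_index is hypothetical and only used for illustration. Note one assumption: this version follows the order of the indexes (and repeats an element if an index repeats), whereas the membership test in the UDF above keeps the original order of the prices.

```python
def take_by_index(values, indexes):
    """Return the elements of `values` found at the positions in `indexes`."""
    return [values[i] for i in indexes]


# Same data as in the question, held as plain Python lists
rows = {
    "idxs": [[0], [1], [0, 2]],
    "prices": [[0.0, 3.5], [4.6, 0.0], [0.0, 7.8, 0.0]],
}

# Pair each list of prices with the list of indexes from the same row
zero_prices = [
    take_by_index(prices, idxs)
    for prices, idxs in zip(rows["prices"], rows["idxs"])
]
print(zero_prices)  # [[0.0], [0.0], [0.0, 0.0]]
```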

2 Answers

What you want is to be able to utilize the full expression API whilst operating on certain sub-elements or groups. That's what a groupby is!

So ideally we groom our DataFrame into a state where every group corresponds to the elements of one of our lists.

First we start with some data and then add a row_nr column that will represent our unique groups.

df = pl.DataFrame({
 "idx": [[0], [1], [0, 2]], 
 "array": [["a", "b"], ["c", "d"], ["e", "f", "g"]]
}).with_row_count("row_nr")
print(df)

shape: (3, 3)
┌────────┬───────────┬─────────────────┐
│ row_nr ┆ idx       ┆ array           │
│ ---    ┆ ---       ┆ ---             │
│ u32    ┆ list[i64] ┆ list[str]       │
╞════════╪═══════════╪═════════════════╡
│ 0      ┆ [0]       ┆ ["a", "b"]      │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1      ┆ [1]       ┆ ["c", "d"]      │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2      ┆ [0, 2]    ┆ ["e", "f", "g"] │
└────────┴───────────┴─────────────────┘

Next we explode by the "idx" column so that we can create the groups for our groupby.

df = df.explode("idx")
print(df)

shape: (4, 3)
┌────────┬─────┬─────────────────┐
│ row_nr ┆ idx ┆ array           │
│ ---    ┆ --- ┆ ---             │
│ u32    ┆ i64 ┆ list[str]       │
╞════════╪═════╪═════════════════╡
│ 0      ┆ 0   ┆ ["a", "b"]      │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1      ┆ 1   ┆ ["c", "d"]      │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2      ┆ 0   ┆ ["e", "f", "g"] │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2      ┆ 2   ┆ ["e", "f", "g"] │
└────────┴─────┴─────────────────┘

Finally we can apply the groupby and take the subelements for each list/group.


(df
 .groupby("row_nr")
 .agg([
     pl.col("array").first(),
     pl.col("idx"),
     pl.col("array").first().take(pl.col("idx")).alias("arr_taken")
 ])
)

This returns:

shape: (3, 4)
┌────────┬─────────────────┬───────────┬────────────┐
│ row_nr ┆ array           ┆ idx       ┆ arr_taken  │
│ ---    ┆ ---             ┆ ---       ┆ ---        │
│ u32    ┆ list[str]       ┆ list[i64] ┆ list[str]  │
╞════════╪═════════════════╪═══════════╪════════════╡
│ 0      ┆ ["a", "b"]      ┆ [0]       ┆ ["a"]      │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1      ┆ ["c", "d"]      ┆ [1]       ┆ ["d"]      │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2      ┆ ["e", "f", "g"] ┆ [0, 2]    ┆ ["e", "g"] │
└────────┴─────────────────┴───────────┴────────────┘
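
To make the mechanics of the three steps concrete, here is a minimal plain-Python sketch of what explode, groupby and take compute on this data. It is only an illustration of the logic, not how Polars implements it, and the variable names are made up for the example.

```python
from collections import defaultdict

# (row_nr, idx, array) tuples, mirroring the DataFrame after with_row_count
rows = [
    (0, [0], ["a", "b"]),
    (1, [1], ["c", "d"]),
    (2, [0, 2], ["e", "f", "g"]),
]

# "explode": one output row per element of idx, the array is repeated
exploded = [(row_nr, i, array) for row_nr, idx, array in rows for i in idx]

# "groupby": collect the indexes per row_nr and keep the first array of each group
idx_per_group = defaultdict(list)
first_array = {}
for row_nr, i, array in exploded:
    idx_per_group[row_nr].append(i)
    first_array.setdefault(row_nr, array)

# "take": pick the elements of each group's array at the collected indexes
arr_taken = {
    row_nr: [first_array[row_nr][i] for i in idxs]
    for row_nr, idxs in idx_per_group.items()
}
print(arr_taken)  # {0: ['a'], 1: ['d'], 2: ['e', 'g']}
```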
ritchie46

It's an old question, but you can now do this directly with the take expression of the list namespace:


df.with_columns(pl.col('array').list.take(pl.col('idx')))

shape: (3, 3)
┌────────┬───────────┬────────────┐
│ row_nr ┆ idx       ┆ array      │
│ ---    ┆ ---       ┆ ---        │
│ u32    ┆ list[i64] ┆ list[str]  │
╞════════╪═══════════╪════════════╡
│ 0      ┆ [0]       ┆ ["a"]      │
│ 1      ┆ [1]       ┆ ["d"]      │
│ 2      ┆ [0, 2]    ┆ ["e", "g"] │
└────────┴───────────┴────────────┘
elgreco