Can I get elements from a column of lists by a list of indexes?

Question

Is there a method in (Py)Polars to subset the list elements in a column of lists according to a list of indexes in another column? As far as I can tell, arr.get() accepts only an integer and does not accept expressions (like pl.col('prices').arr.get(pl.col('idxs').arr.first())).

Can I get something like:

df = pl.DataFrame(
{'idxs': [[0], [1], [0, 2]], 
 'prices': [[0.0, 3.5], [4.6, 0.0], [0.0, 7.8, 0.0]]}
)

(df
  .with_column(
       pl.col('prices').arr.get(pl.col('idxs')).alias('zero_prices') 

)
)

This can be solved by applying a Python UDF to `pl.struct(pl.all())`, like:

def get_zero_prices(cols):
    return [float(el) for i, el in enumerate(cols['prices']) if i in cols['idxs']]

(df
  .with_column(
       pl.struct(pl.all()).apply(lambda x: get_zero_prices(x)).alias('zero_prices') 

)
)

But this doesn't look very idiomatic.

score 0 · Accepted Answer · answered Sep 01 '22 at 13:48

What you want is to be able to utilize the full expression API whilst operating on certain sub-elements or groups. That's what a groupby is!

So ideally we groom our DataFrame into a state where every group corresponds to the elements of one of our lists.

First we start with some data and then add a row_nr column that will represent our unique groups.

import polars as pl

df = pl.DataFrame({
 "idx": [[0], [1], [0, 2]], 
 "array": [["a", "b"], ["c", "d"], ["e", "f", "g"]]
}).with_row_count("row_nr")
print(df)

shape: (3, 3)
┌────────┬───────────┬─────────────────┐
│ row_nr ┆ idx       ┆ array           │
│ ---    ┆ ---       ┆ ---             │
│ u32    ┆ list[i64] ┆ list[str]       │
╞════════╪═══════════╪═════════════════╡
│ 0      ┆ [0]       ┆ ["a", "b"]      │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1      ┆ [1]       ┆ ["c", "d"]      │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2      ┆ [0, 2]    ┆ ["e", "f", "g"] │
└────────┴───────────┴─────────────────┘

Next we explode by the "idx" column so that we can create the groups for our groupby.

df = df.explode("idx")
print(df)

shape: (4, 3)
┌────────┬─────┬─────────────────┐
│ row_nr ┆ idx ┆ array           │
│ ---    ┆ --- ┆ ---             │
│ u32    ┆ i64 ┆ list[str]       │
╞════════╪═════╪═════════════════╡
│ 0      ┆ 0   ┆ ["a", "b"]      │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1      ┆ 1   ┆ ["c", "d"]      │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2      ┆ 0   ┆ ["e", "f", "g"] │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2      ┆ 2   ┆ ["e", "f", "g"] │
└────────┴─────┴─────────────────┘

Finally we can apply the groupby and take the subelements for each list/group.


(df
 .groupby("row_nr")
 .agg([
     pl.col("array").first(),
     pl.col("idx"),
     pl.col("array").first().take(pl.col("idx")).alias("arr_taken")
 ])
)

This returns:

shape: (3, 4)
┌────────┬─────────────────┬───────────┬────────────┐
│ row_nr ┆ array           ┆ idx       ┆ arr_taken  │
│ ---    ┆ ---             ┆ ---       ┆ ---        │
│ u32    ┆ list[str]       ┆ list[i64] ┆ list[str]  │
╞════════╪═════════════════╪═══════════╪════════════╡
│ 0      ┆ ["a", "b"]      ┆ [0]       ┆ ["a"]      │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1      ┆ ["c", "d"]      ┆ [1]       ┆ ["d"]      │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2      ┆ ["e", "f", "g"] ┆ [0, 2]    ┆ ["e", "g"] │
└────────┴─────────────────┴───────────┴────────────┘
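
To make the three steps above easier to follow, here is a minimal sketch of the same logic in plain Python, without Polars. It is only an illustration of what row number, explode, and groupby-plus-take do to the sample data, not how Polars implements them; the names `rows`, `numbered`, `exploded`, and `result` are made up for this example.

```python
from collections import defaultdict

# Same sample data as above, written as plain Python rows (illustrative only).
rows = [
    {"idx": [0], "array": ["a", "b"]},
    {"idx": [1], "array": ["c", "d"]},
    {"idx": [0, 2], "array": ["e", "f", "g"]},
]

# Step 1: add a row number that identifies each original row.
numbered = [{"row_nr": n, **row} for n, row in enumerate(rows)]

# Step 2: "explode" idx so that every single index gets its own row.
exploded = [
    {"row_nr": r["row_nr"], "idx": i, "array": r["array"]}
    for r in numbered
    for i in r["idx"]
]

# Step 3: group by row_nr and take the element at each index.
arr_taken = defaultdict(list)
for r in exploded:
    arr_taken[r["row_nr"]].append(r["array"][r["idx"]])

result = [arr_taken[n] for n in range(len(rows))]
print(result)  # [['a'], ['d'], ['e', 'g']]
```

The Polars version does the same thing, but stays inside the expression API, so the work is not done row by row in Python.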

score 0 · Answer 2 · answered Jul 25 '23 at 22:00

It's an old question, but you can now do this directly with the take expression in the list namespace:


df.with_columns(pl.col('array').list.take(pl.col('idx')))

shape: (3, 3)
┌────────┬───────────┬────────────┐
│ row_nr ┆ idx       ┆ array      │
│ ---    ┆ ---       ┆ ---        │
│ u32    ┆ list[i64] ┆ list[str]  │
╞════════╪═══════════╪════════════╡
│ 0      ┆ [0]       ┆ ["a"]      │
│ 1      ┆ [1]       ┆ ["d"]      │
│ 2      ┆ [0, 2]    ┆ ["e", "g"] │
└────────┴───────────┴────────────┘
