3

I have a dataframe with two columns: the first contains lists and the second contains integer indexes. How can I get the element from each list at the index given in the second column? Better still, how can I put that element in a third column? For example, how do I get from this

a = pl.DataFrame([{'lst': [1, 2, 3], 'ind': 1}, {'lst': [4, 5, 6], 'ind': 2}])
┌───────────┬─────┐
│ lst       ┆ ind │
│ ---       ┆ --- │
│ list[i64] ┆ i64 │
╞═══════════╪═════╡
│ [1, 2, 3] ┆ 1   │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ [4, 5, 6] ┆ 2   │
└───────────┴─────┘

to this?

b = pl.DataFrame([{'lst': [1, 2, 3], 'ind': 1, 'list[ind]': 2}, {'lst': [4, 5, 6], 'ind': 2, 'list[ind]': 6}])
┌───────────┬─────┬───────────┐
│ lst       ┆ ind ┆ list[ind] │
│ ---       ┆ --- ┆ ---       │
│ list[i64] ┆ i64 ┆ i64       │
╞═══════════╪═════╪═══════════╡
│ [1, 2, 3] ┆ 1   ┆ 2         │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ [4, 5, 6] ┆ 2   ┆ 6         │
└───────────┴─────┴───────────┘

Thanks.

Kaster

3 Answers

8

Edit

As of Python Polars 0.14.24, this can be done more directly:

df.with_column(pl.col("lst").arr.get(pl.col("ind")).alias("list[ind]"))

Original answer

You can use with_row_count() to add a row count column for grouping, then explode() the list so that each list element gets its own row. Then call take() with the index column and use over() on the row count column, so the element is selected within each row's own subgroup.

df = pl.DataFrame({"lst": [[1, 2, 3], [4, 5, 6]], "ind": [1, 2]})

df = (
    df.with_row_count()
    .with_column(
        pl.col("lst").explode().take(pl.col("ind")).over(pl.col("row_nr")).alias("list[ind]")
    )
    .drop("row_nr")
)
shape: (2, 3)
┌───────────┬─────┬───────────┐
│ lst       ┆ ind ┆ list[ind] │
│ ---       ┆ --- ┆ ---       │
│ list[i64] ┆ i64 ┆ i64       │
╞═══════════╪═════╪═══════════╡
│ [1, 2, 3] ┆ 1   ┆ 2         │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ [4, 5, 6] ┆ 2   ┆ 6         │
└───────────┴─────┴───────────┘
cccs31
  • This is ingenious. Thanks. I'll test its performance and report back. – Kaster Oct 26 '22 at 07:40
  • Pandas solution (df['lst'].str[0]) does not make much sense but it is still much better than such a complicated solution :/ – the_economist Jan 21 '23 at 21:30
  • this was changed in polars - .arr was renamed to .list --> https://pola-rs.github.io/polars/py-polars/html/reference/expressions/list.html – genegc Aug 18 '23 at 16:24
2

Here is my approach:

Create a custom function that returns the element at the required index.

def get_elem(d):
    # d is [idx, lista]: the index first, then the list to look up
    sel_idx = d[0]
    return d[1][sel_idx]

Here is some test data:

df = pl.DataFrame({'lista':[[1,2,3],[4,5,6]],'idx':[1,2]})

Now create a struct from these two columns (each row is passed to the function as a dict) and apply the function above:

df.with_columns([
    pl.struct(['idx','lista']).apply(lambda x: get_elem(list(x.values()))).alias('req_elem')])
shape: (2, 3)
┌───────────┬─────┬──────────┐
│ lista     ┆ idx ┆ req_elem │
│ ---       ┆ --- ┆ ---      │
│ list[i64] ┆ i64 ┆ i64      │
╞═══════════╪═════╪══════════╡
│ [1, 2, 3] ┆ 1   ┆ 2        │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ [4, 5, 6] ┆ 2   ┆ 6        │
└───────────┴─────┴──────────┘
myamulla_ciencia
  • Thanks. Let's wait if it's possible using native polars api for performance reasons. My actual dataframe is pretty big so performance matters. – Kaster Oct 26 '22 at 06:12
0

If the number of unique index values isn't very large, you can build a when/then expression with one branch per unique index, where each branch calls get(idx) on the list column with that literal index:

import polars as pl

df = pl.DataFrame([{"lst": [1, 2, 3], "ind": 1}, {"lst": [4, 5, 6], "ind": 2}])

# create when/then expression for each unique index
idxs = df["ind"].unique()
ind, lst = pl.col("ind"), pl.col("lst") # makes expression generator look cleaner

expr = pl.when(ind == idxs[0]).then(lst.arr.get(idxs[0]))
for idx in idxs[1:]:
    expr = expr.when(ind == idx).then(lst.arr.get(idx))
expr = expr.otherwise(None)

df.select(expr)
shape: (2, 1)
┌─────┐
│ lst │
│ --- │
│ i64 │
╞═════╡
│ 2   │
├╌╌╌╌╌┤
│ 6   │
└─────┘
NedDasty