I am new to both Polars and Python in general. I have a somewhat unusual problem I could use some help with: I have a dataframe with 50-plus columns that are 0/1. I need to create a new column containing a comma-separated list of each column that contains a 1, but using only part of the column name. If hccx = 1, then append x to a string column. A simplified example:

import polars as pl

df = pl.DataFrame(
    {"id": [1, 2, 3], "hcc1": [0, 1, 1], "hcc2": [0, 0, 1], "hcc5": [0, 1, 1], "hcc8": [1, 0, 0]}
)
shape: (3, 5)
┌─────┬──────┬──────┬──────┬──────┐
│ id  ┆ hcc1 ┆ hcc2 ┆ hcc5 ┆ hcc8 │
│ --- ┆ ---  ┆ ---  ┆ ---  ┆ ---  │
│ i64 ┆ i64  ┆ i64  ┆ i64  ┆ i64  │
╞═════╪══════╪══════╪══════╪══════╡
│ 1   ┆ 0    ┆ 0    ┆ 0    ┆ 1    │
│ 2   ┆ 1    ┆ 0    ┆ 1    ┆ 0    │
│ 3   ┆ 1    ┆ 1    ┆ 1    ┆ 0    │
└─────┴──────┴──────┴──────┴──────┘

I want to create a new column (string type), hccall, that looks like the following:

id hccall
1 8
2 1,5
3 1,2,5

I imagine some type of list comprehension looping over the columns that start with 'hcc' would work, but I'm kind of stuck. I can create a loop, but I'm not sure how to append to the column from within the loop. Any slick ideas?

craigm
3 Answers

I think the easiest option is to first melt your dataframe so that you have one row per id per column. You can then filter the rows whose value equals 1 and aggregate the part of each column name after 'hcc' into a list. With list.join you can combine that list into a comma-separated string.

(
    df.melt(id_vars="id")
    .filter(pl.col("value") == 1)
    .groupby("id")
    .agg(pl.col("variable").str.slice(3))
    .with_columns(pl.col("variable").list.join(","))
)

shape: (3, 2)
┌─────┬──────────┐
│ id  ┆ variable │
│ --- ┆ ---      │
│ i64 ┆ str      │
╞═════╪══════════╡
│ 2   ┆ 1,5      │
│ 3   ┆ 1,2,5    │
│ 1   ┆ 8        │
└─────┴──────────┘
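One thing worth noting: groupby doesn't preserve the input row order (the result above comes back as 2, 3, 1), and an id whose hcc columns are all 0 is dropped by the filter entirely. Here is a minimal sketch of one way to handle both, assuming you join the aggregated result back onto the original frame (on newer Polars releases the same methods are spelled group_by and unpivot):

# the aggregation from above, kept as its own frame
flags = (
    df.melt(id_vars="id")
    .filter(pl.col("value") == 1)
    .groupby("id")
    .agg(pl.col("variable").str.slice(3))
    .with_columns(pl.col("variable").list.join(","))
)

# join back onto the original ids so the row order is preserved and ids
# with no hcc columns set still appear (with an empty string)
result = (
    df.select("id")
    .join(flags, on="id", how="left")
    .with_columns(pl.col("variable").fill_null(""))
    .rename({"variable": "hccall"})
)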
Mondo30003

Here's another way to do it, without the melt or the groupby:

(
    df.with_columns(
        hccall=pl.concat_list(
            # one expression per hcc column: ", <number>" when the flag is 1, "" otherwise
            pl.when(pl.col(x) == 1)
            .then(pl.lit(x.replace("hcc", ", ")))
            .otherwise(pl.lit(""))
            for x in df.columns
            if x[:3] == "hcc"
        )
        .list.join("")
        .str.replace("^, ", "")  # strip the leading ", "
    )
)

I can't figure out how to elegantly drop the nulls in the concat_list, so it's got the convoluted str.replace at the end. I think there is a succinct way to do that, but it escapes me for the moment.
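For what it's worth, here is a sketch of how this might be tidied up on a Polars version that ships list.drop_nulls (an assumption about your release): leave off the .otherwise(...) so columns whose flag is 0 contribute nulls, drop those from the list, and join on a plain comma.

(
    df.with_columns(
        hccall=pl.concat_list(
            # without .otherwise(), a column whose flag is 0 contributes a null
            pl.when(pl.col(x) == 1).then(pl.lit(x.replace("hcc", "")))
            for x in df.columns
            if x.startswith("hcc")
        )
        .list.drop_nulls()  # assumes list.drop_nulls() exists in your Polars version
        .list.join(",")
    )
)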

Dean MacGregor
  • This solution also works. Thank you! This is more along the lines of how I originally thought the solution would look. The melt solution was completely outside the box from how I was approaching the problem. I will study both solutions as I think they are great teaching tools for me. – craigm Aug 24 '23 at 23:05
  • Not sure if there is a simpler way other than `.list.eval(pl.element().filter(pl.element().is_not_null()))` - perhaps a `.list.drop_nulls()` could/should exist. – jqurious Aug 25 '23 at 02:24

Probably not an approach one would use, but I found it interesting:

columns = dict(enumerate(df.columns[1:]))  # list position -> column name, 'id' excluded

df.select(
    "id",
    hccall=pl.concat_list(pl.exclude("id") == 1)
    .list.eval(pl.element().arg_true().map_dict(columns).str.slice(3))
    .list.join(","),
)
shape: (3, 2)
┌─────┬────────┐
│ id  ┆ hccall │
│ --- ┆ ---    │
│ i64 ┆ str    │
╞═════╪════════╡
│ 1   ┆ 8      │
│ 2   ┆ 1,5    │
│ 3   ┆ 1,2,5  │
└─────┴────────┘
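To unpack the expression, here is my reading of the intermediate steps for a single row (id = 2 in the example frame):

# columns == {0: 'hcc1', 1: 'hcc2', 2: 'hcc5', 3: 'hcc8'}
#
# For id = 2:
#   pl.concat_list(pl.exclude('id') == 1)  -> [true, false, true, false]
#   pl.element().arg_true()                -> [0, 2]          (positions of the 1s)
#   .map_dict(columns)                     -> ["hcc1", "hcc5"]
#   .str.slice(3)                          -> ["1", "5"]      (drop the "hcc" prefix)
#   .list.join(',')                        -> "1,5"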
jqurious