
I am having trouble merging several columns into one. Say I have a dataframe (df) like the one below:

>> print(df)

shape: (3, 4)
┌─────┬───────┬───────┬───────┐
│ a   ┆ b_a_1 ┆ b_a_2 ┆ b_a_3 │
│ --- ┆ ---   ┆ ---   ┆ ---   │
│ i64 ┆ str   ┆ str   ┆ str   │
╞═════╪═══════╪═══════╪═══════╡
│ 1   ┆ a--   ┆       ┆       │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 1   ┆       ┆ b--   ┆       │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 1   ┆       ┆       ┆ c--   │
└─────┴───────┴───────┴───────┘

I want to merge the last three columns into one using python-polars. Passing the column names explicitly gives me what I want, but

>> out = df.select(pl.concat_str(['b_a_1', 'b_a_2', 'b_a_3']).alias('b_a'))
>> print(out)

shape: (3, 1)
┌─────┐
│ b_a │
│ --- │
│ str │
╞═════╡
│ a-- │
├╌╌╌╌╌┤
│ b-- │
├╌╌╌╌╌┤
│ c-- │
└─────┘

when I select the columns with a regex, I do not get the same result:

>> out = df.select(pl.concat_str('^b_a_\d$'))
>> print(out)

shape: (3, 3)
┌───────┬───────┬───────┐
│ b_a_1 ┆ b_a_2 ┆ b_a_3 │
│ ---   ┆ ---   ┆ ---   │
│ str   ┆ str   ┆ str   │
╞═══════╪═══════╪═══════╡
│ a--   ┆       ┆       │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│       ┆ b--   ┆       │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│       ┆       ┆ c--   │
└───────┴───────┴───────┘

and I get an empty frame when I run

>> out = df.select(pl.concat_str('^b_a_*$'))
>> print(out)

shape: (0, 0)
┌┐
╞╡
└┘

How am I to select the columns with regex and combine them into one?

Thank you very much for your time and suggestion.

Sincerely, Thi An

  • We need to adapt `concat_str` so that it expands the regex to an input. I will do that in the next release. – ritchie46 May 28 '22 at 05:26
  • The workaround I have been using is to compile the pattern first with `import re; patt = re.compile('^b_a_*')`, collect the matching columns with `cols = [c for c in df.columns if patt.search(c)]`, and then pass `cols` to `pl.concat_str(cols)`. – Thi An May 30 '22 at 02:52
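
The column-filtering step of that workaround can be tried out with the standard `re` module alone. The sketch below is only an illustration: the list `columns` stands in for `df.columns`, and the names `digit_pattern`, `star_pattern`, and `no_cols` are made up for this example. It also shows why the anchored pattern `^b_a_*$` from the question selects nothing: the `*` repeats the underscore before it rather than acting as a wildcard.

```python
import re

# Column names copied from the question's frame (stand-in for df.columns).
columns = ["a", "b_a_1", "b_a_2", "b_a_3"]

# `\d` matches a single digit, so this selects the three b_a_N columns.
digit_pattern = re.compile(r"^b_a_\d$")
cols = [c for c in columns if digit_pattern.search(c)]
print(cols)

# In `^b_a_*$` the `*` means "zero or more underscores"; combined with the
# `$` anchor, no column name in the list can match.
star_pattern = re.compile(r"^b_a_*$")
no_cols = [c for c in columns if star_pattern.search(c)]
print(no_cols)
```

The unanchored pattern in the comment (`^b_a_*` without `$`) does find the columns with `search`, because it only has to match a prefix of each name. The resulting `cols` list can then be handed to `pl.concat_str(cols)` as described.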

1 Answer


`polars.concat_str` currently returns null for a row whenever any of the strings being joined is null. A possible workaround is to call `.fill_null("")` on the relevant columns first, so that nulls are replaced by empty strings before the concatenation.

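import polars as pl

(
    df.select(
        # pl.col with a ^...$ pattern expands to every matching column;
        # nulls become "" before the row-wise concatenation.
        pl.concat_str(pl.col(r"^b_a_\d$").fill_null("")).alias("b_a")
    )
)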

shape: (3, 1)
┌─────┐
│ b_a │
│ --- │
│ str │
╞═════╡
│ a-- │
├╌╌╌╌╌┤
│ b-- │
├╌╌╌╌╌┤
│ c-- │
└─────┘