3

Suppose I have a simple dataframe as manually generated by the code below:

cols=['a','b','c']
values=['d','e','f']
df=(pl.DataFrame({cols[i]:[values[i]]*3 for i in range(len(cols))})
    .with_columns(pl.lit(pl.Series(['a,b','b,c','a,c']))
                  .alias('Columns to Concatenate'))
   )

Which produces a table as below:

a b c Columns to Concatenate
d e f a,b
d e f b,c
d e f a,c

How would I concatenate all columns as described in the 'Columns to Concatenate' column in order to produce a result like below:

a b c Columns to Concatenate Concatenated Column String
d e f a,b de
d e f b,c ef
d e f a,c df

I've attempted to do it as such:

(df.with_columns(
    pl.concat_str(pl.col('Columns to Concatenate').str.split(','))
    .alias('Concatenated Column String'))
)

Which I'm pretty sure is not the correct way of doing this, and is returning a

ComputeError: Cannot cast list type

Would appreciate some pointers on how to do this in an idiomatic and fast way without having to resort to a row-wise lambda function.

sjs
  • 53
  • 3

3 Answers3

1

This uses list comprehension:

import polars as pl

df = pl.DataFrame({ 'a': ['d', 'd', 'd'],
                    'b': ['e', 'e', 'e'],
                    'c': ['f', 'f', 'f'],
                    'Columns to Concatenate': ['a,b', 'b,c', 'a,b,c']})



vals = ["".join([df.get_column(col)[i] for col in col_list]) for i, col_list in enumerate(df.get_column('Columns to Concatenate').str.split(','))]
df = df.hstack([pl.Series('Concatenated Column String', vals)])
print(df)

Output:

shape: (3, 5)
┌─────┬─────┬─────┬────────────────────────┬────────────────────────────┐
│ a   ┆ b   ┆ c   ┆ Columns to Concatenate ┆ Concatenated Column String │
│ --- ┆ --- ┆ --- ┆ ---                    ┆ ---                        │
│ str ┆ str ┆ str ┆ str                    ┆ str                        │
╞═════╪═════╪═════╪════════════════════════╪════════════════════════════╡
│ d   ┆ e   ┆ f   ┆ a,b                    ┆ de                         │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ d   ┆ e   ┆ f   ┆ b,c                    ┆ ef                         │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ d   ┆ e   ┆ f   ┆ a,b,c                  ┆ def                        │
└─────┴─────┴─────┴────────────────────────┴────────────────────────────┘

Note:

Here is the same thing but as a one-liner:

df = df.hstack([pl.Series('Concatenated Column String', ["".join([df.get_column(col)[i] for col in col_list]) for i, col_list in enumerate(df.get_column('Columns to Concatenate').str.split(','))])])
ScottC
  • 3,941
  • 1
  • 6
  • 20
0

It's not clear if the a/b/c columns change (the way you construct the initial df seems like they're fixed but not sure if that's just a simplification) or if they're fixed so the following assumes that they're dynamic. Also, I don't know if you need to get the original order back so the following also assumes that you do.

(    df 
    # add column to preserve order
    .with_row_count('myid') 
    # melt the df so that the column names are now values that can be used in expressions
    .melt(id_vars=['Columns to Concatenate','myid']) 
    # split the string into a list
    .with_column(pl.col('Columns to Concatenate').str.split(',').alias('cols')) 
    # make a column to keep the original order of the cols using a hack
    # The hack makes a condition where essentially 1=1 and then take the cumulative sum to make relative index position
    .with_column(pl.col('cols').arr.eval((pl.element()==pl.element()).cumsum()).alias('colorder'))
    # convert the list values into a row per list item
    .explode(['cols','colorder'])
    # sort by myid and colorder
    .sort(['myid','colorder'])
    # only keep instances when desired value matches
    .filter(pl.col('variable')==pl.col('cols')) 
    # group by the original df "key"
    .groupby(['Columns to Concatenate','myid']) 
    # smash all the desired values together
    .agg(pl.col('value').str.concat(delimiter="").alias("Concatenated Column String"))
    # join to the original df to get the a/b/c columns back 
    .join(df.with_row_count('myid'), on=['Columns to Concatenate','myid'])
    # put the rows in the order that they started
    .sort('myid') 
    # put the columns in the order they started with the new column at the end
    .select([pl.col(x) for x in df.columns] + [pl.col('Concatenated Column String')]) 
)
Dean MacGregor
  • 11,847
  • 9
  • 34
  • 72
0
source = "Columns to Concatenate"                                                                         
result = "Concatenated Column String" 
# get names of str type columns
names = df.select( 
   pl.col(pl.Utf8).exclude(source)
).columns      
                                                                  
(                                                                                                                 
   df                                                                                                             
   .with_row_count()                                                                                              
   .with_column(                                                                                                  
      pl.col("Columns to Concatenate")
        .str.split(",").alias("name"))                                                              
   .explode("name")                                                                                               
   .with_column(                                                                                                  
      pl.coalesce(                                                                                                
         pl.when(pl.col("name") == name)                                                                          
           .then(pl.col(name))                                                                                    
         for name in names)                                                                                  
      .alias(result))                                                                                             
   .groupby("row_nr")                                                                                             
   .agg([                                                                                                         
      pl.exclude(result).first(),                                                                                 
      pl.col(result).str.concat(""),                                                                              
   ])                                                                                                             
   .drop(["row_nr", "name"])                                                                                      
)                                                                                                                 
shape: (3, 5)                                                                                                     
┌─────┬─────┬─────┬────────────────────────┬────────────────────────────┐                                         
│ a   | b   | c   | Columns to Concatenate | Concatenated Column String │                                         
│ --- | --- | --- | ---                    | ---                        │                                         
│ str | str | str | str                    | str                        │                                         
╞═════╪═════╪═════╪════════════════════════╪════════════════════════════╡                                         
│ d   | e   | f   | a,b                    | de                         │                                         
├─────┼─────┼─────┼────────────────────────┼────────────────────────────┤                                         
│ d   | e   | f   | b,c                    | ef                         │                                         
├─────┼─────┼─────┼────────────────────────┼────────────────────────────┤                                         
│ d   | e   | f   | a,c                    | df                         │                                         
└─//──┴─//──┴─//──┴─//─────────────────────┴─//─────────────────────────┘
jqurious
  • 9,953
  • 1
  • 4
  • 14