
I originally used the code below with a standard pandas DataFrame. I switched to a pyspark.pandas DataFrame once the data grew, but I've been unable to make this groupby work on it. I've also tried to replicate it on a Spark DataFrame using Spark functions, but my knowledge there is limited, so I haven't had any luck. Any tips/advice would be much appreciated.

df1 = df1.groupby(['FISCAL_YEAR', 'FISCAL_MONTH', 'FISCAL_WEEK', 'ORDER_NUMBER', 'LINE_TYPE'], as_index=False).agg({
    'DEPARTMENT': lambda x: ' | '.join(sorted(x.unique())),
    'Dept_Subdept': lambda x: ' | '.join(sorted(x.unique())),
    'Demand': 'sum',
    'COGS': 'sum',
    'Units': 'sum'
})

ValueError: aggs must be a dict mapping from column name to aggregate functions (string or list of strings).

cmb66
  • example of the data ? – D.L Sep 13 '22 at 15:54
  • Lambdas won't work here. I think you need a combination of [collect_set](https://spark.apache.org/docs/3.3.0/api/sql/index.html#collect_set), [array_sort](https://spark.apache.org/docs/3.3.0/api/sql/index.html#array_sort) and [concat_ws](https://spark.apache.org/docs/3.3.0/api/sql/index.html#concat_ws) – werner Sep 13 '22 at 16:10
  • https://stackoverflow.com/questions/62664691/pyspark-using-agg-to-concat-string-after-groupby – BeRT2me Sep 13 '22 at 16:20

1 Answer

import pyspark.sql.functions as F

df1 = (df1.groupby(['FISCAL_YEAR', 
                    'FISCAL_MONTH', 
                    'FISCAL_WEEK',
                    'ORDER_NUMBER', 
                    'LINE_TYPE'])
          .agg(F.concat_ws(' | ', F.collect_set('DEPARTMENT')).alias('DEPARTMENTS'),
               F.concat_ws(' | ', F.collect_set('Dept_Subdept')).alias('Dept_Subdepts'),
               F.sum('Demand').alias('Total_Demand'),
               F.sum('COGS').alias('Total_COGS'),
               F.sum('Units').alias('Total_Units')))
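
If sorted output matters (werner's comment above suggests array_sort, and the follow-up below notes the values aren't sorted), wrapping the collected set in F.array_sort before concatenating should presumably work. A minimal sketch, untested against your data:

import pyspark.sql.functions as F

# Sketch: sort each collected set of distinct values before joining with ' | '
df1 = (df1.groupby(['FISCAL_YEAR',
                    'FISCAL_MONTH',
                    'FISCAL_WEEK',
                    'ORDER_NUMBER',
                    'LINE_TYPE'])
          .agg(F.concat_ws(' | ', F.array_sort(F.collect_set('DEPARTMENT'))).alias('DEPARTMENTS'),
               F.concat_ws(' | ', F.array_sort(F.collect_set('Dept_Subdept'))).alias('Dept_Subdepts'),
               F.sum('Demand').alias('Total_Demand'),
               F.sum('COGS').alias('Total_COGS'),
               F.sum('Units').alias('Total_Units')))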
BeRT2me
  • They're not sorted, but hopefully this gives you something to start with~ – BeRT2me Sep 13 '22 at 16:34
  • Appreciate the help. Unfortunately I'm getting this error: 'DataFrame' object has no attribute 'to_spark' – cmb66 Sep 13 '22 at 17:48
  • Oh, then you're not using the `pyspark.pandas` api like I assumed. I removed the parts specific to that. – BeRT2me Sep 13 '22 at 19:07
  • Worked. Thanks a ton. Now one more question: what if I wanted to apply a function(r) to df1? Typically, I'd define the function and then df1['NewCol'] = df1.apply(function, axis=1), but I don't believe that will work in this instance. – cmb66 Sep 13 '22 at 20:38
  • I think you'll have to figure out [udf](https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.sql.functions.udf.html) – BeRT2me Sep 13 '22 at 21:12
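
To follow up on that last exchange: in Spark, a pandas-style df.apply(function, axis=1) is usually replaced by built-in column expressions or, failing that, a udf. A rough sketch, assuming a simple string-returning function; the function body and column choices here are made up purely for illustration:

import pyspark.sql.functions as F
from pyspark.sql.types import StringType

# Hypothetical row-wise function, analogous to df.apply(function, axis=1) in pandas.
def label_order(line_type, total_units):
    return f"{line_type}-{'BULK' if total_units > 100 else 'STANDARD'}"

label_order_udf = F.udf(label_order, StringType())

# Apply the udf column-wise instead of row-wise apply
df1 = df1.withColumn('NewCol', label_order_udf(F.col('LINE_TYPE'), F.col('Total_Units')))

Where possible, plain column expressions (F.when, F.concat, arithmetic on columns) will be faster than a Python udf, since udfs serialize each row out to Python.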