
I originally used the code below with a standard pandas DataFrame. I switched to a pyspark.pandas DataFrame once the data grew, but I've been unable to make this groupby work on it. I've also tried to replicate it on a Spark DataFrame using Spark functions, but my knowledge there is limited, so I haven't had any luck. Any tips/advice would be much appreciated.

df1 = df1.groupby(['FISCAL_YEAR', 'FISCAL_MONTH', 'FISCAL_WEEK', 'ORDER_NUMBER', 'LINE_TYPE'], as_index=False).agg({
    'DEPARTMENT': lambda x: ' | '.join(sorted(x.unique())),
    'Dept_Subdept': lambda x: ' | '.join(sorted(x.unique())),
    'Demand': 'sum',
    'COGS': 'sum',
    'Units': 'sum'
})

ValueError: aggs must be a dict mapping from column name to aggregate functions (string or list of strings).

cmb66
  • example of the data ? – D.L Sep 13 '22 at 15:54
  • Lambdas won't work here. I think you need a combination of [collect_set](https://spark.apache.org/docs/3.3.0/api/sql/index.html#collect_set), [array_sort](https://spark.apache.org/docs/3.3.0/api/sql/index.html#array_sort) and [concat_ws](https://spark.apache.org/docs/3.3.0/api/sql/index.html#concat_ws) – werner Sep 13 '22 at 16:10
  • https://stackoverflow.com/questions/62664691/pyspark-using-agg-to-concat-string-after-groupby – BeRT2me Sep 13 '22 at 16:20

1 Answer

import pyspark.sql.functions as F

df1 = (df1.groupby(['FISCAL_YEAR', 
                    'FISCAL_MONTH', 
                    'FISCAL_WEEK',
                    'ORDER_NUMBER', 
                    'LINE_TYPE'])
          .agg(F.concat_ws(' | ', F.collect_set('DEPARTMENT')).alias('DEPARTMENTS'),
               F.concat_ws(' | ', F.collect_set('Dept_Subdept')).alias('Dept_Subdepts'),
               F.sum('Demand').alias('Total_Demand'),
               F.sum('COGS').alias('Total_COGS'),
               F.sum('Units').alias('Total_Units')))
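
If sorted output matters (werner's comment above suggests array_sort, and the follow-up below notes the values aren't sorted), wrapping the collected set in F.array_sort before concatenating should presumably work. A minimal sketch, untested against your data:

import pyspark.sql.functions as F

# Sketch: sort each collected set of distinct values before joining with ' | '
df1 = (df1.groupby(['FISCAL_YEAR',
                    'FISCAL_MONTH',
                    'FISCAL_WEEK',
                    'ORDER_NUMBER',
                    'LINE_TYPE'])
          .agg(F.concat_ws(' | ', F.array_sort(F.collect_set('DEPARTMENT'))).alias('DEPARTMENTS'),
               F.concat_ws(' | ', F.array_sort(F.collect_set('Dept_Subdept'))).alias('Dept_Subdepts'),
               F.sum('Demand').alias('Total_Demand'),
               F.sum('COGS').alias('Total_COGS'),
               F.sum('Units').alias('Total_Units')))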
BeRT2me
  • They're not sorted, but hopefully this gives you something to start with~ – BeRT2me Sep 13 '22 at 16:34
  • Appreciate the help. Unfortunately I'm getting this error: 'DataFrame' object has no attribute 'to_spark' – cmb66 Sep 13 '22 at 17:48
  • Oh, then you're not using the `pyspark.pandas` api like I assumed. I removed the parts specific to that. – BeRT2me Sep 13 '22 at 19:07
  • Worked. Thanks a ton. Now one more question: what if I wanted to apply a function(r) to df1? Typically, I'd define the function and then df1['NewCol'] = df1.apply(function, axis=1), but I don't believe that will work in this instance. – cmb66 Sep 13 '22 at 20:38
  • I think you'll have to figure out [udf](https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.sql.functions.udf.html) – BeRT2me Sep 13 '22 at 21:12
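
To follow up on that last exchange: in Spark, a pandas-style df.apply(function, axis=1) is usually replaced by built-in column expressions or, failing that, a udf. A rough sketch, assuming a simple string-returning function; the function body and column choices here are made up purely for illustration:

import pyspark.sql.functions as F
from pyspark.sql.types import StringType

# Hypothetical row-wise function, analogous to df.apply(function, axis=1) in pandas.
def label_order(line_type, total_units):
    return f"{line_type}-{'BULK' if total_units > 100 else 'STANDARD'}"

label_order_udf = F.udf(label_order, StringType())

# Apply the udf column-wise instead of row-wise apply
df1 = df1.withColumn('NewCol', label_order_udf(F.col('LINE_TYPE'), F.col('Total_Units')))

Where possible, plain column expressions (F.when, F.concat, arithmetic on columns) will be faster than a Python udf, since udfs serialize each row out to Python.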