
Can someone help me with PySpark, using both the aggregate and groupBy functions? I have built my dataframes and applied filters and selects to get the data I want. However, I am now stuck trying to aggregate things correctly.

Currently, my code outputs the content below:

+----------+-----------+--------------+---------------+----------+---------+
|l_orderkey|o_orderdate|o_shippriority|l_extendedprice|l_discount|      rev|
+----------+-----------+--------------+---------------+----------+---------+
|     53634| 1995-02-22|             0|       20517.44|      0.08|18876.045|
|    265539| 1995-01-25|             0|       70423.08|      0.01| 69718.85|
|    331590| 1994-12-10|             0|       46692.75|      0.03| 45291.97|
|    331590| 1994-12-10|             0|        37235.1|       0.1| 33511.59|
|    420545| 1995-03-05|             0|        75542.1|      0.04|72520.414|
|    420545| 1995-03-05|             0|         1062.0|      0.07|987.66003|
|    420545| 1995-03-05|             0|        9729.45|       0.1| 8756.505|
|    420545| 1995-03-05|             0|        15655.6|      0.04|15029.375|
|    420545| 1995-03-05|             0|         3121.3|      0.03|3027.6611|
|    420545| 1995-03-05|             0|        71723.0|      0.03| 69571.31|
|    488928| 1995-02-15|             0|        1692.77|      0.01|1675.8423|
|    488928| 1995-02-15|             0|       22017.84|      0.01|21797.662|
|    488928| 1995-02-15|             0|       57100.42|      0.04|54816.402|
|    488928| 1995-02-15|             0|        3807.76|      0.05| 3617.372|
|    488928| 1995-02-15|             0|       73332.52|      0.01|72599.195|
|    510754| 1994-12-21|             0|       41171.78|      0.09| 37466.32|
|    512422| 1994-12-26|             0|       87251.56|      0.07| 81143.95|
|    677761| 1994-12-26|             0|       60123.34|       0.0| 60123.34|
|    956646| 1995-03-07|             0|       61853.68|      0.05|58760.996|
|   1218886| 1995-02-13|             0|        24844.0|      0.01| 24595.56|
+----------+-----------+--------------+---------------+----------+---------+

I wish to group by l_orderkey and aggregate rev as a sum.

Here is my most recent attempt, with t being the dataframe and F being the pyspark.sql functions module (from pyspark.sql import functions as F):

(t.groupby(t.l_orderkey, t.o_orderdate, t.o_shippriority)
  .agg(F.collect_set(sum(t.rev)), F.collect_set(t.l_orderkey))
  .show())

Can someone tell me whether I'm on the right track? I keep getting "Column is not iterable".

Wispers

1 Answer

# group by l_orderkey and sum rev within each group
total_rev = t.groupby(t.l_orderkey).agg(F.sum(t.rev).alias('total_rev'))

# show the first rows of the result
total_rev.show()

This would give you a new dataframe with the columns l_orderkey and total_rev, where total_rev stores the aggregated sum of rev for each order key.

You use collect_set when you want to remove duplicates, i.e. collect the distinct values of a column into a set; it isn't needed for a simple sum.
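For illustration only, a minimal sketch of what collect_set is typically used for (the column choice here is my assumption, not something from your question): gathering the distinct o_orderdate values seen per l_orderkey.

# hypothetical example: distinct order dates per order key
distinct_dates = t.groupby(t.l_orderkey).agg(F.collect_set(t.o_orderdate).alias('order_dates'))
distinct_dates.show()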

You are also getting "Column is not iterable" because you are using the built-in Python function sum rather than the Spark function F.sum.
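If you do still want to keep o_orderdate and o_shippriority in the group key, as in your original attempt, a sketch of the corrected call would look like this (swapping the built-in sum for F.sum and dropping the collect_set calls; the name rev_by_order is just a placeholder):

# group by all three columns and sum rev within each group
rev_by_order = (t.groupby(t.l_orderkey, t.o_orderdate, t.o_shippriority)
                .agg(F.sum(t.rev).alias('rev')))  # F.sum, not Python's built-in sum
rev_by_order.show()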

ggordon