
I'm having a little difficulty further manipulating a PySpark pivot table to get a reduced result. My data is a little more complex than the example below, but it's the best example I can come up with to illustrate what I'm trying to do:

Let's say I have a table as follows:

stock  day  time  hour  price  units  price_x_units
A      1    1:10  1     2.1    3      6.3
A      1    1:15  1     2.0    4      8.0
A      2    1:03  1     2.2    2      4.4
A      2    2:45  2     1.0    4      4.0
B      1    2:12  2     1.2    5      6.0
...    ...  ...   ...   ...    ...    ...

The data above represents stock prices that fluctuate during the day, and the number of units bought at specific times during that day. I can pivot this as follows:

pivotdf = df.groupby("stock", "hour").pivot("day").sum("price_x_units", "units")

To give me

(stock,hour) \ day   1          2          3    4    ...
(A,1)                (14.3, 7)  (4.4, 2)   ...  ...
(A,2)                           (4.0, 4)
(B,2)                (6.0, 5)
...                  ...        ...        ...  ...

However, I want to perform the calculation "price_x_units" / "units" for each resulting cell:

(stock,hour) \ day   1      2     3    4    ...
(A,1)                2.04   2.2   ...  ...
(A,2)                       1.0
(B,2)                1.2
...                  ...    ...   ...  ...

Further to this, I then want to aggregate (sum) away the stock to give me:

hour \ day   1      2     3    4
1            2.04   2.2
2            1.2    1.0

How exactly do I do this? Thanks.

zenith7
  • `pivotdf = df.groupby("stock", "hour").pivot("day").agg((col("price_x_units") / col("units")).alias("ratio"))` `result = pivotdf.groupby("hour").agg(*[sum(col(str(day))).alias(str(day)) for day in range(1, 5)])` `result.show()` Should do the trick for you – Raky Aug 22 '23 at 07:03
  • Thanks @Raky. I tried that but the first command gives `AnalysisException: Aggregate expression required for pivot, but 'price_x_units' did not appear in any aggregate function.` – zenith7 Aug 22 '23 at 08:03

2 Answers


After the pivot, you could compute the division for each day like this:

from pyspark.sql import functions as F

# pivoted_df is the pivoted frame from the question; after the pivot it has
# columns like '1_sum(price_x_units)' and '1_sum(units)' for each day.
# extract the days that are present in the df:
days = sorted({c.split('_')[0] for c in pivoted_df.columns[2:]}, key=int)
# you can also just use range(1, 32) if you are sure that all the days are in the df

pivoted_df\
    .select('stock', 'hour',
        *[(F.col(d + '_sum(price_x_units)') / F.col(d + '_sum(units)')).alias(d) for d in days]
    ).show()
+-----+----+-----------------+----+
|stock|hour|                1|   2|
+-----+----+-----------------+----+
|    A|   1|2.042857142857143| 2.2|
|    A|   2|             null| 1.0|
|    B|   2|              1.2|null|
+-----+----+-----------------+----+
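
For the last step in the question (summing the per-cell ratios across stocks for each hour), a sketch along these lines should work on top of the frame above, assuming the select is assigned to a variable first (the name ratio_df is only for illustration):

# assign the per-day ratio frame instead of only showing it
ratio_df = pivoted_df.select(
    'stock', 'hour',
    *[(F.col(d + '_sum(price_x_units)') / F.col(d + '_sum(units)')).alias(d) for d in days]
)

# sum away the stock dimension: one row per hour, one column per day
ratio_df.groupBy('hour')\
    .agg(*[F.sum(d).alias(d) for d in days])\
    .orderBy('hour')\
    .show()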
Oli

You can try this:

from pyspark.sql import SparkSession, functions as F

# create / reuse a SparkSession
spark = SparkSession.builder.getOrCreate()

data = [('A', 1, '1:10', 1, 2.1, 3, 6.3),
        ('A', 1, '1:15', 1, 2.0, 4, 8.0),
        ('A', 2, '1:03', 1, 2.2, 2, 4.4),
        ('A', 2, '2:45', 2, 1.0, 4, 4.0),
        ('B', 1, '2:12', 2, 1.2, 5, 6.0)]

df = spark.createDataFrame(data, ["stock", "day", "time", "hour", "price", "units", "price_x_units"])

# ratio of sums per (stock, hour, day) cell, e.g. 14.3 / 7 = 2.04 for stock A, hour 1, day 1
pivotdf = df.groupby("stock", "hour").pivot("day").agg(
    (F.sum("price_x_units") / F.sum("units")).alias("ratio")
)

pivotdf.show()

# sum the per-cell ratios across stocks, keeping one row per hour and one column per day
day_cols = [c for c in pivotdf.columns if c not in ("stock", "hour")]
result = pivotdf.groupBy("hour").agg(*[F.sum(c).alias(c) for c in day_cols]).orderBy("hour")

result.show()
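
The agg above takes the ratio of sums per cell, i.e. a volume-weighted average price, which is what the 14.3 / 7 ≈ 2.04 figure in the question corresponds to. If an unweighted average of the row prices were wanted instead, a sketch like this would do (the mean_pivot name is only for illustration):

# plain average of the per-row prices in each (stock, hour, day) cell
mean_pivot = df.groupby("stock", "hour").pivot("day").agg(F.avg("price").alias("mean_price"))
mean_pivot.show()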

Aymen Azoui