I'm having a little difficulty further manipulating a PySpark pivot table to get a reduced result. My data is more complex than the example below, but this is the simplest illustration I can come up with of what I'm trying to do:
Let's say I have a table as follows:
stock | day | time | hour | price | units | price_x_units |
---|---|---|---|---|---|---|
A | 1 | 1:10 | 1 | 2.1 | 3 | 6.3 |
A | 1 | 1:15 | 1 | 2.0 | 4 | 8.0 |
A | 2 | 1:03 | 1 | 2.2 | 2 | 4.4 |
A | 2 | 2:45 | 2 | 1.0 | 4 | 4.0 |
B | 1 | 2:12 | 2 | 1.2 | 5 | 6.0 |
... | ... | ... | ... | ... | ... | ... |
So the data above represents stock prices that fluctuate during the day, and the number of units bought at specific times on that day. I can pivot this as follows:
pivotdf = df.groupby("stock", "hour").pivot("day").sum("price_x_units", "units")
This gives me:
(stock,hour) \ day | 1 | 2 | 3 | 4 ... |
---|---|---|---|---|
(A,1) | (14.3, 7) | (4.4, 2) | ... | ... |
(A,2) | | (4.0, 4) | ... | ... |
(B,2) | (6.0, 5) | | ... | ... |
... | ... | ... | ... | ... |
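
For reference, here's a minimal construction of the example frame so the snippets below are runnable. Note that the real pivoted output has flat column names (something like `1_sum(price_x_units)`, depending on the Spark version) rather than tuples; the tuples above are just my shorthand.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy frame mirroring the table above
df = spark.createDataFrame(
    [("A", 1, "1:10", 1, 2.1, 3, 6.3),
     ("A", 1, "1:15", 1, 2.0, 4, 8.0),
     ("A", 2, "1:03", 1, 2.2, 2, 4.4),
     ("A", 2, "2:45", 2, 1.0, 4, 4.0),
     ("B", 1, "2:12", 2, 1.2, 5, 6.0)],
    ["stock", "day", "time", "hour", "price", "units", "price_x_units"],
)

pivotdf = df.groupby("stock", "hour").pivot("day").sum("price_x_units", "units")
pivotdf.show()
```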
However, I want to perform the calculation "price_x_units" / "units" for each resulting cell (see the sketch after this table):
(stock,hour) \ day | 1 | 2 | 3 | 4 ... |
---|---|---|---|---|
(A,1) | 2.04 | 2.2 | ... | ... |
(A,2) | | 1.0 | ... | ... |
(B,2) | 1.2 | | ... | ... |
... | ... | ... | ... | ... |
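
A rough sketch of one way I imagine doing this, folding the division into the aggregation so each pivoted cell is already sum(price_x_units) / sum(units); I'm not sure whether this is the right approach:

```python
import pyspark.sql.functions as F

# Each pivoted cell becomes sum(price_x_units) / sum(units)
# for that (stock, hour) row and day column
ratio_pivot = (
    df.groupby("stock", "hour")
      .pivot("day")
      .agg((F.sum("price_x_units") / F.sum("units")).alias("wavg"))
)
ratio_pivot.show()
```

With a single aggregation expression the pivoted columns seem to come out named simply by the day values (`1`, `2`, ...), which looks easier to work with afterwards.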
Further to this, I then want to aggregate (sum) away the stock (again, see the sketch after this table) to give me:
hour\day | 1 | 2 | 3 | 4 |
---|---|---|---|---|
1 | 2.04 | 2.2 | | |
2 | 1.2 | 1.0 | | |
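
For this last step, the best sketch I can come up with is to compute the per-(stock, hour, day) ratio first and only then pivot, so the stock can be summed away; again, not sure this is the right way:

```python
import pyspark.sql.functions as F

# Ratio per (stock, hour, day), before any pivoting
per_stock = (
    df.groupby("stock", "hour", "day")
      .agg((F.sum("price_x_units") / F.sum("units")).alias("ratio"))
)

# Sum the per-stock ratios for each (hour, day), then spread day across columns
result = per_stock.groupby("hour").pivot("day").sum("ratio")
result.show()
```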
Is this the right way to go about it, or is there a cleaner approach? Thanks.