
I'm having a little difficulty further manipulating a PySpark pivot table to get a reduced result. My data is a little more complex than the example below, but it's the best example I can come up with to illustrate what I'm trying to do:

Let's say I have a table as follows:

stock  day  time  hour  price  units  price_x_units
A      1    1:10  1     2.1    3      6.3
A      1    1:15  1     2.0    4      8.0
A      2    1:03  1     2.2    2      4.4
A      2    2:45  2     1.0    4      4.0
B      1    2:12  2     1.2    5      6.0
...    ...  ...   ...   ...    ...    ...

The data above represents stock prices that fluctuate during the day, and the number of units bought at specific times during that day. I can pivot this as follows:

pivotdf = df.groupby("stock", "hour").pivot("day").sum("price_x_units", "units")

To give me

(stock,hour) \ day   1          2          3    4    ...
(A,1)                (14.3, 7)  (4.4, 2)   ...  ...
(A,2)                           (4.0, 4)
(B,2)                (6.0, 5)
...                  ...        ...        ...  ...

However, I want to perform the calculation "price_x_units" / "units" for each resulting cell:

(stock,hour) \ day   1      2     3    4    ...
(A,1)                2.04   2.2   ...  ...
(A,2)                       1.0
(B,2)                1.2
...                  ...    ...   ...  ...

Further to this, I then want to aggregate (sum) away the stock to give me:

hour \ day   1      2     3    4
1            2.04   2.2
2            1.2    1.0

How exactly do I do this? Thanks.

zenith7
  • `pivotdf = df.groupby("stock", "hour").pivot("day").agg((col("price_x_units") / col("units")).alias("ratio"))` `result = pivotdf.groupby("hour").agg(*[sum(col(str(day))).alias(str(day)) for day in range(1, 5)])` `result.show()` Should do the trick for you – Raky Aug 22 '23 at 07:03
  • Thanks @Raky. I tried that but the first command gives `AnalysisException: Aggregate expression required for pivot, but 'price_x_units' did not appear in any aggregate function.` – zenith7 Aug 22 '23 at 08:03

2 Answers


After the pivot, you could compute the division for each day like this:

from pyspark.sql import functions as F

# pivoted_df is the pivoted frame from the question; after the pivot it has
# columns like '1_sum(price_x_units)' and '1_sum(units)' for each day.
# extract the days that are present in the df:
days = sorted({c.split('_')[0] for c in pivoted_df.columns[2:]}, key=int)
# you can also just use range(1, 32) if you are sure that all the days are in the df

pivoted_df\
    .select('stock', 'hour',
        *[(F.col(d + '_sum(price_x_units)') / F.col(d + '_sum(units)')).alias(d) for d in days]
    ).show()
+-----+----+-----------------+----+
|stock|hour|                1|   2|
+-----+----+-----------------+----+
|    A|   1|2.042857142857143| 2.2|
|    A|   2|             null| 1.0|
|    B|   2|              1.2|null|
+-----+----+-----------------+----+
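
For the last step in the question (summing the per-cell ratios across stocks for each hour), a sketch along these lines should work on top of the frame above, assuming the select is assigned to a variable first (the name ratio_df is only for illustration):

# assign the per-day ratio frame instead of only showing it
ratio_df = pivoted_df.select(
    'stock', 'hour',
    *[(F.col(d + '_sum(price_x_units)') / F.col(d + '_sum(units)')).alias(d) for d in days]
)

# sum away the stock dimension: one row per hour, one column per day
ratio_df.groupBy('hour')\
    .agg(*[F.sum(d).alias(d) for d in days])\
    .orderBy('hour')\
    .show()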
Oli

You can try this:

from pyspark.sql import SparkSession, functions as F

# create / reuse a SparkSession
spark = SparkSession.builder.getOrCreate()

data = [('A', 1, '1:10', 1, 2.1, 3, 6.3),
        ('A', 1, '1:15', 1, 2.0, 4, 8.0),
        ('A', 2, '1:03', 1, 2.2, 2, 4.4),
        ('A', 2, '2:45', 2, 1.0, 4, 4.0),
        ('B', 1, '2:12', 2, 1.2, 5, 6.0)]

df = spark.createDataFrame(data, ["stock", "day", "time", "hour", "price", "units", "price_x_units"])

# ratio of sums per (stock, hour, day) cell, e.g. 14.3 / 7 = 2.04 for stock A, hour 1, day 1
pivotdf = df.groupby("stock", "hour").pivot("day").agg(
    (F.sum("price_x_units") / F.sum("units")).alias("ratio")
)

pivotdf.show()

# sum the per-cell ratios across stocks, keeping one row per hour and one column per day
day_cols = [c for c in pivotdf.columns if c not in ("stock", "hour")]
result = pivotdf.groupBy("hour").agg(*[F.sum(c).alias(c) for c in day_cols]).orderBy("hour")

result.show()
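
The agg above takes the ratio of sums per cell, i.e. a volume-weighted average price, which is what the 14.3 / 7 ≈ 2.04 figure in the question corresponds to. If an unweighted average of the row prices were wanted instead, a sketch like this would do (the mean_pivot name is only for illustration):

# plain average of the per-row prices in each (stock, hour, day) cell
mean_pivot = df.groupby("stock", "hour").pivot("day").agg(F.avg("price").alias("mean_price"))
mean_pivot.show()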

Aymen Azoui