
I have the following PySpark DataFrame:

year    week    date        time           value
2020    1       20201203    2:00 - 2:15    23.9
2020    1       20201203    2:15 - 2:30    45.87
2020    1       20201203    2:30 - 2:45    87.76
2020    1       20201203    2:45 - 3:00    12.87

I want to transpose the time and value columns. The desired output should be:

year    week    date        2:00 - 2:15    2:15 - 2:30    2:30 - 2:45    2:45 - 3:00
2020    1       20201203    23.9           45.87          87.76          12.87

1 Answer


You can use `groupby` and `pivot`:

df = df.groupby('year', 'week', 'date').pivot('time').max('value')
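
For a self-contained run, here is a minimal sketch; the sample rows are copied from the question, and the session setup is an assumption:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample data from the question
df = spark.createDataFrame(
    [
        (2020, 1, '20201203', '2:00 - 2:15', 23.9),
        (2020, 1, '20201203', '2:15 - 2:30', 45.87),
        (2020, 1, '20201203', '2:30 - 2:45', 87.76),
        (2020, 1, '20201203', '2:45 - 3:00', 12.87),
    ],
    ['year', 'week', 'date', 'time', 'value'],
)

# Each distinct 'time' value becomes its own column; max acts as a
# placeholder aggregate, since every (year, week, date, time) cell
# holds exactly one value.
df = df.groupby('year', 'week', 'date').pivot('time').max('value')
df.show()

This produces a single row with the four time-slot columns, matching the desired output above.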
  • Why is the aggregation max? We don't have multiple values in the 'value' column for a corresponding 'time'. Won't first() serve the purpose? – Asma Damani Nov 17 '22 at 00:27
  • The data is huge, and pivot is taking ages to run. A more efficient approach is needed. – Asma Damani Nov 17 '22 at 00:48
  • `max`, `min`, or `first` would return the same result because there is only one value per group, so you can pick your favorite. Note that for `first` you need `.agg(F.first('value'))` (see the sketch after these comments). – Emma Nov 17 '22 at 01:50
  • For efficiency, check https://stackoverflow.com/questions/69191203/most-efficient-way-of-pivoting-huge-dataframes-pyspark, or look up "pyspark efficient pivot" on Stack Overflow; you'll find a couple of similar questions (one common optimization is sketched below). – Emma Nov 17 '22 at 01:59
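
Expanding on the two comments above, a minimal sketch of both suggestions; the time-slot labels are taken from the question, and df is the DataFrame from the answer's setup:

from pyspark.sql import functions as F

# first() variant: max/min/first all give the same result here,
# since each (year, week, date, time) group holds exactly one value.
result = df.groupby('year', 'week', 'date').pivot('time').agg(F.first('value'))

# For large data, passing the pivot values explicitly lets Spark skip
# the extra job that computes the distinct values of 'time'.
slots = ['2:00 - 2:15', '2:15 - 2:30', '2:30 - 2:45', '2:45 - 3:00']
result = df.groupby('year', 'week', 'date').pivot('time', slots).max('value')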