
I have the following PySpark DataFrame:

year    week    date        time           value
2020    1       20201203    2:00 - 2:15    23.9
2020    1       20201203    2:15 - 2:30    45.87
2020    1       20201203    2:30 - 2:45    87.76
2020    1       20201203    2:45 - 3:00    12.87

I want to transpose the time and value columns. The desired output should be:

year    week    date        2:00 - 2:15    2:15 - 2:30    2:30 - 2:45    2:45 - 3:00
2020    1       20201203    23.9           45.87          87.76          12.87

1 Answer


You can use `groupby` and `pivot`:

df = df.groupby('year', 'week', 'date').pivot('time').max('value')
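
For a self-contained run, here is a minimal sketch; the sample rows are copied from the question, and the session setup is an assumption:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample data from the question
df = spark.createDataFrame(
    [
        (2020, 1, '20201203', '2:00 - 2:15', 23.9),
        (2020, 1, '20201203', '2:15 - 2:30', 45.87),
        (2020, 1, '20201203', '2:30 - 2:45', 87.76),
        (2020, 1, '20201203', '2:45 - 3:00', 12.87),
    ],
    ['year', 'week', 'date', 'time', 'value'],
)

# Each distinct 'time' value becomes its own column; max acts as a
# placeholder aggregate, since every (year, week, date, time) cell
# holds exactly one value.
df = df.groupby('year', 'week', 'date').pivot('time').max('value')
df.show()

This produces a single row with the four time-slot columns, matching the desired output above.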
  • Why is the aggregation max? We don't have multiple values in the 'value' column for a corresponding 'time'. Won't first() serve the purpose? – Asma Damani Nov 17 '22 at 00:27
  • The data is huge, and pivot is taking ages to run. A more efficient approach is needed. – Asma Damani Nov 17 '22 at 00:48
  • `max`, `min`, or `first` would return the same result because there is only one value per group, so you can pick your favorite. Note that for `first` you need `.agg(F.first('value'))` (see the sketch after these comments). – Emma Nov 17 '22 at 01:50
  • For efficiency, check https://stackoverflow.com/questions/69191203/most-efficient-way-of-pivoting-huge-dataframes-pyspark, or look up "pyspark efficient pivot" on Stack Overflow; you'll find a couple of similar questions (one common optimization is sketched below). – Emma Nov 17 '22 at 01:59
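
Expanding on the two comments above, a minimal sketch of both suggestions; the time-slot labels are taken from the question, and df is the DataFrame from the answer's setup:

from pyspark.sql import functions as F

# first() variant: max/min/first all give the same result here,
# since each (year, week, date, time) group holds exactly one value.
result = df.groupby('year', 'week', 'date').pivot('time').agg(F.first('value'))

# For large data, passing the pivot values explicitly lets Spark skip
# the extra job that computes the distinct values of 'time'.
slots = ['2:00 - 2:15', '2:15 - 2:30', '2:30 - 2:45', '2:45 - 3:00']
result = df.groupby('year', 'week', 'date').pivot('time', slots).max('value')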