I'm really struggling to make sense of all the performance tuning information I'm finding online and to apply it in my notebook.
I have a dataframe that looks like the following.
I'd like to pivot this data into a wider dataframe, i.e.:
At the moment I use a simple script:
from pyspark.sql import functions as F

def pivotData(self, data):
    # Pivot the SOURCE values into columns, keeping the first VALUE per GROUP/SUBGROUP
    df = data
    #df.persist()
    df = df.groupBy("GROUP", "SUBGROUP").pivot("SOURCE").agg(F.first("VALUE"))
    return df
The above does exactly what I need on my smaller subset of data and runs pretty quickly, but as soon as I point it at the production parquet data (which I assume holds billions of records), it takes forever.
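One tip I keep seeing (though I haven't fully verified it) is that passing the list of expected SOURCE values to pivot() up front lets Spark skip the extra job it otherwise runs just to discover the distinct pivot values. A rough sketch, using data for the same dataframe as in my function above; the values listed are placeholders from my example, not the real production set:

from pyspark.sql import functions as F

# Sketch: supply the pivot values explicitly so Spark avoids a separate
# pass over the data just to collect the distinct SOURCE values.
known_sources = ["SOURCE_A", "SOURCE_B"]  # placeholder values, not my real ones

df_wide = (
    data
    .groupBy("GROUP", "SUBGROUP")
    .pivot("SOURCE", known_sources)
    .agg(F.first("VALUE"))
)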
Other info (AWS Glue dev endpoint):
- Number of workers: 5
- Worker type: G.1X
- Data processing units (DPUs): 6
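For what it's worth, I haven't changed any Spark defaults, so the shuffle behind the groupBy/pivot presumably runs with the default 200 shuffle partitions. I've been wondering whether tuning that to roughly match the cores on 5 G.1X workers would help; the number below is only an illustrative guess on my part:

# Sketch: adjust the number of partitions used by the groupBy/pivot shuffle,
# assuming `spark` is the SparkSession from my Glue notebook session.
# 40 is just a guess based on my small cluster, not a recommendation.
spark.conf.set("spark.sql.shuffle.partitions", "40")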
This post is really just a reach-out to see if anyone has any tips on improving performance. Perhaps my code needs to change completely and move away from groupBy & pivot altogether (one rough idea is sketched below)? I have absolutely no idea what sort of speeds I should expect when working with billions of records, but every article I read seems to be doing things in seconds :(
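For example, one alternative I've seen suggested (I haven't confirmed it helps in my case) is to shrink the data with a plain aggregation on GROUP, SUBGROUP and SOURCE first, and only pivot the much smaller result. The function name here is just my own placeholder:

from pyspark.sql import functions as F

def pivotDataPreAggregated(self, data):
    # Sketch: pre-aggregate so the pivot only shuffles one row per
    # (GROUP, SUBGROUP, SOURCE) combination instead of the raw data.
    reduced = (
        data
        .groupBy("GROUP", "SUBGROUP", "SOURCE")
        .agg(F.first("VALUE").alias("VALUE"))
    )
    return (
        reduced
        .groupBy("GROUP", "SUBGROUP")
        .pivot("SOURCE")
        .agg(F.first("VALUE"))
    )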
Any tips / articles you Python / PySpark / Glue experts have would be greatly appreciated. I'm growing tired of looking at this progress bar doing nothing.