
I'm really struggling to make sense of all the performance-tuning information I'm finding online and to apply it in my notebook.

I have a dataframe that looks like this:

(screenshot: long-format dataframe with columns GROUP, SUBGROUP, SOURCE, VALUE)

I'd like to pivot this data into a wider dataframe, i.e.:

(screenshot: desired wide dataframe, one column per SOURCE value)

At the moment I use a simple script:

    from pyspark.sql import functions as F

    def pivotData(self, data):
        df = data

        # df.persist()

        # one output column per distinct SOURCE value, keeping the first VALUE per group
        df = df.groupBy("GROUP", "SUBGROUP").pivot("SOURCE").agg(F.first(F.col("VALUE")))

        return df

The above does exactly what I need on my smaller subset of data pretty quickly, but as soon as I plug in the production parquet data (which I assume has billions of records), it takes forever.

Other info: AWS Dev endpoint:

  • Number of workers: 5
  • Worker type: G.1X
  • Data processing units (DPUs): 6

This post is really just a reach-out to see if anyone has any tips on improving performance. Perhaps my code needs to change completely and move away from groupBy and pivot? I have absolutely no idea what sort of speeds I should be seeing when working with billions of records, but every article I read seems to be doing things in seconds :(

Any tips / articles you Python / PySpark / Glue experts have would be greatly appreciated. I'm growing tired of looking at this progress bar doing nothing.


Fizor
  • You could try specifying the `values=[]` parameter. It takes in the list of values to make columns - they should be of the same case as in the dataframe column. [link](http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.GroupedData.pivot) -- this increases efficiency by a lot because Spark does not have to take a distinct of the pivot column and sort it while collecting it as a list - you're saving it from doing those costly steps! (See the first sketch after these comments.) – samkart Sep 15 '21 at 10:32
  • added understanding -- if you look at the spark jobs, your pivot step would have been executed in 2 separate jobs - the second one being called the pivot job, first one being a mapper or reducer (this job collects the distinct elements from the pivot column as a sorted list). With the `values=[]` parameter, it goes straight to the pivot job as it does not have to read the column names from the data. – samkart Sep 15 '21 at 10:40
  • Also, 5 workers are not that many for a job that is supposed to run on billions of records. – Robert Kossendey Sep 15 '21 at 10:42
  • agreed, maybe add the volume of data and the size of cluster along with the job specs for us to deep dive a little. – samkart Sep 15 '21 at 10:45
  • Thanks for the info so far, guys - let me see if I can get that information you are after. I tried performing a `count()` on my prod dataframe, but that also takes SO long that I end up cancelling it. – Fizor Sep 15 '21 at 10:48
  • Seems our Dev endpoints spin up an EC2 instance on another account which I don't have access to, so unfortunately I cannot get any cluster / hardware info for you. Looks like I'm just going to have to continue my struggle and optimize as best I can. Thank you for the above input! :) – Fizor Sep 15 '21 at 11:46
  • @Mike There's this [video](https://www.youtube.com/watch?v=daXEp4HmS-E&t=2856s&ab_channel=Databricks) published by Databricks with tips to improve your performance. It's not a silver bullet, but it really helps with understanding the key points of tuning. – Kafels Sep 15 '21 at 11:49
  • As an alternative you can consider using `SELECT CASE` in PySpark SQL: https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-case.html with grouping like here: https://stackoverflow.com/questions/51372566/presto-sql-pivoting-for-lack-of-a-better-word-data (see the second sketch below). – Beniamin H Sep 15 '21 at 17:52
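
To make samkart's `values=[]` suggestion concrete, here is a minimal sketch using the column names from the question (GROUP, SUBGROUP, SOURCE, VALUE); the SOURCE values "SRC_A" / "SRC_B" and the toy rows are hypothetical placeholders - substitute the actual distinct values from the production data:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()

    # Toy long-format dataframe with the same shape as the question's data
    long_df = spark.createDataFrame(
        [("g1", "s1", "SRC_A", 10.0),
         ("g1", "s1", "SRC_B", 20.0),
         ("g1", "s2", "SRC_A", 30.0)],
        ["GROUP", "SUBGROUP", "SOURCE", "VALUE"],
    )

    # Passing the pivot values up front lets Spark skip the extra job that
    # collects and sorts the distinct SOURCE values before pivoting.
    wide_df = (
        long_df.groupBy("GROUP", "SUBGROUP")
               .pivot("SOURCE", ["SRC_A", "SRC_B"])
               .agg(F.first("VALUE"))
    )
    wide_df.show()

If the set of SOURCE values is stable, it can be hard-coded or read from a small reference table instead of being discovered from billions of rows at pivot time.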
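
And a rough sketch of the `SELECT CASE` alternative Beniamin H links to - one conditional aggregate per expected SOURCE value, reusing the toy `long_df` and `spark` from the sketch above (again, "SRC_A" / "SRC_B" are hypothetical):

    long_df.createOrReplaceTempView("long_df")

    # GROUP is a reserved word in Spark SQL, hence the backticks;
    # first(..., true) ignores the NULLs produced by non-matching rows.
    wide_df_case = spark.sql("""
        SELECT `GROUP`, SUBGROUP,
               first(CASE WHEN SOURCE = 'SRC_A' THEN VALUE END, true) AS SRC_A,
               first(CASE WHEN SOURCE = 'SRC_B' THEN VALUE END, true) AS SRC_B
        FROM long_df
        GROUP BY `GROUP`, SUBGROUP
    """)
    wide_df_case.show()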

0 Answers