I wrote a PySpark script that is taking much too long to run. The basic flow of my script is that it takes a large set of raw data and loads it into a dataframe. It then splits that dataframe into smaller logical dataframes, performs aggregations on each one, and then unions them all back together into a single dataframe.
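Roughly, the shape of the flow is like this (every name here, including the input path, column names, and the list of categories, is a placeholder I made up for illustration, not taken from my actual script):

from functools import reduce
from pyspark.sql import functions as F

raw = spark.read.parquet("/path/to/raw")  # load the raw data (placeholder path)
categories = ["a", "b", "c"]  # placeholder list of logical groups
# split into smaller logical dataframes
parts = [raw.filter(raw.category == c) for c in categories]
# aggregate each piece separately
aggs = [p.groupBy("id").agg(F.sum("amount").alias("total")) for p in parts]
# union everything back into one dataframe
result = reduce(lambda a, b: a.union(b), aggs)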
I cannot paste my original script because it belongs to my employer, but I have managed to isolate the scenario and replicate it with the following in the pyspark shell:
df1 = spark.createDataFrame([(1, 'sally'), (2, 'john')], ['id', 'first_name'])
df1 = df1.union(df1)
df1.count()
When I run this code locally, the count takes about 4 minutes. Caching the dataframe brings it down to under a minute, which I still don't find acceptable for Apache Spark, which can usually process millions of rows in a couple of seconds.
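For reference, the caching I tried was just marking the dataframe before counting it, along these lines:

df1 = df1.cache()  # keep the unioned dataframe in memory
df1.count()        # faster than before, but still well short of seconds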
I was able to restructure my code to avoid the unions entirely, but I would greatly appreciate it if someone could show me a better-performing version of the code example I've pasted. I believe this is something Apache Spark should be able to handle efficiently, and as a developer new to Spark, it would help me a lot to get a deeper understanding of how to fix this code.
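In case it matters, the restructuring I ended up with was essentially a single aggregation over the whole dataframe instead of splitting and unioning; in the same placeholder terms as above, something like:

# aggregate once over the full dataframe, grouping by the column I was splitting on
result = raw.groupBy("category", "id").agg(F.sum("amount").alias("total"))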