
I wrote a pyspark script that is taking much too long to run. The basic flow of my script is that it loads a large set of raw data into a dataframe, splits that dataframe into smaller logical dataframes, performs aggregations on each of them, and then unions the results back together into one dataframe.
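
To make that concrete, here is a rough sketch of the pattern (purely illustrative; the path, column names, and filters below are made up and are not my employer's code):

from pyspark.sql import functions as F

# purely illustrative input; path and column names are made up
raw = spark.read.parquet("/path/to/raw_data")

# split the data into logical subsets and aggregate each one separately
sale_totals = (raw.filter(F.col("record_type") == "sale")
                  .groupBy("region")
                  .agg(F.sum("amount").alias("total")))
refund_totals = (raw.filter(F.col("record_type") == "refund")
                    .groupBy("region")
                    .agg(F.sum("amount").alias("total")))

# union the aggregated pieces back into a single dataframe
combined = sale_totals.union(refund_totals)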

I cannot paste my original script as it belongs to my employer, but I have managed to isolate this scenario and replicate it with the following on the pyspark command line:

df1 = spark.createDataFrame([(1, 'sally'), (2, 'john')], ['id', 'first_name'])
# repeated self-union: each pass doubles the dataframes in the plan (the loop count here is illustrative)
for i in range(20):
    df1 = df1.union(df1)
df1.count()

When I run this code locally, it takes about 4 minutes to get a count of the dataframe rows. Caching the dataframe brings that down to under a minute, but even that seems unacceptable for Apache Spark, which can usually process millions of rows in a couple of seconds.
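
For what it's worth, the caching I tried was just the standard persist-then-count sequence, nothing more elaborate than this:

df1.cache()   # keep the unioned dataframe in memory
df1.count()   # the first action materializes the cache; later actions reuse it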

I was able to restructure my code to avoid doing unions, but I would greatly appreciate it if someone could show me a better-performing version of the code example I've pasted. I believe this is something that Apache Spark should be able to handle efficiently, and as a developer new to Spark, it would help me a great deal to get a deeper understanding of how to fix this code.

asked by Rochelle (edited by Majid Hajibaba)
    you're creating a union of 2^n dataframes. Are you sure that's what you're trying to do? – mck May 25 '21 at 20:23
  • if you have a list of dataframes, you can do `reduce(DataFrame.unionAll, dfs)`. See https://stackoverflow.com/questions/33743978/spark-union-of-multiple-rdds – mck May 25 '21 at 20:25
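
As a minimal sketch of mck's `reduce` suggestion (the list `dfs` below is hypothetical): folding a flat list of dataframes produces n - 1 pairwise unions, so the plan grows linearly instead of doubling on every self-union.

from functools import reduce
from pyspark.sql import DataFrame

# hypothetical flat list of dataframes with identical schemas
dfs = [spark.createDataFrame([(i, 'name')], ['id', 'first_name']) for i in range(10)]

# fold the list with pairwise unions: n - 1 unions, so the plan stays linear
combined = reduce(DataFrame.unionAll, dfs)  # DataFrame.union works the same way
combined.count()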

1 Answer


In your implementation, each self-union doubles the logical plan, so planning the final union takes exponential time.
To avoid that planning cost, you can do the union at the RDD level instead:

def unionAll(*dfs):
    first, *_ = dfs  # Python 3.x unpacking; for 2.x you'll have to unpack manually
    # Union at the RDD level so Spark never has to plan a nested chain of
    # dataframe unions, then rebuild a dataframe using the first schema.
    return first.sql_ctx.createDataFrame(
        first.sql_ctx._sc.union([df.rdd for df in dfs]),
        first.schema
    )

Keep in mind that with this approach you pay the cost of converting each dataframe to an RDD and back to a dataframe.
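
For example, assuming df_a, df_b, and df_c are hypothetical stand-ins for your already-aggregated pieces, the helper above would be used like this:

# hypothetical per-group dataframes with matching schemas
df_a = spark.createDataFrame([(1, 'sally')], ['id', 'first_name'])
df_b = spark.createDataFrame([(2, 'john')], ['id', 'first_name'])
df_c = spark.createDataFrame([(3, 'maria')], ['id', 'first_name'])

combined = unionAll(df_a, df_b, df_c)  # one RDD-level union instead of a chain of dataframe unions
combined.count()  # 3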

answered by dasilva555