
I wrote a pyspark script that is taking much too long to run. The basic flow of my script is that it loads a large set of raw data into a dataframe, splits that dataframe into smaller logical dataframes, performs aggregations on each of them, and then unions the results back together into one dataframe.
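
To make that concrete, here is a rough sketch of the pattern (purely illustrative; the path, column names, and filters below are made up and are not my employer's code):

from pyspark.sql import functions as F

# purely illustrative input; path and column names are made up
raw = spark.read.parquet("/path/to/raw_data")

# split the data into logical subsets and aggregate each one separately
sale_totals = (raw.filter(F.col("record_type") == "sale")
                  .groupBy("region")
                  .agg(F.sum("amount").alias("total")))
refund_totals = (raw.filter(F.col("record_type") == "refund")
                    .groupBy("region")
                    .agg(F.sum("amount").alias("total")))

# union the aggregated pieces back into a single dataframe
combined = sale_totals.union(refund_totals)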

I cannot paste my original script as it belongs to my employer, but I have managed to isolate this scenario and replicate it with the following on the pyspark command line:

df1 = spark.createDataFrame([(1, 'sally'), (2, 'john')], ['id', 'first_name'])
# repeated self-union: each pass doubles the dataframes in the plan (the loop count here is illustrative)
for i in range(20):
    df1 = df1.union(df1)
df1.count()

When I run this code locally, it takes about 4 minutes to get a count of the dataframe rows. Caching the dataframe brings that down to under a minute, but even that seems unacceptable for Apache Spark, which can usually process millions of rows in a couple of seconds.
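
For what it's worth, the caching I tried was just the standard persist-then-count sequence, nothing more elaborate than this:

df1.cache()   # keep the unioned dataframe in memory
df1.count()   # the first action materializes the cache; later actions reuse it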

I was able to restructure my code to avoid doing unions, but I would greatly appreciate it if someone could show me a better-performing version of the code example I've pasted. I believe this is something that Apache Spark should be able to handle efficiently, and as a developer new to Spark, it would help me a great deal to get a deeper understanding of how to fix this code.

asked by Rochelle (edited by Majid Hajibaba)
    you're creating a union of 2^n dataframes. Are you sure that's what you're trying to do? – mck May 25 '21 at 20:23
  • if you have a list of dataframes, you can do `reduce(DataFrame.unionAll, dfs)`. See https://stackoverflow.com/questions/33743978/spark-union-of-multiple-rdds – mck May 25 '21 at 20:25
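
As a minimal sketch of mck's `reduce` suggestion (the list `dfs` below is hypothetical): folding a flat list of dataframes produces n - 1 pairwise unions, so the plan grows linearly instead of doubling on every self-union.

from functools import reduce
from pyspark.sql import DataFrame

# hypothetical flat list of dataframes with identical schemas
dfs = [spark.createDataFrame([(i, 'name')], ['id', 'first_name']) for i in range(10)]

# fold the list with pairwise unions: n - 1 unions, so the plan stays linear
combined = reduce(DataFrame.unionAll, dfs)  # DataFrame.union works the same way
combined.count()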

1 Answer


In your implementation, each self-union doubles the logical plan, so planning the final union takes exponential time.
To avoid that planning cost, you can do the union at the RDD level instead:

def unionAll(*dfs):
    first, *_ = dfs  # Python 3.x unpacking; for 2.x you'll have to unpack manually
    # Union at the RDD level so Spark never has to plan a nested chain of
    # dataframe unions, then rebuild a dataframe using the first schema.
    return first.sql_ctx.createDataFrame(
        first.sql_ctx._sc.union([df.rdd for df in dfs]),
        first.schema
    )

Keep in mind that with this approach you pay the cost of converting each dataframe to an RDD and back to a dataframe.
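
For example, assuming df_a, df_b, and df_c are hypothetical stand-ins for your already-aggregated pieces, the helper above would be used like this:

# hypothetical per-group dataframes with matching schemas
df_a = spark.createDataFrame([(1, 'sally')], ['id', 'first_name'])
df_b = spark.createDataFrame([(2, 'john')], ['id', 'first_name'])
df_c = spark.createDataFrame([(3, 'maria')], ['id', 'first_name'])

combined = unionAll(df_a, df_b, df_c)  # one RDD-level union instead of a chain of dataframe unions
combined.count()  # 3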

answered by dasilva555