Iterate Spark Dataframe running slow

Question

I would like to verify existing column's data and create new column based on the certain conditions.

Problem: I have dataset around 500 columns and 9K rows (9000). Per my logic, if one of the column has any null value then create new column with respect of that column and set null values of original column to 1 and rest 0.

But below simple code takes hours to finish although my data is not huge.

dataset_.schema.fields.map(c => {
  if(dataset_.filter(col(c.name).isNull).count() > 0)
  {
    dataset_ = dataset_.withColumn(c.name + "_isNull", when(col(c.name).isNull, 1).otherwise(0))
  }
})

Please help me to optimize my code or provide me feedback to achieve it with difference approach.

Note: I had tried same thing on big cluster (spark yarn). Google Dataproc cluster (3 worker node, machine type 32 vCPU, 280 GB memory)

Aditya Chopra · Answer 1 · 2018-05-08T00:26:51.383

I tried a number of things...
1) Try to cache when you're creating dataframe from the csv file or any other source

2) Also, if it doesn't affect the logic, we can try changing this if(dataset_.filter(col(c.name).isNull).count() > 0) to if(flightData.filter(col(x.name).isNull).take(1) != null )
Instead of counting the over all the data, we can just check whether any column name was null or not As take(1) will move on as soon as it find even a single record whereas .count() will keep on running and then compare it with 0

3) Moreover as per the current logic, we can change map to foreach. However, it won't affect the performance but ideally it should be foreach.

I have tried these on a dataset having 16 columns and about 10Lakh records. It took 33s after applying all these.

Here's the Spark UI Snapshot!

Since you're having 500 columns so the runtime should decrease at massive scale applying these as compared to my data set.
I hope this helps!

score 0 · Answer 2 · answered May 08 '18 at 00:50

Compute all counts at the same time:

val convert = df.select(
  df.columns.map(c => (count(c) =!= count("*")).alias(c)): _*
).first.getValuesMap[Boolean](df.columns)

and use the result to add columns

convert.collect { case (c, true) => c }.foldLeft(df) {
  (df, c) => df.withColumn(s"${c}_isNull", when(col(c).isNull, 1).otherwise(0))
}

Iterate Spark Dataframe running slow

2 Answers2