I would like to check the data in existing columns and create new columns based on certain conditions.
Problem: I have a dataset with around 500 columns and 9,000 rows. My logic is: if a column contains any null value, create a new column for it whose value is 1 where the original column is null and 0 otherwise.
But the simple code below takes hours to finish, although my data is not huge.
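import org.apache.spark.sql.functions.{col, when}

// dataset_ is a var holding the DataFrame
dataset_.schema.fields.foreach { c =>
  // Runs one filter + count job per column
  if (dataset_.filter(col(c.name).isNull).count() > 0) {
    dataset_ = dataset_.withColumn(c.name + "_isNull", when(col(c.name).isNull, 1).otherwise(0))
  }
}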
Please help me optimize my code, or suggest a different approach to achieve this.
Note: I tried the same thing on a large cluster (Spark on YARN): a Google Dataproc cluster with 3 worker nodes, each of a machine type with 32 vCPUs and 280 GB of memory.
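One direction that might avoid running a separate job per column is to count the nulls of all columns in a single aggregation, then add every flag column in one `select`. The following is only a minimal sketch of that idea and has not been benchmarked on this data. The names `NullFlagSketch` and `addNullFlags` are hypothetical, and it assumes column names contain no dots or other characters that would need backtick quoting.

```scala
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{col, count, lit, when}

object NullFlagSketch {
  // Hypothetical helper: adds "<name>_isNull" (1 for null, 0 otherwise)
  // for every column that contains at least one null value.
  def addNullFlags(df: DataFrame): DataFrame = {
    // count() ignores nulls, and when(...) without otherwise yields null
    // for non-matching rows, so each expression counts the null rows of c.
    val nullCountExprs: Array[Column] =
      df.columns.map(c => count(when(col(c).isNull, lit(1))).alias(c))

    // A single job computes the null counts for all columns at once.
    val nullCounts = df.select(nullCountExprs: _*).first()

    // Keep only the columns that actually contain nulls.
    val colsWithNulls = df.columns.filter(c => nullCounts.getAs[Long](c) > 0)

    // Build all flag columns and append them in one projection,
    // instead of chaining one withColumn call per column.
    val flags: Array[Column] =
      colsWithNulls.map(c => when(col(c).isNull, 1).otherwise(0).alias(c + "_isNull"))

    df.select(df.columns.map(col) ++ flags: _*)
  }
}
```

With the variable from the question, this could be called as `dataset_ = NullFlagSketch.addNullFlags(dataset_)`.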