I have a script that unions multiple DataFrames and writes the result to a CSV file. I need to optimize its execution speed. I recently learned about cache and unpersist, and this is what I did:
val grc = Tables.getGRC(spark) // This is my first DF.
val grc_cache = grc.cache()

var sigma = Tables.getSIGMA(spark, use_database_sigma(0)) // Second DF
var sigma_cache = sigma.cache()
for (i <- 1 until use_database_sigma.length) { // Second DF is a union of multiple DFs
  if (use_database_sigma(i) != "") {
    sigma = sigma.union(Tables.getSIGMA(spark, use_database_sigma(i)))
    sigma_cache = sigma.cache()
  }
}
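For comparison, I read that caching every intermediate union may be wasteful, and that an alternative is to build the whole union lazily and cache only the final result. A minimal sketch of that idea, reusing my own `Tables.getSIGMA` helper (I have not confirmed this is faster):

```scala
// Build the union lazily: no intermediate DataFrame is cached,
// Spark just accumulates the logical plan.
val sigmaAll = use_database_sigma
  .filter(_ != "")                          // skip empty database names
  .map(db => Tables.getSIGMA(spark, db))    // one DataFrame per database
  .reduce(_ union _)                        // single union over all of them

// Cache once, on the DataFrame that will actually be reused.
val sigma_cache = sigmaAll.cache()
```
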
val grc_sigma = sigma.union(grc) // Is this correct? Should I union the cached DFs?
val res = grc_sigma.cache()
LogDev.ecrireligne("total : " + grc_sigma.count())
res.repartition(1).write
  .mode(SaveMode.Overwrite)
  .format("csv")
  .option("header", true)
  .option("delimiter", "|")
  .save(Parametre_vigiliste.cible)
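My understanding is that cache() is lazy, so the count() is the first action that actually materializes the cache, and the write should then reuse it instead of recomputing the union. A sketch of that assumption (the output path here is hypothetical):

```scala
// cache() only marks the plan for caching; nothing is stored yet.
val res = grc_sigma.cache()

// First action: triggers the computation and fills the cache.
val total = res.count()

// Second action: should read the cached partitions rather than
// recomputing the whole union from the source tables.
res.repartition(1).write
  .mode(SaveMode.Overwrite)
  .format("csv")
  .save("/tmp/out") // hypothetical path, for illustration only
```
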
val conf = new Configuration()
val fs = FileSystem.get(conf)
val file = fs.globStatus(new Path(Parametre_vigiliste.cible + "/part*"))(0).getPath().getName()
fs.rename(
  new Path(Parametre_vigiliste.cible + "/" + file),
  new Path(Parametre_vigiliste.cible + "/FIC_PER_DATALAKE_" + dataF + ".csv"))

res.unpersist()
grc_cache.unpersist()
sigma_cache.unpersist()
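One thing I am not sure about is whether `new Configuration()` picks up the same filesystem settings that Spark used for the write. A variant I have seen reuses the Hadoop configuration carried by the SparkSession (I have not verified this is required in my setup):

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

// Reuse the Hadoop configuration of the SparkSession, so the rename
// resolves the same filesystem (HDFS, S3, local) as the write did.
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)

val dir  = new Path(Parametre_vigiliste.cible)
val part = fs.globStatus(new Path(dir, "part*"))(0).getPath
fs.rename(part, new Path(dir, s"FIC_PER_DATALAKE_$dataF.csv"))
```
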
Is what I did correct? Thank you.