
I have a script that takes the union of multiple dataframes and writes the result to a CSV file. I need to optimize its execution speed. I recently learned about cache and unpersist. This is what I did:

    val grc = Tables.getGRC(spark) // This is my first DF.
    val grc_cache = grc.cache()

    var sigma = Tables.getSIGMA(spark, use_database_sigma(0)) // Second DF, built as a union of multiple DFs
    var sigma_cache = sigma.cache()
    for (i <- 1 until use_database_sigma.length) {
      if (use_database_sigma(i) != "") {
        sigma = sigma.union(Tables.getSIGMA(spark, use_database_sigma(i)))
        sigma_cache = sigma.cache()
      }
    }

    val grc_sigma = sigma.union(grc) // Is this correct? Should I union the cached DFs?
    val res = grc_sigma.cache()
    LogDev.ecrireligne("total : " + grc_sigma.count())
    res.repartition(1).write.mode(SaveMode.Overwrite)
      .format("csv").option("header", true).option("delimiter", "|")
      .save(Parametre_vigiliste.cible)

    val conf = new Configuration()
    val fs = FileSystem.get(conf)
    val file = fs.globStatus(new Path(Parametre_vigiliste.cible + "/part*"))(0).getPath().getName()
    fs.rename(new Path(Parametre_vigiliste.cible + "/" + file),
              new Path(Parametre_vigiliste.cible + "/FIC_PER_DATALAKE_" + dataF + ".csv"))

    res.unpersist()
    grc_cache.unpersist()
    sigma_cache.unpersist()

Is what I did correct? Thank you

Haha

2 Answers


If you don't need the source data twice, there is no need for caching. In fact, the overhead of loading the data into memory only makes your process slower. The only place where a dataframe is used twice is the count. If you can live without the count, just remove it and remove all the .cache calls. Otherwise it is enough to cache grc_sigma once, right before the count. If the count is only needed for logging, it might even be faster to re-read the dataframe from disk after saving it and count that one.
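As a rough illustration of that advice (a sketch, not the asker's final code; Tables, use_database_sigma, Parametre_vigiliste and LogDev are reused from the question, and the write options are the same as in the original), the flow could look like this:

    // Sketch: no intermediate caching; build the union lazily and write it once.
    val grc = Tables.getGRC(spark)
    var sigma = Tables.getSIGMA(spark, use_database_sigma(0))
    for (i <- 1 until use_database_sigma.length if use_database_sigma(i) != "")
      sigma = sigma.union(Tables.getSIGMA(spark, use_database_sigma(i)))

    val grc_sigma = sigma.union(grc)
    grc_sigma.write.mode(SaveMode.Overwrite)
      .format("csv").option("header", true).option("delimiter", "|")
      .save(Parametre_vigiliste.cible)

    // If the count is only needed for logging, count the freshly written CSV
    // instead of keeping the whole dataframe cached in memory.
    val written = spark.read
      .option("header", true).option("delimiter", "|")
      .csv(Parametre_vigiliste.cible)
    LogDev.ecrireligne("total : " + written.count())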

The biggest performance red flag, though, is the repartition(1): it forces Spark to do all of the writing on a single task instead of in parallel. You can check in the Spark UI whether the last task takes much longer than all the others.
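As a hedged sketch of the alternative (my illustration, not code from the answer): dropping repartition(1) lets every partition be written in parallel, at the cost of getting a directory of part files instead of one; repartitioning to a small fixed number, as also suggested in the comments below, is a middle ground.

    // Sketch, assuming downstream tools can read a directory of part files.
    // Without repartition(1), each partition is written in parallel as its own part-*.csv.
    res.write.mode(SaveMode.Overwrite)
      .format("csv").option("header", true).option("delimiter", "|")
      .save(Parametre_vigiliste.cible)

    // If a bounded number of output files is wanted, repartition to a small n
    // (the 8 here is arbitrary) rather than forcing everything onto one task.
    res.repartition(8).write.mode(SaveMode.Overwrite)
      .format("csv").option("header", true).option("delimiter", "|")
      .save(Parametre_vigiliste.cible)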

Paul
  • Thank you for your answer. What alternative do you suggest to replace repartition(1)? – Haha Sep 06 '19 at 12:18
  • Spark is not a good tool for writing to csv if the result should be a single file. If you don't mind more than 1 file you can just use repartition(numberOfFiles) – Paul Sep 06 '19 at 13:48
  • I really don't get the amount of people wanting coalesce(1), repartition(1) and renaming of directories. Use the toolset for what it is intended for, in the way it is intended. – thebluephantom Sep 07 '19 at 12:42

I don't think you need to cache the data here. I even doubt that you should be using Spark for this at all, since you are writing to a single file; note that writing everything into a single file can be slow. The real power of Spark comes from parallelism.

Also, caching is not required because caching only helps the next time you use the same data. In your case you can probably omit the count and remove caching altogether. The count is an action that materializes the dataframe, and writing the data to the output location is another action. If you are not using the same data again, there is no point in keeping it in memory.
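For contrast, a minimal sketch (not from the question; the input and output paths are placeholders) of the one case where cache does pay off, namely when the same dataframe feeds more than one action:

    // Caching helps only when the same dataframe is reused across several actions.
    val df = spark.read.parquet("/some/input")   // placeholder input path
      .filter("amount > 0")
      .cache()                                   // marks df for caching; filled by the first action

    val n = df.count()                           // first action: computes df and populates the cache
    df.write.mode(SaveMode.Overwrite).parquet("/some/output") // second action: served from the cache

    df.unpersist()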

If the data volume is not large, you could use a simple Java or Scala program to write the file to HDFS, or write it to the local file system and move it to HDFS later.

This link may be useful. Write a file in hdfs with Java
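A minimal Scala sketch of that idea (the target path and the CSV content are placeholders), using the standard Hadoop FileSystem API:

    import java.nio.charset.StandardCharsets
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    // Write one small CSV file straight to HDFS without going through Spark.
    val conf = new Configuration()
    val fs = FileSystem.get(conf)
    val out = fs.create(new Path("/data/FIC_PER_DATALAKE.csv"), true) // overwrite if it exists
    try {
      out.write("col1|col2\nval1|val2\n".getBytes(StandardCharsets.UTF_8)) // placeholder content
    } finally {
      out.close()
    }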

Piyush Patel