
I am curious about how scope works with DataFrames and Spark. In the example below, I have a list of files, each independently loaded into a DataFrame; some operation is performed, and then we write dfOutput to disk.

val files = getListOfFiles("outputs/emailsSplit")

for (file <- files) {

   val df = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("delimiter", "\t")         // Delimiter is tab
      .option("parserLib", "UNIVOCITY")  // Parser, which deals better with the email formatting
      .schema(customSchema)              // Schema of the table
      .load(file.toString)               // Input file

   val dfOutput = df.[stuff happens]

   dfOutput.write
      .format("com.databricks.spark.csv")
      .mode("overwrite")
      .option("header", "true")
      .save("outputs/sentSplit/sentiment" + file.toString + ".csv")

}
  1. Is each DataFrame inside the for loop discarded when a loop iteration is done, or do they all stay in memory?
  2. If they are not discarded, what is a better way to do memory management at this point?
Béatrice Moissinac

1 Answer


DataFrame objects are tiny. However, they can reference cached data on Spark executors, and they can reference shuffle files on Spark executors. When the DataFrame is garbage collected, that also causes the cached data and shuffle files to be deleted on the executors.

In your code there are no references to the DataFrames past the loop, so they are eligible for garbage collection. Garbage collection typically happens in response to memory pressure. If you worry about shuffle files filling up the disk, it may make sense to trigger an explicit GC to make sure shuffle files are deleted for DataFrames that are no longer referenced.
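For illustration, a minimal sketch of that explicit-GC idea, reusing the loop from the question (getListOfFiles and customSchema are the asker's; System.gc() is only a request, which the JVM may defer):

    val files = getListOfFiles("outputs/emailsSplit")

    for (file <- files) {
       val df = sqlContext.read
          .format("com.databricks.spark.csv")
          .schema(customSchema)
          .load(file.toString)
       // ... transformations and write, as in the question ...
    }

    // Past the loop, no driver-side references to the per-file DataFrames
    // remain. A GC on the driver lets Spark's ContextCleaner remove the
    // shuffle files they produced on the executors.
    System.gc()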

Depending on what you do with the DataFrame ([stuff happens]) it may be that no data is ever stored in memory. This is the default mode of operation in Spark. If you just want to read some data, transform it, and write it back out, it will all happen row by row, never storing any of it in memory. (Caching only happens when you explicitly ask for it.)
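As a sketch of that last point (the input path is hypothetical): nothing is pinned in executor memory unless you call cache() or persist(), and unpersist() releases it eagerly instead of waiting for GC:

    val df = sqlContext.read
       .format("com.databricks.spark.csv")
       .schema(customSchema)
       .load("outputs/emailsSplit/someFile")  // hypothetical input

    df.cache()           // mark df for caching; nothing happens yet (lazy)
    println(df.count())  // first action materializes df and fills the cache
    println(df.count())  // second action is served from the cache
    df.unpersist()       // eagerly drop the cached blocks on the executors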

With all that, I suggest not worrying about memory management until you have problems.

Daniel Darabos
  • Thank you, very instructive answer! – Béatrice Moissinac Jun 27 '16 at 16:17
  • Given a linear chain of transformations over a DataFrame, say df1 = df0.bla(); df2 = df1.blabla(); df3 = df2.blablabla(), when is df1 garbage collected? At the end of the scope it is in, or when the program realizes it won't be used further down the line (essentially once df2 is created, because there is no further call to df1)? – Béatrice Moissinac Jul 11 '16 at 18:13
  • Descendant RDDs (like those in `df2`) reference their parents (those in `df1`). So `df1` will only be collected once it is out of scope and all of its descendants have been garbage collected as well. This is because RDDs are lazy. The instruction in `df1` (e.g. "read this file") will not be executed immediately, only when an action (e.g. "count the lines") is performed. So a reference must be kept to the ancestor RDDs. – Daniel Darabos Jul 12 '16 at 08:06
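To make that lineage point concrete, a small sketch with real transformations standing in for bla()/blabla() (the path and the default spark-csv column name C0 are assumptions):

    val df1 = sqlContext.read
       .format("com.databricks.spark.csv")
       .load("in.csv")                      // lazy: nothing is read yet
    val df2 = df1.filter("C0 is not null")  // lazy: df2's plan references df1's
    val df3 = df2.select("C0")              // lazy: lineage chain df3 -> df2 -> df1

    // Only this action executes the chain. df3 holds references up the
    // lineage, so df1 stays reachable (and un-collected) as long as df3 is.
    println(df3.count())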