I am curious about how scope works with DataFrames in Spark. In the example below, I have a list of files; each is independently loaded into a DataFrame, some operations are performed, and then dfOutput is written to disk.
val files = getListOfFiles("outputs/emailsSplit")
for (file <- files) {
  val df = sqlContext.read
    .format("com.databricks.spark.csv")
    .option("delimiter", "\t")        // delimiter is tab
    .option("parserLib", "UNIVOCITY") // parser, which deals better with the email formatting
    .schema(customSchema)             // schema of the table
    .load(file.toString)              // input file

  val dfOutput = df.[stuff happens]

  dfOutput.write
    .format("com.databricks.spark.csv")
    .mode("overwrite")
    .option("header", "true")
    .save("outputs/sentSplit/sentiment" + file.toString + ".csv")
}
- Is each DataFrame inside the for loop discarded when its iteration finishes, or do they all stay in memory?
- If they are not discarded, what is a better way to manage memory at this point?