
In the Spark shell I use the code below to read from a CSV file:

val df = spark.read.format("org.apache.spark.csv").option("header", "true").option("mode", "DROPMALFORMED").csv("/opt/person.csv") //spark here is the spark session
df.show()

Assume this displays 10 rows. If I add a new row to the CSV by editing the file, would calling df.show() again show the new row? If so, does that mean the DataFrame reads from the external source (in this case a CSV file) on every action?

Note that I am neither caching the DataFrame nor recreating it using the Spark session.

Andy Dufresne

2 Answers


TL;DR A DataFrame is no different from an RDD; you can expect the same rules to apply.

With a simple plan like this, the answer is yes: the data will be read for every show, although if the action doesn't require all the data (as here), it won't read the complete file.

In the general case (complex execution plans), data can be accessed from the shuffle files.
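
As a minimal spark-shell sketch of this behaviour (assuming the same /opt/person.csv from the question and no caching, every action re-evaluates the plan against the current contents of the file):

val df = spark.read.option("header", "true").option("mode", "DROPMALFORMED").csv("/opt/person.csv")

df.show()   // reads only as many rows as needed to render the display (20 by default)
// ... append a row to /opt/person.csv outside Spark ...
df.show()   // re-evaluates the plan, re-reads the file, and the new row appears
df.count()  // an action that needs every row forces a complete scan of the file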

  • I didn't follow your last statement. Also, doesn't Spark try to keep the RDD in memory if memory is available (even when cache() or persist() are not called)? What would be the right documentation link that explains this behavior in detail? – Andy Dufresne Dec 05 '16 at 12:32

After each action, Spark forgets about the loaded data and any intermediate variable values used in between.

So, if you invoke 4 actions one after another, it computes everything from the beginning each time.

The reason is simple: Spark works by building a DAG, which describes the path of operations from reading the data to the action, and then it executes it.

That is why cache and broadcast variables exist. The onus is on the developer to cache the data or DataFrame if they know it will be reused N times.
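
A hedged sketch of that advice (Scala, assuming the df from the question; persist, cache and unpersist are standard Dataset methods):

import org.apache.spark.storage.StorageLevel

val df = spark.read.option("header", "true").csv("/opt/person.csv")

// Without this, every action rebuilds the result from the DAG, re-reading the file.
df.persist(StorageLevel.MEMORY_AND_DISK)   // df.cache() uses the default storage level

df.count()       // first action: reads the file and populates the cache
df.show()        // later actions are served from the cached data
df.unpersist()   // release the cache once the DataFrame is no longer reused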

Abhishek Anand