
I've loaded a file into a DataFrame in a Zeppelin notebook like this:

val df = spark.read.format("com.databricks.spark.csv").load("some_file").toDF("c1", "c2", "c3")

This DataFrame has more than 10 million rows, and I would like to start working with just a subset of the rows, so I use limit:

val df_small = df.limit(1000)

However, now when I try to filter the DataFrame on the string value of one of the columns, I get different results every time I run the following:

df_small.filter($"c1".like("something")).show()

How can I take a subset of df that remains stable for every filter I run?

Karmen

1 Answer


Spark evaluates transformations lazily, so the two statements above only execute when .show() is called, and df_small can resolve to a different set of rows on each run. You can either write df_small out to a file and read that file back every time, or call df_small.cache().
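A minimal sketch of both options (the parquet path and the .like pattern are just placeholders, not from the original question):

// Option 1: cache the subset so the same 1000 rows are reused by every filter
val df_small = df.limit(1000).cache()
df_small.count()                                   // forces materialization of the cache
df_small.filter($"c1".like("something")).show()    // stable across repeated runs

// Option 2: write the subset out once and read it back each time
df.limit(1000).write.mode("overwrite").parquet("df_small.parquet")
val df_small_stable = spark.read.parquet("df_small.parquet")
df_small_stable.filter($"c1".like("something")).show()

Either way, the subset is fixed once instead of being recomputed from the full 10-million-row source on every action.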

toofrellik