
I've loaded a file into a DataFrame in a Zeppelin notebook like this:

val df = spark.read.format("com.databricks.spark.csv").load("some_file").toDF("c1", "c2", "c3")

This DataFrame has more than 10 million rows, and I would like to start working with just a subset of the rows, so I use limit:

val df_small = df.limit(1000)

However, now when I try to filter the DataFrame on the string value of one of the columns, I get different results every time I run the following:

df_small.filter($"c1".like("something")).show()

How can I take a subset of df that remains stable for every filter I run?

Karmen

1 Answer


Spark evaluates transformations lazily, so the two statements above only execute when .show() is called, and df_small can resolve to a different set of rows on each run. You can either write df_small out to a file and read that file back every time, or call df_small.cache().
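A minimal sketch of both options (the parquet path and the .like pattern are just placeholders, not from the original question):

// Option 1: cache the subset so the same 1000 rows are reused by every filter
val df_small = df.limit(1000).cache()
df_small.count()                                   // forces materialization of the cache
df_small.filter($"c1".like("something")).show()    // stable across repeated runs

// Option 2: write the subset out once and read it back each time
df.limit(1000).write.mode("overwrite").parquet("df_small.parquet")
val df_small_stable = spark.read.parquet("df_small.parquet")
df_small_stable.filter($"c1".like("something")).show()

Either way, the subset is fixed once instead of being recomputed from the full 10-million-row source on every action.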

toofrellik