I have a job that takes a huge dataset and joins it with another dataset. The first time it ran it took a very long time, and the query plan showed a FileScan parquet when reading the dataset; on subsequent runs the plan shows Scan ExistingRDD instead, and the job finishes in minutes.
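
For context, here is a minimal sketch of the kind of job involved (the paths, join key, and names are placeholders, not my actual code):

```scala
import org.apache.spark.sql.SparkSession

object JoinJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("huge-join")
      .getOrCreate()

    // The large dataset, backed by parquet files.
    val big = spark.read.parquet("/data/huge_dataset")

    // The smaller dataset it is joined with.
    val other = spark.read.parquet("/data/other_dataset")

    // Placeholder join key.
    val joined = big.join(other, Seq("id"))

    // Inspecting the physical plan here is where I see the difference:
    // the first run shows "FileScan parquet", later runs show "Scan ExistingRDD".
    joined.explain()

    joined.write.mode("overwrite").parquet("/data/joined_output")
    spark.stop()
  }
}
```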
Why and how is Spark able to scan an existing RDD? Would it ever fall back to scanning the parquet files that back a dataset (and hence revert to worse performance)?