
I have a job that takes in a huge dataset and joins it with another dataset. The first time it ran, it took a very long time, and Spark executed a FileScan parquet when reading the dataset; on subsequent runs the query plan shows Scan ExistingRDD instead, and the build completes in minutes.

Why and how is Spark able to scan an existing RDD? Would it ever fall back to scanning the parquet files that back a dataset (and hence revert to worse performance)?


1 Answer


There are two common situations in Foundry in which you'll see this:

  1. You're using a DataFrame you defined manually through createDataFrame (see the sketch after this list)
  2. You're running an incremental transform whose input has no changes, so Transforms hands you an empty synthetic DataFrame it has created for you (a special case of 1.)
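
A minimal PySpark sketch of both cases (the schema and values here are made up for illustration, and the exact column ids in the plan output will differ):

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, LongType, StringType

    spark = SparkSession.builder.getOrCreate()

    # Case 1: rows defined manually on the driver.
    manual_df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])
    manual_df.explain()
    # == Physical Plan ==
    # *(1) Scan ExistingRDD[id#0L,val#1]

    # Case 2: an empty DataFrame with an explicit schema, similar in spirit
    # to the synthetic input Transforms substitutes for an unchanged
    # incremental input.
    schema = StructType([
        StructField("id", LongType()),
        StructField("val", StringType()),
    ])
    empty_df = spark.createDataFrame([], schema)
    empty_df.explain()
    # The plan again shows Scan ExistingRDD; no files are touched.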

If we follow the Spark source for the operator printed as Scan ExistingRDD, we land in RDDScanExec, a leaf plan node that maps over an existing RDD of InternalRows (a representation of literal values held by the Driver and synthesized into a DataFrame).
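
For contrast, a parquet-backed input produces a genuine file scan, which matches the FileScan parquet you saw on the first run. A quick sketch (the path below is a hypothetical stand-in for the files backing a dataset):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # "/path/to/backing/files" is hypothetical; reading parquet from disk
    # yields a file scan rather than an RDD scan.
    files_df = spark.read.parquet("/path/to/backing/files")
    files_df.explain()
    # == Physical Plan ==
    # *(1) FileScan parquet [id#0L,val#1] Batched: true, Format: Parquet, ...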
