When Spark loads source data from a file into a DataFrame, what factors govern whether the data are loaded fully into memory on a single node (most likely the driver/master node) or in the minimal, parallel subsets needed for computation (presumably on the worker/executor nodes)?

In particular, if using Parquet as the input format and loading via the Spark DataFrame API, what considerations are necessary in order to ensure that loading from the Parquet file is parallelized and deferred to the executors, and limited in scope to the columns needed by the computation on the executor node in question?

(I am looking to understand the mechanism Spark uses to schedule loading of source data in the distributed execution plan, in order to avoid exhausting memory on any one node by loading the full data set.)

sumitsu

1 Answer

As long as you use Spark operations, all data transformations and aggregations are performed only on the executors. The driver therefore never needs to load the data; its job is to manage the processing flow. Data reaches the driver only when you use a terminal operation such as collect(), first(), show(), toPandas(), toLocalIterator() and similar. Additionally, the executors do not load the entire file contents into memory; each one reads only the smallest possible chunks, which are called partitions.
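
As a minimal PySpark sketch of this (the file path and column names are placeholders): reading and transforming are lazy and run on the executors partition by partition, and only the aggregated result reaches the driver when collect() is called.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lazy-loading-demo").getOrCreate()

# At this point only the Parquet footers/schema are read, not the data itself.
df = spark.read.parquet("/data/events.parquet")

# Still lazy: these transformations are added to the plan and will be
# executed on the executors, one partition at a time.
agg = df.filter(F.col("status") == "OK").groupBy("country").count()

# Only now does data move to the driver, and only the (small) aggregated
# result -- not the source partitions.
result = agg.collect()

# Partitioning of the source read is visible on the underlying RDD.
print(df.rdd.getNumPartitions())
```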

If you use a columnar storage format such as Parquet, only the columns required by the execution plan are loaded - this is the default behaviour in Spark.
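
You can verify the column pruning in the physical plan. In this sketch (hypothetical path and column names), the FileScan's ReadSchema entry should list only the selected columns:

```python
# Select a subset of columns; because Parquet is columnar, the scan reads
# only those column chunks from the files.
subset = spark.read.parquet("/data/events.parquet").select("country", "status")

# The physical plan's FileScan line (ReadSchema) shows which columns are read.
subset.explain()
```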

Edit: I just saw that there might be a bug in Spark: if you use nested columns inside your schema, unnecessary columns may be loaded. See: Why does Apache Spark read unnecessary Parquet columns within nested structures?
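
A hedged sketch for checking this on your own data (the nested column user.address.city is an assumption, and the behaviour depends on the Spark version): newer releases expose a nested-schema-pruning flag, and inspecting the plan shows whether the whole struct or only the selected field is read.

```python
# Assumes the data contains a struct column such as user.address.city.
# On older versions selecting a single nested field may still pull in the
# whole struct; newer versions can prune nested fields when this flag is on.
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", "true")

nested = spark.read.parquet("/data/events.parquet").select("user.address.city")
nested.explain()  # inspect ReadSchema to see whether the full struct is loaded
```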

Mariusz