When Spark loads source data from a file into a DataFrame, what factors determine whether the data is loaded entirely into memory on a single node (most likely the driver/master node) or only in the minimal, parallel subsets needed for the computation (presumably on the worker/executor nodes)?
In particular, if the input format is Parquet and the data is loaded via the Spark DataFrame API, what is required to ensure that reading the Parquet file is parallelized and delegated to the executors, and limited to the columns needed by the computation on each executor?
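
To make the scenario concrete, here is a minimal sketch of the kind of load I have in mind (the path and column names are placeholders, and the comments reflect my current assumptions rather than confirmed behaviour):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("parquet-load-sketch")
  .getOrCreate()

// My understanding is that this is lazy: it builds a logical plan
// and reads only Parquet metadata, not the data itself.
val df = spark.read.parquet("hdfs:///data/events.parquet")

// Select only the columns the computation needs, then aggregate.
// I am assuming column pruning means executors will read only these
// columns from their assigned row groups.
val result = df
  .select("user_id", "amount")
  .groupBy("user_id")
  .sum("amount")

// Presumably only an action like this triggers the distributed read
// on the executors, rather than pulling the file onto one node.
result.show()
```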
(I am looking to understand how Spark schedules the loading of source data within the distributed execution plan, so that no single node exhausts its memory by loading the full data set.)