I have a question that I have not been able to find an answer to anywhere.
I am using the following lines to load data within a PySpark application:
    loadFile = self.tableName + ".csv"
    dfInput = (self.sqlContext.read
               .format("com.databricks.spark.csv")
               .option("header", "true")
               .load(loadFile))
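For what it is worth, I believe the equivalent with the CSV reader built into Spark 2.x would look roughly like the sketch below (the `spark` SparkSession name is my assumption, it is not in my code); I also print the number of input partitions, since that is what I am trying to reason about:

    # A minimal sketch assuming Spark 2.x, where CSV support is built in
    # and the com.databricks.spark.csv package is no longer needed.
    dfInput = (spark.read
               .option("header", "true")
               .csv(loadFile))

    # Number of partitions Spark created from the input file
    print(dfInput.rdd.getNumPartitions())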
My cluster configuration is as follows:
- I am using a Spark cluster with 3 nodes: 1 node runs the master, and each of the other 2 nodes runs 1 worker.
- I submit the application with a script from a login node outside the cluster.
- The script submits the Spark application in cluster deploy mode, which I believe means the driver ends up running on one of the 3 nodes I am using (I sketch a quick check for this right after this list).
- The input CSV files are stored in a globally visible temporary file system (Lustre).
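As a quick sanity check on where the driver ends up, I believe printing the hostname from the driver side of the application would already tell me which node is hosting it; a minimal sketch, using only the standard library:

    import socket

    # Runs on the driver: prints the hostname of whichever node the
    # cluster deploy mode placed the driver on.
    print("driver is running on:", socket.gethostname())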
In Apache Spark Standalone mode, how does the process of loading partitions into RAM work?
- Does each executor access the driver node's RAM and load partitions into its own RAM from there? (storage --> driver's RAM --> executor's RAM)
- Does each executor access storage directly and load partitions into its own RAM? (storage --> executor's RAM)
Or is it neither of these and I am missing something here? How can I observe this process for myself (a monitoring tool, a Unix command, somewhere in Spark)?
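For the monitoring part, one probe I can think of is to tag every partition with the hostname of the executor that actually materializes it (mapPartitionsWithIndex and socket.gethostname() are standard; the variable names are just for illustration), but I am not sure this shows the full picture of how the data travels:

    import socket

    # Each executor reports, for every partition it processes, its own
    # hostname and the number of rows it iterated over.
    partition_hosts = (dfInput.rdd
        .mapPartitionsWithIndex(
            lambda idx, rows: [(idx, socket.gethostname(), sum(1 for _ in rows))])
        .collect())

    for idx, host, count in partition_hosts:
        print("partition {}: read on {}, {} rows".format(idx, host, count))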
Any comment, or any resource where I can dig deeper into this, would be very helpful. Thanks in advance.