
I read an RDBMS table from a PostgreSQL DB as follows:

val dataDF = spark.read.format("jdbc")
  .option("url", connectionUrl)
  .option("dbtable", s"(${execQuery}) as year2017")
  .option("user", devUserName)
  .option("password", devPassword)
  .option("numPartitions", 10)
  .load()

The numPartitions option denotes the number of partitions the data is split into, so that each partition can be processed in parallel; in this case it is 10. I thought this was a cool option in Spark until I came across the spark-submit flags --num-executors, --executor-cores and --executor-memory. I read about these three spark-submit parameters from this link: here
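
For reference, these flags are passed to spark-submit roughly like this (the class name, jar and values here are just placeholders, not my actual job):

spark-submit \
  --num-executors 5 \
  --executor-cores 4 \
  --executor-memory 8g \
  --class com.example.MyApp \
  my-app.jar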

What I don't understand is: if both are used for parallel processing, how do they differ from each other?

Could anyone let me know the difference between the above-mentioned options?

Metadata

1 Answer


In read.jdbc(..numPartitions..), numPartitions is the number of partitions your data (Dataframe/Dataset) has. In other words, all subsequent operations on the read Dataframe will have a degree of parallelism equal to numPartitions. (This option also controls the number of parallel connections made to your JDBC source.)
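
As a rough sketch of how this plays out (the table name employees and its numeric id column are made up, and note that for JDBC reads numPartitions usually needs partitionColumn, lowerBound and upperBound alongside it for the read itself to actually be split):

// Illustrative only: assumes an existing SparkSession named spark
// and the same connection variables as in the question.
val df = spark.read.format("jdbc")
  .option("url", connectionUrl)
  .option("dbtable", "employees")
  .option("user", devUserName)
  .option("password", devPassword)
  .option("partitionColumn", "id")   // numeric column to split the read on
  .option("lowerBound", "1")
  .option("upperBound", "100000")
  .option("numPartitions", 10)       // up to 10 parallel JDBC connections
  .load()

println(df.rdd.getNumPartitions)     // 10 -> subsequent stages run 10 tasks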

To understand --num-executors, --executor-cores, --executor-memory, you should understand the concept of a Task. Every operation you perform on a Dataframe (or Dataset) converts to a Task on a partition of the Dataframe. Thus, one Task exists for each operation on each partition of the data.

Tasks execute on an Executor. --num-executors controls the number of executors that will be spawned by Spark, and thus controls the parallelism of your Tasks. The other two options, --executor-cores and --executor-memory, control the resources you provide to each executor. How you set them depends, among other things, on the number of executors you wish to have on each machine.
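
To make the relationship concrete, here is a sketch with arbitrary numbers, tying it back to the 10-partition Dataframe from the question:

# Illustrative values only (manual allocation, e.g. on YARN):
#   5 executors x 4 cores = 20 task slots in total, so a stage over the
#   10-partition Dataframe can run all 10 of its tasks at the same time.
spark-submit \
  --num-executors 5 \
  --executor-cores 4 \
  --executor-memory 8g \
  --class com.example.YourJob \
  your-job.jar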

P.S: This assumes that you are manually allocating the resources. Spark is also capable of dynamic allocation.
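
As a minimal sketch, dynamic allocation is switched on through standard Spark properties such as these (the executor bounds are illustrative; older clusters also need the external shuffle service enabled for it to work):

spark-submit \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.dynamicAllocation.minExecutors=2 \
  --conf spark.dynamicAllocation.maxExecutors=20 \
  --conf spark.shuffle.service.enabled=true \
  --class com.example.YourJob \
  your-job.jar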

For more information on this, you could use these links:


EDITS:

The following statement has an important caveat:

all subsequent operations on the read Dataframe will have a degree of parallelism equal to numPartitions.

Operations such as joins and aggregations (which involve a shuffle), as well as operations such as union (which does not shuffle data), can change the number of partitions.
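
A quick sketch of this, assuming two hypothetical Dataframes df1 and df2 that were each read with 10 partitions in an existing SparkSession (the id column is made up; 200 is the default value of spark.sql.shuffle.partitions):

println(df1.rdd.getNumPartitions)                        // 10
println(df1.union(df2).rdd.getNumPartitions)             // 20: union concatenates partitions
println(df1.groupBy("id").count().rdd.getNumPartitions)  // 200 by default (spark.sql.shuffle.partitions)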

suj1th
  • Okay. So these two are independent of each other, right? – Metadata Sep 18 '18 at 10:41
  • In a way, yes. The number of partitions control the parallelism. The above executor configurations control how interleaved the execution on the partitions will be. – suj1th Sep 18 '18 at 11:00
  • `all subsequent operations on the read Dataframe will have a degree of parallelism equal to numPartitions` this is not true, certain operations (shuffles) change the number of partitions – Raphael Roth Sep 18 '18 at 11:20
  • I agree. I should have mentioned a caveat that operations such as `joins`, or aggregations (which involve shuffle) as well as operations such as `union` (which does not involve shuffle) could change the partition factor. Will edit the answer. Thank you, @RaphaelRoth. – suj1th Sep 18 '18 at 11:36