15

Can I say the following?

  1. Is the number of Spark tasks equal to the number of Spark partitions?

  2. Is one run of an executor (one batch inside the executor) equal to one task?

  3. Does every task produce only one partition?

  4. (duplicate of 1.)

cdhit
  • When data is processed in Spark, the processing is performed by tasks, which get data from the source and execute all needed transformations or actions. Transformations may be broken up across stages and generate new RDDs or DataFrames with a different number of partitions, which can affect subsequent stage execution. – luminousmen Dec 13 '17 at 11:33

3 Answers

15

The degree of parallelism, or the number of tasks that can run concurrently, is set by:

  • the number of executor instances (configuration)
  • the number of cores per executor (configuration)
  • the number of partitions being used (set in code)

Actual parallelism is the smaller of the two (see the sketch below):

  • executors * cores, which gives the number of slots available to run tasks
  • partitions: each partition translates to a task whenever a slot opens up.
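As a rough illustration, here is a minimal PySpark sketch of that min() calculation. The config keys are real Spark settings, but the sizing (4 executors x 2 cores) is made up, and spark.executor.instances only takes effect on a real cluster manager (YARN, Kubernetes), not in local mode:

    from pyspark.sql import SparkSession

    # Hypothetical sizing: 4 executors x 2 cores = 8 task slots.
    spark = (SparkSession.builder
             .appName("parallelism-demo")
             .config("spark.executor.instances", "4")
             .config("spark.executor.cores", "2")
             .getOrCreate())

    rdd = spark.sparkContext.parallelize(range(100), 20)  # 20 partitions

    slots = 4 * 2                        # executors * cores
    partitions = rdd.getNumPartitions()  # 20
    print(min(slots, partitions))        # 8 tasks run at any one time; the
                                         # other 12 partitions wait for a slot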

Tasks that run on the same executor share the same JVM. The Broadcast feature relies on this: you only need one copy of the broadcast data per executor, and all of its tasks can access it through shared memory.
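For example, a minimal broadcast sketch (local mode; the lookup data is made up):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[4]").getOrCreate()
    sc = spark.sparkContext

    lookup = {"a": 1, "b": 2}  # small table we want available to all tasks
    bc = sc.broadcast(lookup)  # shipped once per executor, not once per task

    rdd = sc.parallelize(["a", "b", "a", "c"], 4)
    print(rdd.map(lambda k: bc.value.get(k, 0)).collect())  # [1, 2, 1, 0]

Every task on a given executor reads the same in-memory copy through bc.value.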

You can have multiple executors running on the same machine or on different machines. Executors are the true means of scalability.

Note that each task takes up one thread ¹ and is assumed to be assigned to one core ².

So -

  1. Is the number of Spark tasks equal to the number of Spark partitions?

No (see previous).

  2. Is one run of an executor (one batch inside the executor) equal to one task?

An executor is started as an environment for tasks to run in. Multiple tasks can run concurrently within that executor (multiple threads).

  3. Does every task produce only one partition?

For a task it is one partition in, one partition out. However, a repartitioning or shuffle/sort can happen in between tasks (see the sketch below).
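A quick sketch of that boundary (local mode; sizes are arbitrary):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[4]").getOrCreate()
    sc = spark.sparkContext

    rdd = sc.parallelize(range(10), 4)  # 4 partitions -> 4 tasks in this stage
    mapped = rdd.map(lambda x: x * 2)   # narrow: one partition in, one out
    shuffled = mapped.repartition(8)    # shuffle boundary: next stage gets 8 tasks

    print(mapped.getNumPartitions(), shuffled.getNumPartitions())  # 4 8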

  4. Is the number of Spark tasks equal to the number of Spark partitions?

Same as (1.)

(¹) The assumption is that within your tasks you are not multithreading yourself (never do that, otherwise the core estimate will be off).

(²) Note that due to hyper-threading, you might have more than one virtual core per physical core, and thus you can have several threads per core. You might even be able to handle multiple threads (2 to 3) on a single core without hyper-threading.

YoYo
  • I think your answer is not clear on #1. The total number of tasks equals the number of partitions, though the number of ACTIVE tasks equals the possible parallelism. – Atais Apr 23 '20 at 19:40
  • Agree - but I think it is better to avoid talking about a task as a portion of work, because tasks could potentially be reused. A task is an instance of a handler for a portion of work; a partition is a portion of work. That's the way I looked at it when writing the above. – YoYo Apr 23 '20 at 20:55
7

Partitions are a feature of RDD and are only available at design time (before an action is called).

Tasks are part of TaskSet per Stage per ActiveJob in a Spark application.

Is the number of Spark tasks equal to the number of Spark partitions?

Yes.

Is one run of an executor (one batch inside the executor) equal to one task?

That recursively uses "executor" and does not make much sense to me.

Does every task produce only one partition?

Almost.

Every task produces the output of executing the code (it was created for) on the data in a partition.

Is the number of Spark tasks equal to the number of Spark partitions?

Almost.

The number of Spark tasks in a single stage equals the number of RDD partitions.
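You can verify this with a quick local experiment (a sketch; the numbers are arbitrary):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[2]").getOrCreate()

    df = spark.range(0, 1000, numPartitions=6)
    print(df.rdd.getNumPartitions())  # 6

    # Trigger a job, then check the Spark UI: the scan stage runs 6 tasks
    # (count() adds a final single-partition aggregation stage on top).
    df.count()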

Jacek Laskowski
  • Hi Jacek, my data initially has 1364 partitions, which I am repartitioning to 240 while reading; then I am applying a filter on it. But in the Spark UI, Stage 0 shows 1346 tasks, whereas df.rdd.getNumPartitions() gives 240. Could you please guide me on why I am observing such behavior? I am using pySpark 2.4.4 on EMR. – Anuranjan Sep 02 '20 at 09:05
  • @Anuranjan Please ask a separate question with enough details to help you out. Thanks! – Jacek Laskowski Sep 02 '20 at 10:01
0

1. Is the number of Spark tasks equal to the number of Spark partitions?

Yes.

Spark breaks the data up into chunks called partitions. A partition is a collection of rows that sits on one physical machine in the cluster. The default partition size is 128 MB. Partitions allow every executor to perform work in parallel. A single partition will have a parallelism of only one, even if you have many executors.

Many partitions with only one executor will still give you a parallelism of only one. You need to balance the number of executors and partitions to achieve the desired parallelism. Each partition is processed by exactly one executor at a time (one executor, one partition, one task at a time).

A good rule of thumb is that the number of partitions should be larger than the number of executors on your cluster.

See also: Chambers, Bill; Zaharia, Matei. Spark: The Definitive Guide: Big Data Processing Made Simple (p. 27). O'Reilly Media. Kindle edition.
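A minimal sketch of why this rule matters (local mode; sizes are hypothetical):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[8]").getOrCreate()
    sc = spark.sparkContext

    one_part = sc.parallelize(range(1000000), 1)  # parallelism of 1: a single task does all the work
    many_parts = one_part.repartition(16)         # now up to 8 tasks (one per core) run at once

    print(one_part.getNumPartitions(), many_parts.getNumPartitions())  # 1 16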

2. Is one run of an executor (one batch inside the executor) equal to one task?

Cores are slots for tasks, and each executor can process more than one partition at a time if it has more than one core.

3. Does every task produce only one partition?

It depends on the transformation.

Spark has wide transformations and narrow transformations.

Wide transformation: input partitions contribute to many output partitions (shuffles: aggregations, sorts, joins). This is often referred to as a shuffle, whereby Spark exchanges partitions across the cluster. When we perform a shuffle, Spark writes the results to disk.

Narrow transformation: each input partition contributes to only one output partition.

See also: Chambers, Bill; Zaharia, Matei. Spark: The Definitive Guide: Big Data Processing Made Simple. O'Reilly Media. Kindle edition.

Note: reading a file is a narrow transformation because it does not require a shuffle, but when you read a file in a splittable format like Parquet, that file will be split into many partitions. A short sketch of the narrow/wide contrast follows below.
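Here is a small sketch of the contrast (local mode; the bucket column and sizes are made up). The wide transformation's output partition count comes from spark.sql.shuffle.partitions, 200 by default (adaptive query execution, where enabled, may coalesce it):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.master("local[4]").getOrCreate()

    df = spark.range(0, 100, numPartitions=4)

    narrow = df.filter(F.col("id") % 2 == 0)                       # narrow: no shuffle
    wide = df.groupBy((F.col("id") % 10).alias("bucket")).count()  # wide: shuffle

    print(narrow.rdd.getNumPartitions())  # 4 (unchanged)
    print(wide.rdd.getNumPartitions())    # 200 by default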

Paulo Moreira