
Is my understanding right?

  1. Application: one spark-submit.

  2. Job: once a lazy evaluation happens, there is a job.

  3. Stage: it is related to the shuffle and to the type of transformation. The boundary of a stage is hard for me to understand.

  4. Task: it is a unit operation. One transformation per task, one task per transformation.

I would appreciate any help improving this understanding.

– cdhit

5 Answers


The main function is the application: one spark-submit launches one application (the driver program plus its executors).

When you invoke an action on an RDD, a "job" is created. Jobs are work submitted to Spark.

Jobs are divided into "stages" at shuffle boundaries: each wide transformation (one that requires a shuffle) ends one stage and starts the next.

Each stage is further divided into tasks based on the number of partitions in the RDD. So tasks are the smallest units of work for Spark.
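
To make that concrete, here is a minimal sketch in Scala (the data, app name and partition count are made up): the action submits one job, the reduceByKey shuffle splits that job into two stages, and each stage runs one task per partition.

    import org.apache.spark.sql.SparkSession

    object JobStageTaskDemo {
      def main(args: Array[String]): Unit = {
        // One spark-submit of this main() = one application.
        val spark = SparkSession.builder.appName("JobStageTaskDemo").getOrCreate()
        val sc = spark.sparkContext

        // 4 partitions, so the first stage runs as 4 tasks.
        val words = sc.parallelize(Seq("a", "b", "a", "c"), numSlices = 4)

        // Transformations are lazy; no job has been submitted yet.
        val counts = words.map(w => (w, 1)).reduceByKey(_ + _)

        // The action submits one job. reduceByKey needs a shuffle,
        // so the job is split into two stages at that boundary.
        counts.collect().foreach(println)

        spark.stop()
      }
    }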

– rakesh

Application - A user program built on Spark using its APIs. It consists of a driver program and executors on the cluster.

Job - A parallel computation consisting of multiple tasks that gets spawned in response to a Spark action (e.g., save(), collect()). During interactive sessions with Spark shells, the driver converts your Spark application into one or more Spark jobs. It then transforms each job into a DAG. This, in essence, is Spark’s execution plan, where each node within a DAG could be a single or multiple Spark stages.

Stage - Each job gets divided into smaller sets of tasks called stages that depend on each other. As part of the DAG nodes, stages are created based on what operations can be performed serially or in parallel. Not all Spark operations can happen in a single stage, so they may be divided into multiple stages. Often stages are delineated on the operator’s computation boundaries, where they dictate data transfer among Spark executors.

Task - A single unit of work or execution that will be sent to a Spark executor. Each stage is comprised of Spark tasks (a unit of execution), which are then federated across each Spark executor; each task maps to a single core and works on a single partition of data. As such, an executor with 16 cores can have 16 or more tasks working on 16 or more partitions in parallel, making the execution of Spark’s tasks exceedingly parallel!

[Figure: Spark stage creating one or more tasks to be distributed to executors]

Disclaimer: content copied from Learning Spark.
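
As a rough illustration of those terms (this sketch is not from the book; the data and names are invented), a spark-shell session like the one below runs inside one application, and the single action at the end spawns a job whose shuffle splits it into stages:

    import org.apache.spark.sql.SparkSession

    // In spark-shell the session already exists; builder.getOrCreate() just returns it.
    val spark = SparkSession.builder.appName("LearningSparkTerms").getOrCreate()
    import spark.implicits._

    val scores = Seq(("alice", 1), ("bob", 2), ("alice", 3)).toDF("name", "score")

    // groupBy/sum requires a shuffle, so the job triggered below is planned
    // as a DAG with (at least) two stages.
    val totals = scores.groupBy("name").sum("score")

    // The action spawns the job; each stage's tasks map one-to-one to partitions,
    // and each task runs on a single executor core.
    totals.collect()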

– venus
    How could an executor with 16 cores work on "more" than 16 tasks? – mkirzon May 26 '22 at 11:45
  • Let's say you have 32 tasks: the first 16 run in parallel, and then, as each of those first 16 finishes, the next task (so the 17th of 32 would be first up) gets picked up by the freed core. – sepandr Apr 14 '23 at 17:19
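
A small sketch of what the comments describe (the numbers are illustrative, and the config flag is only one way to size the executor): with 16 executor cores available, a stage of 32 tasks is processed in two waves of 16.

    // Assume the executors offer 16 cores in total, e.g. launched with
    //   spark-submit --executor-cores 16 --num-executors 1 ...
    val data = spark.range(0, 1000000).repartition(32)

    // The stage after the repartition has 32 partitions => 32 tasks.
    // Only 16 can run at once (one per core); the remaining 16 start
    // as soon as earlier tasks finish and free a core.
    data.count()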

From 7-steps-for-a-developer-to-learn-apache-spark

An anatomy of a Spark application usually comprises Spark operations, which can be either transformations or actions on your data sets using Spark’s RDDs, DataFrames or Datasets APIs. For example, in your Spark app, if you invoke an action, such as collect() or take() on your DataFrame or Dataset, the action will create a job. A job will then be decomposed into single or multiple stages; stages are further divided into individual tasks; and tasks are units of execution that the Spark driver’s scheduler ships to Spark Executors on the Spark worker nodes to execute in your cluster. Often multiple tasks will run in parallel on the same executor, each processing its unit of partitioned dataset in its memory.
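
For example (the path and filter below are made up), the lazy transformations only build up the plan, and it is the action that creates the job:

    // Assumes a spark-shell style session where `spark` already exists.
    import spark.implicits._

    val logs   = spark.read.text("/path/to/logs")          // transformation: lazy, no job yet
    val errors = logs.filter($"value".contains("ERROR"))   // transformation: still lazy

    // Only the action creates a job; the driver's scheduler decomposes it into
    // stages and ships the resulting tasks to executors on the worker nodes.
    val firstErrors = errors.take(10)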

– Michael West

A very nice definition I found in the Cloudera documentation; here is the key point.

In MapReduce, the highest-level unit of computation is a job. A job loads data, applies a map function, shuffles it, applies a reduce function, and writes data back out to persistent storage. But in Spark, the highest-level unit of computation is an application. A Spark application can be used for a single batch job, an interactive session with multiple jobs, or a long-lived server continually satisfying requests. A Spark job can consist of more than just a single map and reduce.
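
To illustrate the last point with a sketch (the app name and data are invented): one spark-submit of the program below runs a single application, and each action inside it submits its own job.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col

    object MultiJobApp {
      def main(args: Array[String]): Unit = {
        // One spark-submit => one application.
        val spark = SparkSession.builder.appName("MultiJobApp").getOrCreate()
        val nums = spark.range(1, 1000)

        // Each action submits its own job within the same application.
        val total = nums.count()                             // job 1
        val evens = nums.filter(col("id") % 2 === 0).count() // job 2

        println(s"total=$total, evens=$evens")
        spark.stop()
      }
    }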

– Rags

In Spark, when spark-submit is called, the user code is divided into small parts called jobs, stages and tasks.

Job - A Job is a sequence of Stages, triggered by an Action such as .count(), foreachRDD(), collect(), read() or write().

Stage - A Stage is a sequence of Tasks that can all be run together, in parallel, without a shuffle. For example: using .read to read a file from disk, then running .map and .filter, can all be done without a shuffle, so it fits in a single stage.

Task - A Task is a single operation (e.g. .map or .filter) applied to a single Partition. Each Task is executed as a single thread in an Executor. If your dataset has 2 Partitions, an operation such as filter() will trigger 2 Tasks, one for each Partition. In other words, Tasks are executed on executors, and their number depends on the number of partitions: one task is needed for one partition.
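
A sketch of that split (the file path is made up and the 2-partition read is assumed): read, map and filter stay in one stage, while a wide operation such as distinct() adds a shuffle and therefore a second stage.

    // Assume the file is read into 2 partitions.
    val lines = spark.sparkContext.textFile("/data/events.txt", minPartitions = 2)
    val kept  = lines.map(_.trim).filter(_.nonEmpty)   // narrow ops: same stage as the read

    // No shuffle so far: read + map + filter form a single stage,
    // and with 2 partitions that stage runs as 2 tasks (one thread each).
    kept.count()

    // distinct() needs a shuffle, so the job for this action has two stages.
    kept.distinct().count()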

– ak17