I have a question about sequential processing within a Spark batch. To keep things simple, here is a stylized version of the question I am trying to get answered.
import org.apache.spark.sql.SparkSession
val spark = SparkSession
  .builder()
  .appName("Simple Dataframe Processing")
  .config("spark.some.config.option", "some-value")
  .getOrCreate()
// For implicit conversions like converting RDDs to DataFrames
import spark.implicits._
val df = spark.read.json("devices.json")
// Displays the content of the DataFrame to stdout
df.show()
// +-----------+---------+
// |device-guid|Operation|
// +-----------+---------+
// |       1234|    Add 3|
// |       1234|    Sub 3|
// |       1234|    Add 2|
// |       1234|    Sub 2|
// |       1234|    Add 1|
// |       1234|    Sub 1|
// +-----------+---------+
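For reference, devices.json holds one JSON object per line (the JSON Lines layout that spark.read.json expects); the records below just mirror the rows shown above:

{"device-guid": 1234, "Operation": "Add 3"}
{"device-guid": 1234, "Operation": "Sub 3"}
{"device-guid": 1234, "Operation": "Add 2"}
{"device-guid": 1234, "Operation": "Sub 2"}
{"device-guid": 1234, "Operation": "Add 1"}
{"device-guid": 1234, "Operation": "Sub 1"}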
// I have a database with one table, with the following columns:
//   device-guid (primary key), result
// I would like to take df and, for each row, run an update against the
// single DB row for that device, adding or subtracting the number given
// in the Operation column.
// So the result I expect in the DB at the end is a single row:
//   device-guid   result
//   1234          0
df.foreach { row =>
  UpdateDB(row) // Update the DB with the row's Operation
                // (actual method not shown; rough sketch below)
}
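For concreteness, here is a rough sketch of what UpdateDB does. The table name, JDBC URL, and SQL are placeholders standing in for my real code, and I am assuming device-guid comes out of the JSON read as a numeric (long) column:

import java.sql.DriverManager
import org.apache.spark.sql.Row

// Placeholder connection string for the real database
val jdbcUrl = "jdbc:postgresql://dbhost:5432/devicedb?user=app&password=secret"

def UpdateDB(row: Row): Unit = {
  val guid = row.getAs[Long]("device-guid")
  // Operation is of the form "Add 3" or "Sub 3"
  val Array(op, n) = row.getAs[String]("Operation").trim.split("\\s+")
  val delta = if (op == "Add") n.toInt else -n.toInt

  // Apply the delta to the device's single row
  val conn = DriverManager.getConnection(jdbcUrl)
  try {
    val stmt = conn.prepareStatement(
      """UPDATE devices SET result = result + ? WHERE "device-guid" = ?""")
    stmt.setInt(1, delta)
    stmt.setLong(2, guid)
    stmt.executeUpdate()
  } finally {
    conn.close()
  }
}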
Let us say I run this on a Spark cluster on YARN with 5 executors, 2 cores each, across 5 worker nodes. What in Spark guarantees that the UpdateDB operations are scheduled and executed in the order of the rows in the DataFrame, and never scheduled and executed in parallel?
i.e. I always want to end up with 0 in the result column of my DB.
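My understanding is that foreach launches one task per partition of df, so I can see how much parallelism to expect with a check like this (just a sanity check, not a fix):

// Each partition becomes its own foreach task, so more than one
// partition here means the updates can run concurrently
println(df.rdd.getNumPartitions)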
The broader question is: what guarantees sequential processing of operations on a DataFrame, even with multiple executors and cores?
Can you point me to the Spark documentation that indicates these tasks will be processed in sequence?
Is there any Spark property that needs to be set for this to work? For example, would a single-partition rewrite like the sketch below be required?
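// Just a sketch of what I am considering, not something I have verified:
// collapse everything into one partition so only one task does the updates
df.coalesce(1).foreach { row =>
  UpdateDB(row)
}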
Regards,
Venkat