
I am using Apache Spark to read data from SQL Server and export it to CSV, with the following version details:

  • implementation 'com.microsoft.azure:spark-mssql-connector_2.12:1.2.0'
  • implementation 'org.apache.spark:spark-core_2.12:3.1.3'
  • implementation group: 'org.apache.spark', name: 'spark-sql_2.12', version: '3.1.3'

Here, each table's export to CSV is further split into multiple tasks through the following configurable options:

  • "lowerBound"
  • "upperBound"
  • "numPartitions"
  • "partitionColumn"

So, if numPartitions is 5, there will be 5 tasks under 1 job.
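
For reference, the read looks roughly like this (a minimal sketch; the URL, table, and column names are placeholders, and the format string is the one documented for the spark-mssql-connector):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("SqlServerToCsv").getOrCreate()

// Partitioned read: numPartitions = 5 means the export of this table runs as 5 tasks.
// All connection details below are placeholders.
val df = spark.read
  .format("com.microsoft.sqlserver.jdbc.spark")
  .option("url", "jdbc:sqlserver://<host>:1433;databaseName=<db>")
  .option("dbtable", "dbo.MyTable")
  .option("partitionColumn", "id")
  .option("lowerBound", "1")
  .option("upperBound", "1000000")
  .option("numPartitions", "5")
  .load()

df.write.option("header", "true").csv("/output/MyTable")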

I am looking for help with the following:

On each task's completion, I need to do some task-specific operations (with some task-specific data), so is there any way to hook a listener to each task or job?

I know there is a way to hook a listener by extending SparkListener, but that is attached to the whole SparkContext and therefore cannot perform the task-specific operations.
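
For example, this is the kind of global listener I mean (a minimal sketch; it fires for the tasks of every job in the application, not just one export job):

import org.apache.spark.sql.SparkSession
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

val spark = SparkSession.builder().appName("listener-demo").getOrCreate()

// Global listener: receives task-end events from EVERY job in this SparkContext,
// not only from one specific export job.
spark.sparkContext.addSparkListener(new SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    println(s"Task ${taskEnd.taskInfo.taskId} of stage ${taskEnd.stageId} finished")
  }
})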

Atul Kumar
  • To clarify, I am asking whether we can have a separate listener for each job, not the global listener applied to all jobs through the SparkContext. – Atul Kumar Sep 22 '22 at 11:32
  • You would have a better chance of getting useful answers if you provided as many details as practical about the "tasks" you are talking about and the "listeners" you (think that you) need to implement. Check out [this link](https://stackoverflow.com/help/minimal-reproducible-example) for some useful hints on ways to write "good" SO questions. – Dima Oct 05 '22 at 14:27
  • Who has to perform the operations, the executors or an external application? – Emiliano Martinez Oct 07 '22 at 10:57

2 Answers


As others have already pointed out, there is no way to attach a listener to a specific set of tasks. However, using mapPartitions you can execute arbitrary code after (or before) a partition of the dataset has been processed. As discussed in this answer, a partition and a task are closely related.

As an example, a simple CSV file with two columns and ten rows is used. The goal is to convert the second column to uppercase and print a message as soon as a partition has been processed completely.

id,column
1,a
2,b
[...]
10,j

The code:

import spark.implicits._ // provides the Encoder for the (Int, String) tuples returned below

val df = spark.read.option("header", true).option("inferSchema", true).csv(<file>)
  .repartition(5) // create 5 partitions with 2 rows each
df.mapPartitions(it => {
  var counter = 0
  val result = it.toList.map(row => {
    counter = counter + 1
    val resultForRow = row.getString(1).toUpperCase // the "business logic"
    (row.getInt(0), resultForRow)
  })
  println(s"${Thread.currentThread().getName()}:  I have processed ${counter} rows") // the code to be executed after a partition is done
  result.iterator
}).show()

Output:

Executor task launch worker for task 0.0 in stage 4.0 (TID 3):  I have processed 2 rows
Executor task launch worker for task 2.0 in stage 6.0 (TID 6):  I have processed 2 rows
Executor task launch worker for task 1.0 in stage 6.0 (TID 5):  I have processed 2 rows
Executor task launch worker for task 0.0 in stage 6.0 (TID 4):  I have processed 2 rows
Executor task launch worker for task 3.0 in stage 6.0 (TID 7):  I have processed 2 rows

The code inside of mapPartitions runs within the executors, so the output above will appear in the executor logs.

werner

There is no way to attach a listener to tasks. If you have specific logic to execute after a job completes, it is better to submit multiple jobs to the Spark cluster. Hope this helps!
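
As a sketch of that idea (all connection details, ranges, and the id column below are illustrative), each iteration triggers its own job, so job-specific code can run on the driver right after that job finishes:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("PerRangeExport").getOrCreate()

// Illustrative ranges; in practice derive them from lowerBound/upperBound/numPartitions.
val ranges = Seq((1L, 200000L), (200001L, 400000L), (400001L, 600000L))

ranges.foreach { case (lo, hi) =>
  val part = spark.read
    .format("jdbc")
    .option("url", "jdbc:sqlserver://<host>:1433;databaseName=<db>") // placeholder
    .option("dbtable", s"(SELECT * FROM dbo.MyTable WHERE id BETWEEN $lo AND $hi) q") // placeholder
    .load()

  // The write below is an action, i.e. a separate Spark job per range.
  part.write.mode("overwrite").option("header", "true").csv(s"/output/MyTable_${lo}_$hi")

  // The job for this range has completed; run the range-specific follow-up here on the driver.
  println(s"Finished exporting rows $lo to $hi")
}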

GPopat