Please bear with me while I describe my use case before getting to the actual question. Say I get rddStud from somewhere:
import org.apache.spark.rdd.RDD

val rddStud: RDD[(String, Student)] = ???
Here the String key is some random string, and Student is defined as: case class Student(name: String, id: String, arrivalTime: Long, classId: String)
I am using Student only as an example; the actual business logic uses a different, more complicated class with many fields.
What I want to achieve: students with the same id must be processed in ascending order of their arrivalTime.
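To make the requirement concrete, here is a minimal sketch using plain Scala collections (no Spark). The sample records and the name expectedOrder are made up for illustration; it only shows the per-id ordering I am after, not how it should be produced on a cluster.

```scala
// Same shape as the Student class used in this question.
case class Student(name: String, id: String, arrivalTime: Long, classId: String)

// Hypothetical sample data, for illustration only.
val sample = Seq(
  Student("Ann", "s1", 30L, "c1"),
  Student("Bob", "s2", 10L, "c1"),
  Student("Ann", "s1", 20L, "c2"),
  Student("Bob", "s2", 40L, "c2")
)

// For each id, the arrival times in the order the records should be processed.
val expectedOrder: Map[String, Seq[Long]] =
  sample.groupBy(_.id).map { case (id, students) =>
    id -> students.sortBy(_.arrivalTime).map(_.arrivalTime)
  }
```

The relative order of records with different ids does not matter to me; only the order within one id does.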
Here is what I am doing to achieve this:
import org.apache.spark.HashPartitioner

// Re-key the RDD: Student.id -> Student
val studMapRdd: RDD[(String, Student)] = rddStud.map(tuple => {
  val student = tuple._2
  (student.id, student)
})
// Make sure all students with the same student.id are in the same partition.
// I could potentially use groupByKey/combineByKey etc., but I don't see much performance difference.
val studPartitionRdd: RDD[(String, Student)] = studMapRdd.partitionBy(new HashPartitioner(studMapRdd.getNumPartitions))
// Sort by arrivalTime in ascending order.
val studSortedRdd: RDD[(String, Student)] = studPartitionRdd.sortBy({ case (studentId, student) =>
  student.arrivalTime
}, ascending = true)
studSortedRdd.foreachPartition(itr => {
  itr.foreach { case (studentId, student) =>
    val studentName = student.name
    val time = student.arrivalTime
    // send the studentName and time combination for additional processing
  }
})
My questions are:
- If I use foreachPartitionAsync, will it process all partitions in parallel, but the elements within each partition in order? If not, what is the difference between foreachPartitionAsync and foreachAsync?
- Does the approach of sorting after repartitioning seem reasonable? Or could you suggest any optimizations to the above logic?
Much appreciated.