
I will come to the actual question, but please bear with my use case first. Say I have the following RDD from somewhere:

val rddStud: RDD[(String,Student)] = ???

where `String` is some arbitrary string and `Student` is:

case class Student(name: String, id: String, arrivalTime: Long, classId: String)

I am using Student only as an example; the actual business logic involves a far more complicated class with many fields.

What I want to achieve is: students with the same id must be processed in ascending order of their arrivalTime.

For this, here's what I am doing:

//Re-key the RDD as student.id -> student
val studMapRdd: RDD[(String,Student)] = rddStud.map { case (_, student) =>
  (student.id, student)
}

//Make sure all students with the same student.id end up in the same partition.
//I could potentially use groupByKey/combineByKey etc., but I don't see much performance difference.
val studPartitionRdd: RDD[(String,Student)] = studMapRdd.partitionBy(new HashPartitioner(studMapRdd.getNumPartitions))

val studSortedRdd: RDD[(String,Student)] = studPartitionRdd.sortBy({ case (studentId, student) =>
    student.arrivalTime
  }, ascending = true)

studSortedRdd.foreachPartition { itr =>
  itr.foreach { case (studentId, student) =>
    val studentName = student.name
    val time = student.arrivalTime
    //send the studentName and time combination for additional processing
  }
}

My questions are:

  1. If I use foreachPartitionAsync - will it process all partitions in parallel, but the elements within each partition in order? If not, what's the difference between foreachPartitionAsync and foreachAsync?
  2. Does the approach of sorting after repartitioning seem reasonable? Or could you suggest any optimizations to the above logic?

Much appreciated.

  • Why do you hash partition just to sortBy in the next step? It doesn't make sense at all. `foreachPartition` uses exactly the same mechanism as `foreach`, with partition-wise parallelism. – zero323 Jun 28 '16 at 14:46
  • Say the RDD has 3 partitions where events from the student with id=1 are spread across all 3. Hash partitioning will ensure all events for id=1 end up in the same partition, say p1, but it won't ensure that they are sorted by arrivalTime - which is how I want to process them due to business requirements. Is there anything wrong in my understanding? – K P Jun 28 '16 at 15:59

1 Answer


Neither the choice between synchronous (foreach(Partition)) and asynchronous (foreach(Partition)Async) submission, nor the choice between element-wise and partition-wise access, will affect execution order. In the first case the important difference is blocking vs. non-blocking execution; in the second case it is the way in which the data is exposed, but the actual execution mechanism is more or less the same.
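
For illustration, a minimal sketch of the async variant, assuming the studSortedRdd from the question (the println is a stand-in for the real processing):

import scala.concurrent.Await
import scala.concurrent.duration.Duration

// foreachPartitionAsync returns immediately with a FutureAction; the job
// itself goes through the same scheduler as the blocking foreachPartition,
// so the per-partition iteration order is unchanged.
val f: org.apache.spark.FutureAction[Unit] =
  studSortedRdd.foreachPartitionAsync { iter =>
    iter.foreach { case (_, student) =>
      println(s"${student.name} @ ${student.arrivalTime}") // stand-in for real processing
    }
  }

Await.result(f, Duration.Inf) // block only if/when completion matters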

Sorting after repartitioning is not a valid approach. sortBy will trigger a full shuffle and won't preserve the existing data distribution. If you want to preserve the existing layout you can either sort within a subsequent mapPartitions phase or, even better, use repartitionAndSortWithinPartitions.
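
For completeness, here is a minimal sketch of the mapPartitions alternative, reusing studPartitionRdd from the question. Note that it buffers each partition in memory in order to sort it:

// Sort each partition locally by arrivalTime; no second shuffle is triggered.
// preservesPartitioning keeps the HashPartitioner association intact.
val sortedWithinPartitions: RDD[(String, Student)] = studPartitionRdd.mapPartitions(
  iter => iter.toArray.sortBy { case (_, student) => student.arrivalTime }.iterator,
  preservesPartitioning = true
)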

class StudentIdPartitioner[V](n: Int) extends org.apache.spark.Partitioner {
  def numPartitions: Int = n
  def getPartition(key: Any): Int = {
    // Hash on the student id only, so all records for a given id
    // land in the same partition.
    val x = key.asInstanceOf[Student].id.hashCode % n
    // hashCode can be negative; shift into the [0, n) range.
    x + (if (x < 0) n else 0)
  }
}

val rddStud: RDD[Student] = ???
val partitioner = new StudentIdPartitioner(rddStud.getNumPartitions)
val arrTimeOrdering = scala.math.Ordering.by[Student, Long](_.arrivalTime)


{
  // repartitionAndSortWithinPartitions picks up the implicit Ordering[Student]
  // and sorts each partition by arrivalTime as part of the shuffle.
  implicit val ord = arrTimeOrdering
  rddStud.map((_, null)).repartitionAndSortWithinPartitions(partitioner)
}
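
Assigning the result and consuming it partition-wise then looks roughly like this (a sketch; process is a hypothetical stand-in for the actual downstream logic):

val studSorted: RDD[(Student, Null)] = {
  implicit val ord = arrTimeOrdering
  rddStud.map((_, null)).repartitionAndSortWithinPartitions(partitioner)
}

studSorted.foreachPartition { iter =>
  iter.foreach { case (student, _) =>
    // Records with the same id are co-located and already ordered by
    // arrivalTime, so they can be handed off sequentially.
    process(student.name, student.arrivalTime) // hypothetical downstream call
  }
}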
  • SO suggests more discussion to be shifted here https://chat.stackoverflow.com/rooms/115889/discussion-between-k-p-and-zero323 – K P Jun 28 '16 at 20:32
  • Thanks for the clarifying edit ... also, probably worth a separate question, we are looking at a foreach action that's only running on one process, it updates an accumulable with a customized "add," we need it to run on all processors, is that what foreachAsync does? Any reason to use foreachAsyncPartition? Let me know if separate question is desirable thanks @zero323 – JimLohse Jul 06 '16 at 19:07
  • @JimLohse Of course, thanks for pointing that out. I am not sure if I understand your use case though. What do you mean by _one process_? Single executor? – zero323 Jul 06 '16 at 19:12
  • When this action is running (the foreach compares every element of a RDD to an accumulator), it's only using one processor in htop, we want it to "fork-join" to use all the processors. – JimLohse Jul 06 '16 at 19:13
  • To clarify, I meant " only running on one processor" sorry ... We are limiting number of cores that Spark gets per executor to 1 for other reasons and then using our code to parallelize other processes, sorry, that's probably a key point! Given that limitation I supposed the async will still just use one processor. Using `export SPARK_WORKER_CORES=1` – JimLohse Jul 06 '16 at 19:15
  • Probably too complicated a question to put in comments, I will flesh this out and post a question later or tomorrow, thanks again @zero323, the best source of info about Spark anywhere! – JimLohse Jul 06 '16 at 19:18
  • @JimLohse LOL. Yes, separate question looks like a good idea. I am still not even sure if I understand the issue :) – zero323 Jul 06 '16 at 19:19
  • Yeah I am not explaining this well, question coming later thanks again – JimLohse Jul 06 '16 at 19:21