I am trying to build a minimal working example of repartitionAndSortWithinPartitions
in order to understand the function. I have got this far (not working; the distinct throws the values around so that they get out of order):
def partval(partID: Int, iter: Iterator[Int]): Iterator[(Int, Int)] = {
  // tag each value with the ID of the partition it lives in
  iter.map(x => (partID, x))
}
val part20to3_chaos = sc.parallelize(1 to 20, 3).distinct
val part20to2_sorted = part20to3_chaos.repartitionAndSortWithinPartitions(2)
part20to2_sorted.mapPartitionsWithIndex(partval).collect
but get the error:
Name: Compile Error
Message: <console>:22: error: value repartitionAndSortWithinPartitions is not a member of org.apache.spark.rdd.RDD[Int]
val part20to2_sorted = part20to3_chaos.repartitionAndSortWithinPartitions(2)
I tried using the scaladoc, but wasn't able to find which class provides repartitionAndSortWithinPartitions. (Btw: this scaladoc is not impressive: why is MapPartitionsRDD missing? How can I search for a method?)
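(I suspect the reason the scaladoc search fails is that the method is not defined on RDD[T] itself, but is added to pair RDDs by an implicit conversion, apparently to OrderedRDDFunctions in org.apache.spark.rdd, the same enrichment pattern that gives reduceByKey to PairRDDFunctions. A plain-Scala sketch of that pattern, with names of my own invention, not Spark's:)

```scala
// A wrapper class that only exists for collections of pairs
// whose key type has an Ordering (mirroring OrderedRDDFunctions).
class OrderedOps[K, V](self: List[(K, V)])(implicit ord: Ordering[K]) {
  def sortWithin: List[(K, V)] = self.sortBy(_._1)
}

// The implicit conversion: only List[(K, V)] with an ordered K qualifies,
// so the extra method is invisible on, say, List[Int].
implicit def toOrderedOps[K: Ordering, V](s: List[(K, V)]): OrderedOps[K, V] =
  new OrderedOps(s)

// compiles, because List((2, "b"), (1, "a")) is a list of pairs:
val pairs = List((2, "b"), (1, "a")).sortWithin
// List(1, 2).sortWithin  // would NOT compile: not a list of pairs
```

(If that is right, it would explain both compile errors: an RDD[Int] never picks up the method at all.)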
Realising I need a partitioner object, I next tried:
val rangePartitioner = new org.apache.spark.RangePartitioner(2, part20to3_chaos)
val part20to2_sorted = part20to3_chaos.repartitionAndSortWithinPartitions(rangePartitioner)
part20to2_sorted.mapPartitionsWithIndex(partval).collect
but got:
Name: Compile Error
Message: <console>:22: error: type mismatch;
found : org.apache.spark.rdd.RDD[Int]
required: org.apache.spark.rdd.RDD[_ <: Product2[?,?]]
Error occurred in an application involving default arguments.
val rangePartitioner = new org.apache.spark.RangePartitioner(2, part20to3_chaos)
How do I get this to compile? Could I get a working example, please?
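(My current guess at the fix, untested: key the RDD first with map(x => (x, x)), so that both RangePartitioner's Product2 bound and the pair-RDD requirement are satisfied. Below I simulate the expected behaviour in plain Scala, so the result is checkable without a cluster; the suspected Spark calls are in the comments.)

```scala
// Suspected Spark fix (sketch, not verified here):
//   val keyed  = part20to3_chaos.map(x => (x, x))                 // RDD[(Int, Int)]
//   val rp     = new org.apache.spark.RangePartitioner(2, keyed)  // Product2 bound now satisfied
//   val sorted = keyed.repartitionAndSortWithinPartitions(rp)
// The same idea in plain Scala:

// stand-in for what RangePartitioner(2, _) would roughly do on 1 to 20
def getPartition(key: Int): Int = if (key <= 10) 0 else 1

// shuffle simulates the order-scrambling effect of distinct
val chaos = scala.util.Random.shuffle((1 to 20).toList)

val sortedWithin: Map[Int, List[(Int, Int)]] =
  chaos.map(x => (x, x))                               // key the data, as Spark needs
       .groupBy { case (k, _) => getPartition(k) }     // "repartition..."
       .map { case (p, kvs) => (p, kvs.sortBy(_._1)) } // "...AndSortWithinPartitions"
```

(Each partition ends up internally sorted, which is what I would hope partval prints.)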