
I have a JavaPairRDD and need to group by the key and then sort it using a value inside the object MyObject.

Let's say MyObject is:

class MyObject {
    Integer order;
    String name;
}

Sample data:

1, {order:1, name:'Joseph'}
1, {order:2, name:'Tom'}
1, {order:3, name:'Luke'}
2, {order:1, name:'Alfred'}
2, {order:3, name:'Ana'}
2, {order:2, name:'Jessica'}
3, {order:3, name:'Will'}
3, {order:2, name:'Mariah'}
3, {order:1, name:'Monika'}

Expected result:

Partition 1:

1, {order:1, name:'Joseph'}
1, {order:2, name:'Tom'}
1, {order:3, name:'Luke'}

Partition 2:

2, {order:1, name:'Alfred'}
2, {order:2, name:'Jessica'}
2, {order:3, name:'Ana'}

Partition 3:

3, {order:1, name:'Monika'}
3, {order:2, name:'Mariah'}
3, {order:3, name:'Will'}

I'm using the key to partition the RDD and then using MyObject.order to sort the data inside the partition.

My goal is to take only the first k elements of each sorted partition and then reduce them to a value calculated from another MyObject attribute (AKA "the first N best of the group").

How can I do this?

Magno C
  • Here's a Scala implementation of what you want: https://stackoverflow.com/questions/33655467/get-topn-of-all-groups-after-group-by-using-spark-dataframe – Sohum Sachdev Sep 18 '17 at 06:17
  • I need it in Java and JavaPairRDD. You point me to Scala and DataFrame. – Magno C Sep 18 '17 at 12:11

1 Answer


You can use mapPartitionsToPair:

JavaPairRDD<Long, MyObject> sortedRDD = rdd.groupBy(/* the first number */)
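    .mapPartitionsToPair(it -> {
        // it iterates over the (key, MyObject) pairs of one partition
        List<Tuple2<Long, MyObject>> values = toArrayList(it);
        // sort the partition's pairs by MyObject.order
        Collections.sort(values, (x, y) -> x._2.order - y._2.order);

        return values.iterator();
    }, true);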

Two highlights:

  • toArrayList takes an Iterator and returns an ArrayList. You must implement it yourself.
  • It is important to pass true as the second argument of mapPartitionsToPair, because that preserves the partitioning.
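
For reference, here is a minimal plain-Java sketch of the pieces the highlights above leave open: a possible toArrayList helper, the comparator written as an anonymous class instead of a lambda, and a step that keeps only the first k elements after sorting. It uses no Spark types so that it can be run on its own. The class name PartitionSortSketch, the MyObject constructor, the BY_ORDER constant and the firstK method are assumptions made for this illustration, not part of the original code. The comparator here compares MyObject directly; inside mapPartitionsToPair it would compare the Tuple2 values instead (for example x._2.order against y._2.order).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;

public class PartitionSortSketch {

    // Stand-in for the question's MyObject (the constructor is added for the demo).
    static class MyObject {
        final Integer order;
        final String name;

        MyObject(Integer order, String name) {
            this.order = order;
            this.name = name;
        }
    }

    // Drains an Iterator into an ArrayList, matching the description of toArrayList.
    static <T> ArrayList<T> toArrayList(Iterator<T> it) {
        ArrayList<T> out = new ArrayList<T>();
        while (it.hasNext()) {
            out.add(it.next());
        }
        return out;
    }

    // Anonymous-class (pre-lambda) comparator that orders by MyObject.order.
    static final Comparator<MyObject> BY_ORDER = new Comparator<MyObject>() {
        @Override
        public int compare(MyObject a, MyObject b) {
            return a.order.compareTo(b.order);
        }
    };

    // Sorts the values by order and keeps at most the first k (assumes k >= 0).
    static List<MyObject> firstK(Iterator<MyObject> values, int k) {
        ArrayList<MyObject> sorted = toArrayList(values);
        Collections.sort(sorted, BY_ORDER);
        return new ArrayList<MyObject>(sorted.subList(0, Math.min(k, sorted.size())));
    }

    public static void main(String[] args) {
        // The three values of key 2 from the sample data, in their original order.
        List<MyObject> group2 = new ArrayList<MyObject>();
        group2.add(new MyObject(1, "Alfred"));
        group2.add(new MyObject(3, "Ana"));
        group2.add(new MyObject(2, "Jessica"));

        for (MyObject o : firstK(group2.iterator(), 2)) {
            System.out.println(o.order + " " + o.name);
        }
    }
}
```

Running main on the key 2 sample prints "1 Alfred" and then "2 Jessica". Note that a partition may hold more than one key, so in a real job the first-k step would have to be applied per key rather than per partition.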
T. Gawęda
  • Awesome! Seems promising... I'll test it later and give feedback soon. Thanks. – Magno C Sep 18 '17 at 12:13
  • @MagnoC No problem. Sorry if there are any mistakes in code, I don't have possibility to check it, however it should work :) – T. Gawęda Sep 18 '17 at 12:14
  • Can you show the explicit function version of the lambda? – Magno C Sep 18 '17 at 12:15
  • No problem about mistakes. I got the basics. If there are any, I'll post the corrections when I test it. – Magno C Sep 18 '17 at 12:16
  • @MagnoC You mean the anonymous class? – T. Gawęda Sep 18 '17 at 12:18
  • I mean convert `(x, y) -> x._2.order - y._2.order` to Java 7. I'm using Java 8 with lambdas but I'm not very comfortable with it... I can understand the code, but it would be better using explicit function signatures... Only if you have time. – Magno C Sep 18 '17 at 12:23
  • I think you gave me the professional simplest version of my last try: https://github.com/icemagno/spark/blob/bd977d88e21d97ea5133d125e66c022cff8cbe06/riographx/code/src/main/java/br/com/cmabreu/Step6.java#L66 – Magno C Sep 18 '17 at 12:30