I have a JavaPairRDD<String, List<Tuple2<Integer, Integer>>> named rddA. For example, after collecting rddA: [(word1,[(187,267), (224,311), (187,110)]), (word2,[(187,200), (10,90)])]. Here word1 is a key and its value is the list [(187,267), (224,311), (187,110)].

How can I define the corresponding JavaPairRDD<Integer, List<Integer>> to get the following output:

[(187, [267, 110, 200]), (224,[311]), (10,[90])]

So the resulting JavaPairRDD has three keys: 187, 224 and 10. For example, the key 187 has the list [267, 110, 200] as its value.

bib
  • Did you look at groupingBy? https://docs.oracle.com/javase/8/docs/api/java/util/stream/Collectors.html – Danny Fried Feb 04 '20 at 07:30
  • What you are asking for is not very clear. Could you describe the logic with words? And also tell us what you have tried? – Oli Feb 04 '20 at 10:52
  • @Oli please check – bib Feb 04 '20 at 15:46
  • Figuring out what you are trying to do with only one record is quite difficult. Can you provide an example with 2 or 3 records in your RDD? Ideally, can you explain the logic as well? – Oli Feb 04 '20 at 16:03
  • @Oli please check, I hope it is clearer now. Thanks for your cooperation – bib Feb 04 '20 at 16:27
  • Oh right, I understand what you want to do now – Oli Feb 04 '20 at 16:33
  • @Oli thank you, your solution works. But as I have read, we should not use groupByKey for large datasets, so how can I modify your solution to use reduceByKey, please? – bib Feb 04 '20 at 18:18

1 Answer

You simply need to flatten the list of tuples (second value of your tuple) and group by the first element of the tuple.

JavaPairRDD<Integer, List<Integer>> result = rddA
                .flatMapValues(x -> x) // flattening the list
                .mapToPair(x -> x._2) // getting rid of the first key
                .groupByKey()
                .mapValues(x -> { // turning the iterable into a list
                    List<Integer> list = new ArrayList<>();
                    x.forEach(list::add);
                    return list;
                });
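
To check the flatten-then-group logic without a Spark cluster, here is a minimal sketch in plain Java streams, run on the sample data from the question. The class name FlattenAndGroup, the helper methods pair and sample, and the use of Map.Entry in place of Spark's Tuple2 are illustrative assumptions, not part of the Spark API or of the code above.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical local stand-in for the Spark pipeline; not Spark code.
class FlattenAndGroup {

    // Helper (assumption): Map.Entry plays the role of Tuple2<Integer, Integer>.
    static Map.Entry<Integer, Integer> pair(int first, int second) {
        return new SimpleEntry<>(first, second);
    }

    // The sample records from the question, keyed by word.
    static Map<String, List<Map.Entry<Integer, Integer>>> sample() {
        Map<String, List<Map.Entry<Integer, Integer>>> data = new LinkedHashMap<>();
        data.put("word1", Arrays.asList(pair(187, 267), pair(224, 311), pair(187, 110)));
        data.put("word2", Arrays.asList(pair(187, 200), pair(10, 90)));
        return data;
    }

    static Map<Integer, List<Integer>> flattenAndGroup(
            Map<String, List<Map.Entry<Integer, Integer>>> data) {
        return data.values().stream()
                .flatMap(List::stream)               // flatten the lists, dropping the word key
                .collect(Collectors.groupingBy(
                        Map.Entry::getKey,           // group by the first element of each pair
                        LinkedHashMap::new,          // keep keys in first-seen order
                        Collectors.mapping(Map.Entry::getValue, Collectors.toList())));
    }

    public static void main(String[] args) {
        // prints {187=[267, 110, 200], 224=[311], 10=[90]}
        System.out.println(flattenAndGroup(sample()));
    }
}
```

The local version keeps values in encounter order because the stream is sequential. In Spark the grouping happens across partitions, so the order of values inside each list should not be relied on.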
Oli
  • How can I improve this solution by replacing groupByKey with reduceByKey? – bib Feb 04 '20 at 18:30
  • That would only improve the solution if the goal was to filter the resulting lists. If you want to keep everything, `reduceByKey` will not help. – Oli Feb 05 '20 at 09:41