I have two collections: a list named 'list' and a pair RDD named 'pairRdd2', which contain the data shown below.

I am trying to filter with containsAll so that only the keys whose grouped values in pairRdd2 include every value in list are kept. The expected result is joe,{US,UK}.

    List<String> list = Arrays.asList("US","UK");

    JavaRDD pairRdd = ctx.parallelize(Arrays.asList(new Tuple2("john","US"),new Tuple2("john","UAE"),new Tuple2("joe","US"),new Tuple2("joe","UK")));

    JavaPairRDD<String, String> pairRdd2 = JavaPairRDD.fromJavaRDD(pairRdd);

    pairRdd2.groupByKey().filter(x-> Arrays.asList(x._2).containsAll(list)).foreach(new VoidFunction<Tuple2<String,Iterable<String>>>() {

        @Override
        public void call(Tuple2<String, Iterable<String>> t) throws Exception {
            System.out.println(t._1());
        }
    });

Can someone point out what I am doing wrong?

Banana
Jack

1 Answer

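The problem is with Arrays.asList(x._2). Because x._2 is an Iterable<String>, this call creates a single-element list that holds the Iterable itself (a List<Iterable<String>>), not a list of its strings, so containsAll(list) cannot match the strings in list. Build a collection from the grouped values returned by groupByKey instead: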

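    pairRdd2.groupByKey().filter(f -> {
        // Copy the grouped values for this key into a set
        Set<String> set = new HashSet<>();
        for (String s : f._2())
            set.add(s);

        // Keep the key only if its values include every entry in list
        return set.containsAll(list);
    });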
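To see why the original filter matched nothing, here is a minimal sketch in plain Java with no Spark involved. The class and method names (AsListPitfall, wrappedSize, wrappedContainsAll) are made up for illustration, and an ordinary list stands in for the Iterable that groupByKey would hand to the filter.

```java
import java.util.Arrays;
import java.util.List;

final class AsListPitfall {
    // Arrays.asList receives the Iterable as one argument, so the result
    // is a one-element list holding the Iterable, not its strings.
    static int wrappedSize(Iterable<String> values) {
        return Arrays.asList(values).size();
    }

    // Mirrors the filter in the question: compares strings against the
    // wrapped Iterable object, which never equals a String.
    static boolean wrappedContainsAll(Iterable<String> values, List<String> required) {
        List<Iterable<String>> wrapped = Arrays.asList(values);
        return wrapped.containsAll(required);
    }

    public static void main(String[] args) {
        List<String> list = Arrays.asList("US", "UK");
        // Stand-in for joe's grouped values (assumption: plain list, not Spark)
        Iterable<String> joe = Arrays.asList("US", "UK");
        System.out.println(wrappedSize(joe));              // prints 1
        System.out.println(wrappedContainsAll(joe, list)); // prints false
    }
}
```

Even for joe, whose values are exactly US and UK, the check is false because the only element of the wrapped list is the Iterable object.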

You may also find a quick way to convert an iterable/iterator to a collection and avoid the loop altogether.
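One possible way to avoid the loop, sketched with the standard java.util.stream API: StreamSupport.stream turns the Iterable's spliterator into a stream, which is then collected into a set. The class and helper names (IterableToSet, toSet, containsAllRequired) are hypothetical and not part of Spark.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.StreamSupport;

final class IterableToSet {
    // Collect any Iterable into a Set without an explicit loop.
    static <T> Set<T> toSet(Iterable<T> values) {
        return StreamSupport.stream(values.spliterator(), false)
                .collect(Collectors.toSet());
    }

    // True when the grouped values include every required entry.
    static boolean containsAllRequired(Iterable<String> groupValues, List<String> required) {
        return toSet(groupValues).containsAll(required);
    }

    public static void main(String[] args) {
        List<String> list = Arrays.asList("US", "UK");
        System.out.println(containsAllRequired(Arrays.asList("US", "UK"), list));  // joe: true
        System.out.println(containsAllRequired(Arrays.asList("US", "UAE"), list)); // john: false
    }
}
```

Inside the Spark filter, the same helper could presumably be called as f -> IterableToSet.containsAllRequired(f._2(), list), assuming the helper class is available on the executors.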

ernest_k
  • Thanks Ernest for your valuable reply. I have one query: if I change containsAll to contains, it does not generate any results, although it should have returned some. – Jack Feb 20 '18 at 13:22
  • I know that contains takes a single object, but here I was trying to implement 'ANY' of the list and 'ALL' of the list checks using contains. – Jack Feb 20 '18 at 13:30
  • Not sure I understand your comment, but if you notice, the collection used in call to containsAll is a set created from the group by values. If you want to check that list contains any of the values rather than all, then it's just the last statement that must change. But it had to change from that containsAll that used a list of iterables. – ernest_k Feb 20 '18 at 14:05
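
The 'ANY' versus 'ALL' distinction raised in the comments can be sketched as follows. This is only an illustration in plain Java: the class and method names (GroupMatch, matchesAll, matchesAny) are hypothetical, and Collections.disjoint is one option among several for the 'ANY' case. Note that contains(list) asks whether the list object itself is an element of the collection, which is why it returns nothing useful here.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class GroupMatch {
    // Copy the grouped values into a set, as in the answer's loop.
    static Set<String> toSet(Iterable<String> values) {
        Set<String> set = new HashSet<>();
        for (String s : values) {
            set.add(s);
        }
        return set;
    }

    // 'ALL': every required value appears among the group's values.
    static boolean matchesAll(Iterable<String> values, List<String> required) {
        return toSet(values).containsAll(required);
    }

    // 'ANY': at least one required value appears among the group's values.
    static boolean matchesAny(Iterable<String> values, List<String> required) {
        return !Collections.disjoint(toSet(values), required);
    }

    public static void main(String[] args) {
        List<String> list = Arrays.asList("US", "UK");
        List<String> john = Arrays.asList("US", "UAE");
        System.out.println(matchesAll(john, list)); // false: UK is missing
        System.out.println(matchesAny(john, list)); // true: US is present
    }
}
```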