I am learning how to use Spark and Scala, and I am trying to write a Scala Spark program that receives an input of string values such as:
12 13
13 14
13 12
15 16
16 17
17 16
I initially create my pair RDD with:
val myRdd = sc.textFile(args(0)).map(line => (line.split("\\s+")(0), line.split("\\s+")(1))).distinct()
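For context, this is roughly how the whole thing looks on my end, splitting each line only once into tokens (the object name PairDedup and the app name are just what I happened to call them):

import org.apache.spark.{SparkConf, SparkContext}

object PairDedup {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("PairDedup")
    val sc = new SparkContext(conf)

    // Split each line once and build the pair from the first two tokens.
    val myRdd = sc.textFile(args(0))
      .map { line =>
        val tokens = line.split("\\s+")
        (tokens(0), tokens(1))
      }
      .distinct()
  }
}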
Now this is where I am getting stuck. In the set of values there are instances like (12,13) and (13,12). In the context of the data these two are the same instance. Simply put, (a,b) = (b,a).
I need to create an RDD that has one or the other, but not both. So the result, once this is done, would look something like this:
12 13
13 14
15 16
16 17
The only way I can see to do it right now is to take each tuple and compare it with every other tuple in the RDD to check whether it is the same data, just swapped.
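Is something along these lines the right direction? The idea would be to reorder the values inside each tuple into a canonical form (smaller value first) before calling distinct, so that (13,12) and (12,13) collapse to the same pair. This is just a rough sketch of the idea, not tested:

// Put each pair into a canonical order so (a,b) and (b,a) become identical,
// then let distinct remove the duplicates. The comparison here is on strings,
// which should be fine as long as it is applied consistently to every pair.
val normalized = myRdd
  .map { case (a, b) => if (a <= b) (a, b) else (b, a) }
  .distinct()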