I have a pairRDD with tuples in the following form:

[(1,"b1","c1","d1","e1"), (2,"b2","c2","d2","e2"), ...

What I want is to transform the above into a key-value pair RDD, where the first field is the key and the remaining fields form a list of strings (the value), i.e. I want to turn it into the form:

[(1,["b1","c1","d1","e1"]), (2,["b2","c2","d2","e2"]), ...

After this, is it then possible to access any field that I want?

For example, can I access the tuple (1,["b1","c1","d1","e1"]), then extract only the field d1?

hammad

2 Answers

If you have an RDD of tuples, however the tuples are represented, you can use mapToPair to transform it into a PairRDD with whatever key and value you prefer.

In Java 8 this could be

JavaPairRDD<Integer,List<String>> r = 
  rddOfTuples.mapToPair((t)->new Tuple2<>(
      extractKey(t),
      extractTuples(t)
  ));

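Note that mapToPair is a narrow transformation and does not itself trigger a shuffle; a shuffle only happens in a later by-key operation such as groupByKey or reduceByKey.

To state the obvious, extractKey and extractTuples are methods you would implement yourself to extract the parts of the original tuple as needed.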

With my limited knowledge of Scala tuples, and assuming the input is something like scala.Tuple5<Integer,String,String,String,String>, this could be:

JavaPairRDD<Integer,List<String>> r = 
  rddOfTuples.mapToPair((t)->new Tuple2<>(
      t._1(),
      Arrays.asList(t._2(),t._3(),t._4(),t._5())  // a Tuple5 ends at _5
  ));
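
On the follow-up question of reading a single field such as d1 back out: once the value is a List, any field is reachable by index. Below is a minimal sketch of that per-record logic that runs without Spark. It assumes five-field records as in the question; Map.Entry is only a stand-in for scala.Tuple2, and the class and method names are hypothetical.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

class KeyedRecordSketch {

    // Hypothetical helper: split a five-field record into (key, remaining fields).
    // Map.Entry is a stand-in for scala.Tuple2 so the sketch runs without Spark.
    static Map.Entry<Integer, List<String>> toPair(
            int key, String b, String c, String d, String e) {
        return new SimpleImmutableEntry<>(key, Arrays.asList(b, c, d, e));
    }

    public static void main(String[] args) {
        Map.Entry<Integer, List<String>> pair = toPair(1, "b1", "c1", "d1", "e1");
        // The value is an ordinary List, so any field is reachable by index:
        // index 0 is "b1", index 2 is "d1".
        String d = pair.getValue().get(2);
        System.out.println(pair.getKey() + " -> " + d); // prints "1 -> d1"
    }
}
```

With a real Tuple2 from a JavaPairRDD, the corresponding access would be pair._2().get(2).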

If, however, you do not know the arity (number of elements) of your tuple beforehand, then in Scala terms it is a Product. To access its elements dynamically, you will need to use the Product interface, which offers a choice of:

  • int productArity()
  • Object productElement(int n)
  • Iterator<Object> productIterator()

Then it becomes a regular Java exercise:

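JavaPairRDD<Integer,List<String>> r = 
  rddOfTuples.mapToPair((t)->{
    List<String> l = new ArrayList<>(t.productArity()-1);
    // element 0 is the key; the remaining elements become the value list
    for (int i = 1; i < t.productArity(); i++) {
      l.add((String) t.productElement(i));  // productElement returns Object
    }
    return new Tuple2<>((Integer) t.productElement(0),l);
  });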

I hope I have it all right ... this code above is untested/uncompiled ... So if you can get it to work with corrections, then feel free to apply the corrections in this answer ...
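
For what it is worth, the list-building step for an unknown number of fields can be tried in isolation. The following is a rough sketch only: the Object[] is an assumed stand-in for the elements a Product hands out through productElement(i), and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class DynamicAritySketch {

    // Hypothetical helper: collect every field after the first into a list of strings.
    // The Object[] stands in for the elements of a scala.Product; nothing here
    // depends on Spark or Scala.
    static List<String> tailAsStrings(Object[] fields) {
        List<String> values = new ArrayList<>(Math.max(0, fields.length - 1));
        for (int i = 1; i < fields.length; i++) {
            // add() appends to the list; set() would throw on an empty list.
            values.add(String.valueOf(fields[i]));
        }
        return values;
    }

    public static void main(String[] args) {
        Object[] record = {1, "b1", "c1", "d1", "e1", "f1", "g1"};
        System.out.println(record[0] + " -> " + tailAsStrings(record));
        // prints "1 -> [b1, c1, d1, e1, f1, g1]"
    }
}
```

The same loop works whether the record has five fields or eight, since it only depends on the length.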

YoYo

You could try using a map function, e.g. in Scala:

rdd.map { case (k,v1,v2,v3,v4) => (k,(v1,v2,v3,v4)) }

rdd.groupBy could also be used, but this could be inefficient on large data sets.

PJ Fanning
  • how can i do it in java – hammad Jul 10 '16 at 20:35
  • I'm not as familiar with using Spark in Java but this code might help: http://www.programcreek.com/java-api-examples/index.php?source_dir=oryx-master/app/oryx-app-mllib/src/main/java/com/cloudera/oryx/app/batch/mllib/als/Evaluation.java - you can create a Function that takes Tuple5 as input and returns a Tuple2> – PJ Fanning Jul 10 '16 at 20:43
  • Note that mapping it that way still keeps it as a non-keyed RDD; it will just be a new RDD of tuples. – YoYo Jul 10 '16 at 21:30
  • but what if v1 .. v7? that's the more important issue – thebluephantom Feb 12 '18 at 20:41