1

I have two pair RDDs that I joined on the same key, and now I want to sort the result by one of the values. The type of the joined RDD is: RDD[((String, Int), Iterable[((String, DateTime, Int, Int), (String, DateTime, String, String))])]

The first part is the pair RDD key, and the Iterable holds the values from the two RDDs I joined. I now want to order them by the DateTime field of the second RDD. I tried to use the sortBy function, but I got errors.

Any ideas?

Thanks

Userrrrrrrr

4 Answers

0

Spark pair RDDs have a mapValues method. I think it will help you.

    def mapValues[U](f: (V) ⇒ U): RDD[(K, U)]

Pass each value in the key-value pair RDD through a map function without changing the keys; this also retains the original RDD's partitioning.

Spark Documentation has more details.

Karthik
0

You're right that you can use the sortBy function:

val yourRdd: RDD[((String, Int), Iterable[((String, DateTime, Int, Int), (String, DateTime, String, String))])] = ??? // your cogroup operation here

// Sort by the DateTime of the first element in each group.
val result = yourRdd.sortBy({
  case ((str, i), iter) if iter.nonEmpty => iter.head._2._2
}, ascending = true)

iter.head has type ((String, DateTime, Int, Int), (String, DateTime, String, String)),

iter.head._2 has type (String, DateTime, String, String), and

iter.head._2._2 has type DateTime.

You may also need to provide an implicit Ordering for DateTime. By the way, can the iterable be empty? If so, you should handle that case in the function passed to sortBy. And if the iterable contains many items, which one should be used for sorting?
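
As a rough illustration of both points, here is a minimal sketch that has not been run against Spark. It uses plain Scala collections in place of the RDD and java.time.Instant in place of Joda's DateTime; both substitutions, and the names `First`, `Second`, `Group`, `byInstant`, and `sortKey`, are assumptions made for the example. The sort key is the time field of the first element in each group, and an empty group yields None so that it sorts first instead of failing with a MatchError.

```scala
import java.time.Instant

object SortGroupsSketch {
  // Stand-ins for the value tuples of the two joined RDDs
  // (assumption: Instant replaces Joda's DateTime).
  type First  = (String, Instant, Int, Int)
  type Second = (String, Instant, String, String)
  type Group  = ((String, Int), Iterable[(First, Second)])

  // Explicit ordering for the time type; with Joda this role
  // would be played by an Ordering[DateTime].
  val byInstant: Ordering[Instant] =
    Ordering.fromLessThan[Instant](_.isBefore(_))

  // Time field of the second tuple of the first element,
  // or None when the group is empty.
  def sortKey(group: Group): Option[Instant] =
    group._2.headOption.map(_._2._2)

  val t1: Instant = Instant.parse("2015-04-14T08:10:00Z")
  val t2: Instant = Instant.parse("2015-04-14T08:16:00Z")

  val groups: Seq[Group] = Seq(
    (("b", 2), Seq((("x", t1, 1, 1), ("y", t2, "p", "q")))),
    (("c", 3), Seq.empty),
    (("a", 1), Seq((("x", t2, 1, 1), ("y", t1, "p", "q"))))
  )

  // Ordering.Option places None before any defined value.
  val sorted: Seq[Group] =
    groups.sortBy(sortKey)(Ordering.Option(byInstant))

  def main(args: Array[String]): Unit =
    sorted.foreach { case (key, _) => println(key) }
}
```

With an RDD, the same `sortKey` idea could be passed to sortBy, provided an Ordering for the key type is in implicit scope. Which element of a multi-item group should supply the key remains a choice for the asker; the sketch simply takes the head.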

Community
Nikita
  • Thanks @ipoteka, I still get errors. This is the code I'm using: `val mappedDF = firstRDD.join(secondRDD).groupByKey()` `val res = mappedDF.sortBy( {case ((str, i), iter) if iter.nonEmpty => iter(0)._2._2} , true);` On iter(0) the error is: Iterable[((String, DateTime, Int,Int), (String, DateTime, String, String))] does not take parameters Error occurred in an application involving default arguments. – Userrrrrrrr Apr 14 '15 at 08:10
  • Ah, sorry. Iterable indeed may not have this method, but it must support the `head` operation: http://www.scala-lang.org/api/2.11.4/index.html#scala.collection.Iterable So I edited my answer to be accurate. – Nikita Apr 14 '15 at 08:16
0

If the RDD's Iterable needs to be sorted:

val rdd: RDD[((String, Int), 
             Iterable[((String, DateTime, Int,Int), 
                       (String, DateTime, String, String))])] = ???

val dateOrdering = new Ordering[org.joda.time.DateTime]{ 
    // compareTo returns 0 for equal instants, which a consistent Ordering requires
    override def compare(a: org.joda.time.DateTime,
                         b: org.joda.time.DateTime): Int = 
        a.compareTo(b)
}

rdd.mapValues(v => v.toArray
                    .sortBy(x => x._2._2)(dateOrdering))
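
To see the effect without a cluster, the following is a small sketch under stated assumptions: a plain Map stands in for the pair RDD, java.time.Instant stands in for Joda's DateTime, and the names `Row`, `grouped`, `byInstant`, and `sortedGroups` are made up for the example. It reorders the rows inside each group by the time field of the second tuple and leaves the keys alone, which is the role mapValues plays above.

```scala
import java.time.Instant

object SortWithinGroupsSketch {
  // One joined row: values from the first RDD and from the second RDD
  // (assumption: Instant replaces Joda's DateTime).
  type Row = ((String, Instant, Int, Int), (String, Instant, String, String))

  val early: Instant = Instant.parse("2015-04-14T08:10:00Z")
  val late: Instant  = Instant.parse("2015-04-14T08:16:00Z")

  // One key with two joined rows, deliberately out of time order.
  val grouped: Map[(String, Int), Iterable[Row]] = Map(
    ("k", 1) -> Seq(
      (("a", early, 0, 0), ("second", late, "x", "y")),
      (("a", early, 0, 0), ("first", early, "x", "y"))
    )
  )

  val byInstant: Ordering[Instant] =
    Ordering.fromLessThan[Instant](_.isBefore(_))

  // Keys are left untouched; only the rows inside each group are reordered.
  val sortedGroups: Map[(String, Int), Seq[Row]] =
    grouped.map { case (key, rows) =>
      key -> rows.toSeq.sortBy(_._2._2)(byInstant)
    }

  def main(args: Array[String]): Unit =
    sortedGroups(("k", 1)).foreach(row => println(row._2._1))
}
```
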
Shyamendra Solanki
0

Using Python:

sortedRDD = unsortedRDD.sortBy(lambda x: x[1][1], ascending=False)

This sorts in descending order, because the ascending argument is False.

Reza Ghorbani