Order by value in spark pair RDD after join

Question

I have two pair RDDs that I joined on the same key, and now I want to sort the result by one of the values. The type of the joined RDD is: RDD[((String, Int), Iterable[((String, DateTime, Int,Int), (String, DateTime, String, String))])]

where the first part is the pair RDD key and the Iterable holds the values from the two RDDs I joined. I now want to order them by the time field (the DateTime) of the second RDD. I tried to use the sortBy function but got errors.

Any ideas?

Thanks

Improve your question to get a quick and good answer. – Kumar Apr 14 '15 at 07:15
Show your code, and the errors. – The Archetypal Paul Apr 14 '15 at 08:44

score 0 · Answer 1 · answered Apr 14 '15 at 07:35

Spark pair RDDs have a mapValues method. I think it will help you.

    def mapValues[U](f: (V) ⇒ U): RDD[(K, U)]

Pass each value in the key-value pair RDD through a map function without changing the keys; this also retains the original RDD's partitioning.

The Spark documentation has more details.
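To make the idea concrete, here is a minimal sketch of what a mapValues-style step does, written against plain Scala collections so it runs without Spark. The names `mapValuesLocal` and `sortEachGroup` are hypothetical, the values are simplified to (String, timestamp) pairs, and `java.time.LocalDateTime` stands in for the question's Joda `DateTime`; all of that is assumption, not the asker's code.

```scala
import java.time.LocalDateTime

object MapValuesSketch {
  // Local stand-in for RDD.mapValues (hypothetical helper):
  // transform each value and leave the key untouched.
  def mapValuesLocal[K, V, U](pairs: Seq[(K, V)])(f: V => U): Seq[(K, U)] =
    pairs.map { case (k, v) => (k, f(v)) }

  // Sort the events inside every group by timestamp.
  // Keys and the order of the groups themselves are unchanged.
  def sortEachGroup[K](
      pairs: Seq[(K, Seq[(String, LocalDateTime)])]
  ): Seq[(K, Seq[(String, LocalDateTime)])] =
    mapValuesLocal(pairs)(events => events.sortWith((a, b) => a._2.isBefore(b._2)))
}
```

Note that this only orders the items inside each value; it does not reorder the records of the RDD itself.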

Karthik

Not sure I understand how it helps me. Can you explain more? Thanks – Userrrrrrrr Apr 14 '15 at 07:43

score 0 · Answer 2 · edited May 23 '17 at 12:14

You're right that you can use the sortBy function:

val yourRdd: RDD[((String, Int), Iterable[((String, DateTime, Int,Int), (String, DateTime, String, String))])] = ...(your cogroup operation here)

// Sort the records by the DateTime of the second tuple of the first item
val result = yourRdd.sortBy({
  case ((str, i), iter) if iter.nonEmpty => iter.head._2._2
  }, true)

iter.head has type ((String, DateTime, Int,Int), (String, DateTime, String, String));

iter.head._2 has type (String, DateTime, String, String) and

iter.head._2._2 indeed has type DateTime.

You may also need to provide an implicit Ordering for DateTime. By the way, can the iterable be empty? If so, you should handle that case in the function passed to sortBy. And if the iterable holds many items, which one should be used for sorting?
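One possible way to answer those open questions is sketched below: use the earliest timestamp of the second tuple as the sort key, and wrap it in an Option so an empty group sorts first instead of failing. This is an illustration on plain Scala collections, not the asker's setup: `SortKeySketch`, `sortKey` and `sortGroups` are hypothetical names, a `Seq` stands in for the RDD, and `java.time.LocalDateTime` stands in for Joda's `DateTime` so that no extra dependency is needed.

```scala
import java.time.LocalDateTime

object SortKeySketch {
  // Stand-ins for the two joined row types from the question.
  type Left  = (String, LocalDateTime, Int, Int)
  type Right = (String, LocalDateTime, String, String)

  // Explicit ordering for the timestamp type, built on its own compareTo.
  val timeOrdering: Ordering[LocalDateTime] = new Ordering[LocalDateTime] {
    def compare(a: LocalDateTime, b: LocalDateTime): Int = a.compareTo(b)
  }

  // Ordering for the Option-wrapped key; None sorts before any Some.
  val keyOrdering: Ordering[Option[LocalDateTime]] = Ordering.Option(timeOrdering)

  // Sort key for one record: the earliest timestamp on the right-hand side,
  // or None when the group is empty.
  def sortKey(group: Iterable[(Left, Right)]): Option[LocalDateTime] =
    if (group.isEmpty) None else Some(group.map(_._2._2).min(timeOrdering))

  // Local analogue of sorting the joined records by that key.
  def sortGroups[K](
      rows: Seq[(K, Iterable[(Left, Right)])]
  ): Seq[(K, Iterable[(Left, Right)])] =
    rows.sortBy { case (_, group) => sortKey(group) }(keyOrdering)
}
```

With an RDD, the same key function could presumably be passed to sortBy with `keyOrdering` available as the implicit Ordering. Whether the earliest timestamp is the right representative for a group depends on what the sort is meant to express.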

Community

answered Apr 14 '15 at 07:51

Nikita

Thanks @ipoteka, I still get errors. This is the code I'm using: `val mappedDF = firstRDD.join(secondRDD).groupByKey()` followed by `val res = mappedDF.sortBy( {case ((str, i), iter) if iter.nonEmpty => iter(0)._2._2} , true);`. The error reported on `iter(0)` is: Iterable[((String, DateTime, Int,Int), (String, DateTime, String, String))] does not take parameters. Error occurred in an application involving default arguments. – Userrrrrrrr Apr 14 '15 at 08:10
Ah, sorry. It indeed may not have this method. But it must support the `head` operation: http://www.scala-lang.org/api/2.11.4/index.html#scala.collection.Iterable So I edited my answer to be accurate. – Nikita Apr 14 '15 at 08:16

score 0 · Answer 3 · answered Apr 14 '15 at 10:19

If the RDD's Iterable needs to be sorted:

val rdd: RDD[((String, Int), 
             Iterable[((String, DateTime, Int,Int), 
                       (String, DateTime, String, String))])] = ???

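// compareTo returns 0 for equal instants, as the Ordering contract requires
val dateOrdering = new Ordering[org.joda.time.DateTime]{ 
    override def compare(a: org.joda.time.DateTime,
                         b: org.joda.time.DateTime): Int = 
        a.compareTo(b)
}

// Sort the items inside each value by the DateTime of the second tuple
rdd.mapValues(v => v.toArray
                    .sortBy(x => x._2._2)(dateOrdering))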

score 0 · Answer 4 · edited Nov 09 '19 at 10:33

Using Python:

sortedRDD = unsortedRDD.sortBy(lambda x:x[1][1], False)

This sorts in descending order, because the second argument (ascending) is False.

Reza Ghorbani

answered Nov 09 '19 at 07:11

Vardhaman Jain
