I have a Spark RDD whose entries I want to sort by more than one field. Say each entry is a tuple with 3 elements (name, phonenumber, timestamp). I want to sort the entries first by phonenumber and then by timestamp, without disturbing the phonenumber order (so timestamp only reorders entries that share the same phonenumber). Is there a Spark function to do this?

(I am using Spark 2.x with Scala)

Mnemosyne

2 Answers

5

To sort an RDD by multiple elements, you can use the sortBy function with a tuple as the sort key. Below is some sample code in Python; you can implement the same approach in other languages.

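tmp = [('a', 1), ('a', 2), ('1', 3), ('1', 4), ('2', 5)]

# Key is a tuple: sort by the first element, then by the second.
# The second argument (ascending=False) requests descending order.
sc.parallelize(tmp).sortBy(lambda x: (x[0], x[1]), False).collect()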
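As a rough local sketch of what a tuple key does (no Spark required), plain Python's sorted compares tuple keys element by element, which is the same comparison applied to the key function's result in sortBy. The sample records and the helper name sort_key are made up for illustration; they follow the (name, phonenumber, timestamp) layout from the question.

```python
# Hypothetical sample records: (name, phonenumber, timestamp)
records = [
    ("carol", "555-0100", "2017-03-10"),
    ("alice", "555-0199", "2015-03-12"),
    ("bob",   "555-0100", "2015-03-15"),
    ("dave",  "555-0199", "2015-03-09"),
]

def sort_key(rec):
    # Primary key: phonenumber; secondary key: timestamp.
    # The timestamp only decides the order among equal phonenumbers.
    return (rec[1], rec[2])

by_phone_then_time = sorted(records, key=sort_key)

for rec in by_phone_then_time:
    print(rec)
```

With an RDD, the same key function could be passed as rdd.sortBy(sort_key); leaving out the False argument gives ascending order.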

Regards,

Neeraj

philantrovert
Neeraj Bhadani
1

You can use the sortBy function on an RDD, as shown below:

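// Sample RDD of (name, phonenumber, timestamp) tuples
val df = spark.sparkContext.parallelize(Seq(
  ("a","1", "2017-03-10"),
  ("b","12", "2017-03-9"),
  ("b","123", "2015-03-12"),
  ("c","1234", "2015-03-15"),
  ("c","12345", "2015-03-12")
))//.toDF("name", "phonenumber", "timestamp")

// Sort by name, then timestamp (use x._2 to sort by phonenumber instead).
// ascending = false gives the descending order shown in the output;
// collect() brings the sorted rows to the driver so they print in order.
df.sortBy(x => (x._1, x._3), ascending = false).collect().foreach(println)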

Output:

(c,1234,2015-03-15)
(c,12345,2015-03-12)
(b,12,2017-03-9)
(b,123,2015-03-12)
(a,1,2017-03-10)
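One caveat about the sample data: the timestamps are strings, so they are compared character by character, and a value such as 2017-03-9 (day not zero-padded) can land in the wrong place. The sketch below illustrates this in plain Python, which compares strings the same way; the "%Y-%m-%d" format is an assumption about how the timestamps are written.

```python
from datetime import datetime

# Timestamps in the same loose format as the sample data above.
stamps = ["2017-03-10", "2017-03-9", "2015-03-12"]

# Plain string comparison: "2017-03-10" sorts before "2017-03-9"
# because the character '1' is less than '9'.
as_strings = sorted(stamps)

# Parsing each value first gives chronological order.
# "%Y-%m-%d" is an assumed format; adjust it to the real data.
as_dates = sorted(stamps, key=lambda s: datetime.strptime(s, "%Y-%m-%d"))

print(as_strings)
print(as_dates)
```

If the order matters chronologically, it is probably safer to parse the timestamp (or store it zero-padded) before using it as part of the sort key.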

If you have a DataFrame created with toDF("name", "phonenumber", "timestamp"), then you can simply do:

df.sort("name", "timestamp")

Hope this helps!

koiralo