2

I have a data set like this, which I read from a CSV file and convert into an RDD using Scala.

+--------+------+---------+
| recent | Freq | Monitor |
+--------+------+---------+
|      1 | 1234 |  199090 |
|      4 | 2553 |  198613 |
|      6 | 3232 |  199090 |
|      1 | 8823 |  498831 |
|      7 | 2902 |  890000 |
|      8 | 7991 |  081097 |
|      9 | 7391 |  432370 |
|     12 | 6138 |  864981 |
|      7 | 6812 |  749821 |
+--------+------+---------+

How do I sort the data on all columns?

Thanks

justAbit
Niranjanp
  • Which columns do you want to sort on first? You can't sort on all the columns at once, you have to sort on one column first, THEN sort by the next column, etc. We need more information. – Katya Willard Apr 19 '16 at 11:51
  • 3
    Possible duplicate of [Sorting by multiple fields in Apache Spark](http://stackoverflow.com/questions/34379516/sorting-by-multiple-fields-in-apache-spark) – Tzach Zohar Apr 19 '16 at 11:55
  • Sorry, I am new to Spark and Scala. Actually, I want the first column to be sorted in descending order, and then I need to sort the next two columns in ascending order. I need to assign a rank as well. – Niranjanp Apr 19 '16 at 12:09
  • Have you already tried something? Could you post some code you wrote and the result you achieved? – Basile Perrenoud Apr 19 '16 at 12:17
  • I tried to convert the data into key-value pairs and then used the sortByKey() method, but I couldn't get the output. – Niranjanp Apr 19 '16 at 12:24
  • val csv = sc.textFile("ranked_data.csv"); /* create key-value pair */ val pairs = csv.map(x => (x.split(",")(0), x.split(",")(1), x.split(",")(2))); val res = pairs.sortByKey() – Niranjanp Apr 19 '16 at 12:24 (a cleaned-up version of this attempt is sketched after these comments)
  • It would be helpful if you could give me a Spark Scala example similar to my problem. – Niranjanp Apr 19 '16 at 12:27
  • Also a duplicate of this: http://stackoverflow.com/questions/36393224/spark-sort-an-rdd-by-multiple-values-in-a-tuple-columns – The Archetypal Paul Apr 19 '16 at 12:33
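For what it's worth, here is a cleaned-up sketch of the attempt from the comments above. It is only an illustration, under assumptions not stated in the question: ranked_data.csv has no header row, its three columns are numeric, and sc is an existing SparkContext (as in spark-shell). sortByKey() is only defined on an RDD of 2-tuples (key, value) and sorts by the key alone, ascending by default, which is why the 3-tuple version above did not work; a composite key, with recent negated for descending order, gets around both issues:

val csv = sc.textFile("ranked_data.csv")

// Build (key, value) pairs: the key is (-recent, Freq, Monitor), so sorting the key
// ascending yields recent descending and the other two columns ascending.
val pairs = csv.map { line =>
  val cols = line.split(",")
  ((-cols(0).trim.toInt, cols(1).trim.toInt, cols(2).trim.toInt), line)
}

val res = pairs.sortByKey()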

2 Answers

8

Suppose your input DataFrame is called df.

To sort recent in descending order, and Freq and Monitor both in ascending order, you can do:

import org.apache.spark.sql.functions._

val sorted = df.sort(desc("recent"), asc("Freq"), asc("Monitor"))

You can use df.orderBy(...) as well; it is an alias of sort().
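For completeness, here is a fuller end-to-end sketch under assumptions that go beyond this answer: Spark 2.x (a question from that era may have used a SQLContext instead), a header row in the CSV, and the hypothetical file name from the comments. It also adds the rank the asker mentioned, using a window function:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("SortExample").master("local[*]").getOrCreate()

// Hypothetical file name taken from the comments; adjust the path and options to the real file.
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("ranked_data.csv")

// recent descending, Freq and Monitor ascending.
val sorted = df.sort(desc("recent"), asc("Freq"), asc("Monitor"))

// One way to attach a rank with the same ordering. Note that a window without
// partitionBy pulls all rows into a single partition, which is fine for small data.
val w = Window.orderBy(desc("recent"), asc("Freq"), asc("Monitor"))
val ranked = sorted.withColumn("rank", rank().over(w))

ranked.show()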

Shaido
Steve
1

csv.sortBy(r => (r.recent, r.freq)) or equivalent should do it.
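As a minimal sketch of that approach, again under assumptions not in this answer (the file name from the comments, no header row, and plain tuples rather than the case class that r.recent / r.freq would require): negating the first field sorts recent descending while the other columns stay ascending, and zipWithIndex gives a simple rank after the sort.

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("RddSortExample").setMaster("local[*]"))

// Parse the three comma-separated integer columns: recent, Freq, Monitor.
val rows = sc.textFile("ranked_data.csv")
  .map(_.split(","))
  .map(a => (a(0).trim.toInt, a(1).trim.toInt, a(2).trim.toInt))

// Composite sort key: -recent gives descending order; Freq and Monitor stay ascending.
val sortedRows = rows.sortBy { case (recent, freq, monitor) => (-recent, freq, monitor) }

// A simple 1-based rank derived from the sorted position.
val ranked = sortedRows.zipWithIndex().map { case (row, idx) => (idx + 1, row) }

ranked.collect().foreach(println)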

Zahiro Mor