2

I have a data set like this, which I read from a CSV file and convert into an RDD using Scala.

+--------+------+---------+
| recent | Freq | Monitor |
+--------+------+---------+
|      1 | 1234 |  199090 |
|      4 | 2553 |  198613 |
|      6 | 3232 |  199090 |
|      1 | 8823 |  498831 |
|      7 | 2902 |  890000 |
|      8 | 7991 |  081097 |
|      9 | 7391 |  432370 |
|     12 | 6138 |  864981 |
|      7 | 6812 |  749821 |
+--------+------+---------+

How do I sort the data on all columns?

Thanks

justAbit
Niranjanp
  • Which columns do you want to sort on first? You can't sort on all the columns at once, you have to sort on one column first, THEN sort by the next column, etc. We need more information. – Katya Willard Apr 19 '16 at 11:51
  • 3
    Possible duplicate of [Sorting by multiple fields in Apache Spark](http://stackoverflow.com/questions/34379516/sorting-by-multiple-fields-in-apache-spark) – Tzach Zohar Apr 19 '16 at 11:55
  • Sorry, I am new to Spark and Scala. Actually, I want the first column to be sorted in descending order, and then I need to sort the next two columns in ascending order. I need to assign a rank as well. – Niranjanp Apr 19 '16 at 12:09
  • Have you already tried something? Could you post some code you wrote and the result you achieved? – Basile Perrenoud Apr 19 '16 at 12:17
  • I tried to convert the data into key-value pairs and then used the sortByKey() method, but I couldn't get the output. – Niranjanp Apr 19 '16 at 12:24
  • val csv = sc.textFile("ranked_data.csv"); /* create key-value pair */ val pairs = csv.map(x => (x.split(",")(0), x.split(",")(1), x.split(",")(2))); val res = pairs.sortByKey() – Niranjanp Apr 19 '16 at 12:24 (a cleaned-up version of this attempt is sketched after these comments)
  • It would be helpful if you could give me a Spark Scala example similar to my problem. – Niranjanp Apr 19 '16 at 12:27
  • Also a duplicate of this: http://stackoverflow.com/questions/36393224/spark-sort-an-rdd-by-multiple-values-in-a-tuple-columns – The Archetypal Paul Apr 19 '16 at 12:33
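For what it's worth, here is a cleaned-up sketch of the attempt from the comments above. It is only an illustration, under assumptions not stated in the question: ranked_data.csv has no header row, its three columns are numeric, and sc is an existing SparkContext (as in spark-shell). sortByKey() is only defined on an RDD of 2-tuples (key, value) and sorts by the key alone, ascending by default, which is why the 3-tuple version above did not work; a composite key, with recent negated for descending order, gets around both issues:

val csv = sc.textFile("ranked_data.csv")

// Build (key, value) pairs: the key is (-recent, Freq, Monitor), so sorting the key
// ascending yields recent descending and the other two columns ascending.
val pairs = csv.map { line =>
  val cols = line.split(",")
  ((-cols(0).trim.toInt, cols(1).trim.toInt, cols(2).trim.toInt), line)
}

val res = pairs.sortByKey()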

2 Answers

8

Suppose your input DataFrame is called df.

To sort recent in descending order, and Freq and Monitor both in ascending order, you can do:

import org.apache.spark.sql.functions._

val sorted = df.sort(desc("recent"), asc("Freq"), asc("Monitor"))

You can use df.orderBy(...) as well; it is an alias of sort().
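For completeness, here is a fuller end-to-end sketch under assumptions that go beyond this answer: Spark 2.x (a question from that era may have used a SQLContext instead), a header row in the CSV, and the hypothetical file name from the comments. It also adds the rank the asker mentioned, using a window function:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("SortExample").master("local[*]").getOrCreate()

// Hypothetical file name taken from the comments; adjust the path and options to the real file.
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("ranked_data.csv")

// recent descending, Freq and Monitor ascending.
val sorted = df.sort(desc("recent"), asc("Freq"), asc("Monitor"))

// One way to attach a rank with the same ordering. Note that a window without
// partitionBy pulls all rows into a single partition, which is fine for small data.
val w = Window.orderBy(desc("recent"), asc("Freq"), asc("Monitor"))
val ranked = sorted.withColumn("rank", rank().over(w))

ranked.show()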

Shaido
Steve
1

csv.sortBy(r => (r.recent, r.freq)) or equivalent should do it.
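As a minimal sketch of that approach, again under assumptions not in this answer (the file name from the comments, no header row, and plain tuples rather than the case class that r.recent / r.freq would require): negating the first field sorts recent descending while the other columns stay ascending, and zipWithIndex gives a simple rank after the sort.

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("RddSortExample").setMaster("local[*]"))

// Parse the three comma-separated integer columns: recent, Freq, Monitor.
val rows = sc.textFile("ranked_data.csv")
  .map(_.split(","))
  .map(a => (a(0).trim.toInt, a(1).trim.toInt, a(2).trim.toInt))

// Composite sort key: -recent gives descending order; Freq and Monitor stay ascending.
val sortedRows = rows.sortBy { case (recent, freq, monitor) => (-recent, freq, monitor) }

// A simple 1-based rank derived from the sorted position.
val ranked = sortedRows.zipWithIndex().map { case (row, idx) => (idx + 1, row) }

ranked.collect().foreach(println)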

Zahiro Mor