How to use Dataset to group by, but entire rows

Question

After reading this post, I wonder how we can group a Dataset by one column while keeping multiple columns from each row.

Like:

val test = Seq(("New York", "Jack", "jdhj"),
    ("Los Angeles", "Tom", "ff"),
    ("Chicago", "David", "ff"),
    ("Houston", "John", "dd"),
    ("Detroit", "Michael", "fff"),
    ("Chicago", "Andrew", "ddd"),
    ("Detroit", "Peter", "dd"),
    ("Detroit", "George", "dkdjkd")
  )

I would like to get

Chicago, [( "David", "ff"), ("Andrew", "ddd")]

score 1 · Accepted Answer · answered Mar 06 '18 at 15:25

Create a case class as below

case class TestData (location: String, name: String, value: String)

Dummy Data

import spark.implicits._ // needed for .toDS() outside spark-shell; assumes a SparkSession named spark

val test = Seq(("New York", "Jack", "jdhj"),
    ("Los Angeles", "Tom", "ff"),
    ("Chicago", "David", "ff"),
    ("Houston", "John", "dd"),
    ("Detroit", "Michael", "fff"),
    ("Chicago", "Andrew", "ddd"),
    ("Detroit", "Peter", "dd"),
    ("Detroit", "George", "dkdjkd")
  )
    // convert each tuple to a TestData object
    .map(x => TestData(x._1, x._2, x._3))
    .toDS() // create a Dataset from the data above

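Group by location and collect the remaining columns to get the output you require

import org.apache.spark.sql.functions.{collect_list, struct}

test.groupBy($"location")
    .agg(collect_list(struct("name", "value")).as("data"))
    .show(false)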

Output:

+-----------+--------------------------------------------+
|location   |data                                        |
+-----------+--------------------------------------------+
|Los Angeles|[[Tom,ff]]                                  |
|Detroit    |[[Michael,fff], [Peter,dd], [George,dkdjkd]]|
|Chicago    |[[David,ff], [Andrew,ddd]]                  |
|Houston    |[[John,dd]]                                 |
|New York   |[[Jack,jdhj]]                               |
+-----------+--------------------------------------------+
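
Since the question asks about keeping entire rows, the typed Dataset API may also be worth a look: groupByKey followed by mapGroups passes each group's TestData objects to a function. The following is only a sketch, assuming Spark 2.x or later on the classpath. The object name GroupWholeRows, the helper groupRows, the local SparkSession, and the shortened sample data are illustrative assumptions, not part of the answer above.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Same shape as the case class in the answer above.
case class TestData(location: String, name: String, value: String)

object GroupWholeRows {
  // Hypothetical helper: groups typed rows by location and keeps (name, value) from every row.
  def groupRows(spark: SparkSession, ds: Dataset[TestData]): Dataset[(String, Seq[(String, String)])] = {
    import spark.implicits._ // encoders for the key and the result tuples
    ds.groupByKey(_.location).mapGroups { (location: String, rows: Iterator[TestData]) =>
      // rows iterates over the complete TestData objects of this group
      val pairs: Seq[(String, String)] = rows.map(r => (r.name, r.value)).toList
      (location, pairs)
    }
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session for illustration; in spark-shell `spark` already exists.
    val spark = SparkSession.builder().master("local[*]").appName("group-whole-rows").getOrCreate()
    import spark.implicits._

    val ds = Seq(
      TestData("Chicago", "David", "ff"),
      TestData("Houston", "John", "dd"),
      TestData("Chicago", "Andrew", "ddd")
    ).toDS()

    groupRows(spark, ds).show(false)
    spark.stop()
  }
}
```

Unlike collect_list, the function passed to mapGroups receives whole objects, so any field of the row is available without naming columns as strings.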

Ramesh Maharjan · Answer 2 · 2018-03-06T15:39:56.013

I suggested a case class approach in the post you linked in the question. Here's a different approach.

RDD way

You can simply do the following

val rdd = sc.parallelize(test)               // create an RDD from test (the Seq of tuples in the question)
val resultRdd = rdd.groupBy(x => x._1)       // group by the first element
  .mapValues(x => x.map(y => (y._2, y._3)))  // keep the second and third elements of each grouped row

resultRdd.foreach(println) should give you

(New York,List((Jack,jdhj)))
(Houston,List((John,dd)))
(Chicago,List((David,ff), (Andrew,ddd)))
(Detroit,List((Michael,fff), (Peter,dd), (George,dkdjkd)))
(Los Angeles,List((Tom,ff)))
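
The grouping logic above can be checked on plain Scala collections before running it on a cluster, because groupBy on a Seq has the same shape as groupBy on an RDD. A small sketch with no Spark dependency; the object name LocalGroupCheck is illustrative and the data is a subset of the question's.

```scala
object LocalGroupCheck {
  // A subset of the tuples from the question.
  val test: Seq[(String, String, String)] = Seq(
    ("Chicago", "David", "ff"),
    ("Houston", "John", "dd"),
    ("Chicago", "Andrew", "ddd")
  )

  // Group by the first element, then keep only the second and third elements of each tuple.
  val result: Map[String, Seq[(String, String)]] =
    test.groupBy(_._1).map { case (city, rows) => city -> rows.map(r => (r._2, r._3)) }

  def main(args: Array[String]): Unit =
    result.foreach(println)
}
```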

Converting the RDD to a DataFrame

If you need the output in table format, convert each group to an array and call .toDF():

val df = resultRdd.map(x => (x._1, x._2.toArray)).toDF()

df.show(false) should give you

+-----------+--------------------------------------------+
|_1         |_2                                          |
+-----------+--------------------------------------------+
|New York   |[[Jack,jdhj]]                               |
|Houston    |[[John,dd]]                                 |
|Chicago    |[[David,ff], [Andrew,ddd]]                  |
|Detroit    |[[Michael,fff], [Peter,dd], [George,dkdjkd]]|
|Los Angeles|[[Tom,ff]]                                  |
+-----------+--------------------------------------------+
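
The default column names _1 and _2 can be replaced by passing names to toDF. A minimal sketch, assuming Spark 2.x or later; the object name NamedColumns, the helper toNamedDf, and the column names location and data are illustrative choices, not something the answer above prescribes.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object NamedColumns {
  // Hypothetical helper: groups the tuples by city and returns a DataFrame with named columns.
  def toNamedDf(spark: SparkSession, rows: Seq[(String, String, String)]): DataFrame = {
    import spark.implicits._ // provides .toDF on RDDs of tuples
    spark.sparkContext.parallelize(rows)
      .groupBy(_._1)                                // group by the first element
      .mapValues(_.map(r => (r._2, r._3)).toArray)  // keep (name, value) pairs as an array
      .toDF("location", "data")                     // name the columns instead of _1 and _2
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session for illustration; in spark-shell `spark` already exists.
    val spark = SparkSession.builder().master("local[*]").appName("named-columns").getOrCreate()
    val df = toNamedDf(spark, Seq(("Chicago", "David", "ff"), ("Chicago", "Andrew", "ddd")))
    df.show(false)
    spark.stop()
  }
}
```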
