
I have a CSV file in HDFS at /hdfs/test.csv, and I would like to group the data below using Spark and Scala.

I want to group the A1...AN columns based on the A1 column, and the output should look like this:

All the rows should be grouped as below. Output:

    JACK, ABCD, ARRAY("0,1,0,1", "2,9,2,9")
    JACK, LMN,  ARRAY("0,1,0,3", "0,4,3,T")
    JACK, HBC,  ARRAY("1,T,5,21", "E7,4W,5,8")

Input:

    name    A1      A1  A2  A3 ... AN
    ----------------------------------
    JACK    ABCD    0   1   0   1
    JACK    LMN     0   1   0   3
    JACK    ABCD    2   9   2   9
    JACK    HBC     1   T   5   21
    JACK    LMN     0   4   3   T
    JACK    HBC     E7  4W  5   8


2 Answers


You can achieve this by combining the A columns into an array column.

    import org.apache.spark.sql.functions.{array, col, collect_set, concat_ws}

    // Build the A1..A250 column list, concatenate each row's values into a
    // single comma-separated string, then collect one string per row into a set.
    val aCols = 1.to(250).map(x => col(s"A$x"))
    val concatCol = concat_ws(",", array(aCols: _*))

    val groupedDf = df
      .withColumn("aConcat", concatCol)
      .groupBy("name", "A")
      .agg(collect_set("aConcat"))

If you're okay with duplicates, you can also use collect_list instead of collect_set.
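
For completeness, here is a minimal sketch of how `df` above might be produced from the file mentioned in the question, using collect_list as the duplicate-keeping variant. The header and delimiter options, the `spark` session name, and the column name `A` for the grouping category are assumptions; adjust them to the real layout of /hdfs/test.csv.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{array, col, collect_list, concat_ws}

    // Assumed session and load options; adjust to how /hdfs/test.csv is actually delimited.
    val spark = SparkSession.builder().appName("GroupAColumns").getOrCreate()
    val df = spark.read
      .option("header", "true")
      .csv("/hdfs/test.csv")

    // Same pipeline as above, but collect_list keeps duplicate rows.
    // Assumes the category column is named "A", as in the answers here.
    val aCols = 1.to(250).map(x => col(s"A$x"))
    val groupedDf = df
      .withColumn("aConcat", concat_ws(",", array(aCols: _*)))
      .groupBy("name", "A")
      .agg(collect_list("aConcat").alias("grouped"))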

ayplam

Your input has two different columns called A1. I will assume the groupBy category is called A, while the element to put in that final array is A1. If you load the data into a DataFrame, you can do this to achieve the output specified:

    import org.apache.spark.sql.functions.{collect_set, concat_ws}
    import spark.implicits._ // for the $"..." column syntax; assumes a SparkSession named spark

    val grouped = someDF
      .groupBy($"name", $"A")
      .agg(collect_set(concat_ws(",", $"A1", $"A2", $"A3", $"A4")).alias("grouped"))
Tanjin
  • But here the columns run from A1 to AN, which means this will be difficult for me if my columns go from A1 to A250 – tech questions Jun 24 '18 at 01:56
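
Regarding the comment above: the column list does not have to be written out by hand. A sketch combining this answer with the programmatic column construction from the first answer, assuming the value columns really are named A1 through A250:

    import org.apache.spark.sql.functions.{col, collect_set, concat_ws}

    // Build the A1..A250 columns programmatically instead of listing each one.
    val aCols = (1 to 250).map(n => col(s"A$n"))

    val grouped = someDF
      .groupBy(col("name"), col("A"))
      .agg(collect_set(concat_ws(",", aCols: _*)).alias("grouped"))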