
I have a CSV file in HDFS at /hdfs/test.csv, and I would like to group the data below using Spark and Scala.

I want to group the A1...AN columns based on the A1 column, and the output should look like this:

All the rows should be grouped as below. Output:

    JACK, ABCD, ARRAY("0,1,0,1", "2,9,2,9")
    JACK, LMN,  ARRAY("0,1,0,3", "0,4,3,T")
    JACK, HBC,  ARRAY("1,T,5,21", "E7,4W,5,8")

Input:

    name    A1      A1  A2  A3 ... AN
    ----------------------------------
    JACK    ABCD    0   1   0   1
    JACK    LMN     0   1   0   3
    JACK    ABCD    2   9   2   9
    JACK    HBC     1   T   5   21
    JACK    LMN     0   4   3   T
    JACK    HBC     E7  4W  5   8


2 Answers


You can achieve this by combining the A columns into an array column.

    import org.apache.spark.sql.functions.{array, col, collect_set, concat_ws}

    // Build the A1..A250 column list, concatenate each row's values into a
    // single comma-separated string, then collect one string per row into a set.
    val aCols = 1.to(250).map(x => col(s"A$x"))
    val concatCol = concat_ws(",", array(aCols: _*))

    val groupedDf = df
      .withColumn("aConcat", concatCol)
      .groupBy("name", "A")
      .agg(collect_set("aConcat"))

If you're okay with duplicates, you can also use collect_list instead of collect_set.
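
For completeness, here is a minimal sketch of how `df` above might be produced from the file mentioned in the question, using collect_list as the duplicate-keeping variant. The header and delimiter options, the `spark` session name, and the column name `A` for the grouping category are assumptions; adjust them to the real layout of /hdfs/test.csv.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{array, col, collect_list, concat_ws}

    // Assumed session and load options; adjust to how /hdfs/test.csv is actually delimited.
    val spark = SparkSession.builder().appName("GroupAColumns").getOrCreate()
    val df = spark.read
      .option("header", "true")
      .csv("/hdfs/test.csv")

    // Same pipeline as above, but collect_list keeps duplicate rows.
    // Assumes the category column is named "A", as in the answers here.
    val aCols = 1.to(250).map(x => col(s"A$x"))
    val groupedDf = df
      .withColumn("aConcat", concat_ws(",", array(aCols: _*)))
      .groupBy("name", "A")
      .agg(collect_list("aConcat").alias("grouped"))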

ayplam

Your input has two different columns called A1. I will assume the groupBy category is called A, while the element to put in that final array is A1. If you load the data into a DataFrame, you can do this to achieve the output specified:

    import org.apache.spark.sql.functions.{collect_set, concat_ws}
    import spark.implicits._ // for the $"..." column syntax; assumes a SparkSession named spark

    val grouped = someDF
      .groupBy($"name", $"A")
      .agg(collect_set(concat_ws(",", $"A1", $"A2", $"A3", $"A4")).alias("grouped"))
Tanjin
  • But here the columns run from A1 to AN, which means this will be difficult for me if my columns go from A1 to A250 – tech questions Jun 24 '18 at 01:56
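
Regarding the comment above: the column list does not have to be written out by hand. A sketch combining this answer with the programmatic column construction from the first answer, assuming the value columns really are named A1 through A250:

    import org.apache.spark.sql.functions.{col, collect_set, concat_ws}

    // Build the A1..A250 columns programmatically instead of listing each one.
    val aCols = (1 to 250).map(n => col(s"A$n"))

    val grouped = someDF
      .groupBy(col("name"), col("A"))
      .agg(collect_set(concat_ws(",", aCols: _*)).alias("grouped"))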