
How do I roll up the data of one file on one column before joining it to another file in Spark? After the join there should be no repeated key in that column. Example: data of the first file:

name,country,marks,score
a,India,12,11
b,Australia,10,9
a,England,12,10
a,America,11,18
b,India,16,12
c,America,17,22

Data of the second file:

name2,City,ID
a,Delhi,we1
b,Bangalore,we2
a,Gurgaon,we1
a,Mumbai,we3
c,Delhi,we4

After rolling up the first file, it should be like:

name,country,marks,score
a,India England America,12 12 11,11 10 18
b,Australia India,10 16,9 12
c,America,17,22

After rolling up the second file, it should be like:

name2,City,ID
a,Delhi Gurgaon Mumbai,we1 we1 we3
b,Bangalore,we2
c,Delhi,we4

After rolling up both files, I want to do a left join, right join, and other types of joins in Spark.
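The roll-up itself is a groupBy on the key column followed by a per-column concatenation. Below is a sketch of that logic in plain Scala collections, which mirrors the pair-RDD version (an RDD's groupBy / mapValues behave the same way on each group), so the result shape can be checked locally without a cluster. The helper name rollUp and the space-separated join format are assumptions taken from the sample output above.

```scala
// Roll up CSV rows by the first column, concatenating each remaining
// column's values with spaces. On an RDD the equivalent is
// rdd.groupBy(_.split(",")(0)) followed by a mapValues doing the same merge.
def rollUp(rows: Seq[String]): Map[String, String] =
  rows
    .map(_.split(",").map(_.trim))
    .groupBy(_.head)                       // key = first field ("name")
    .map { case (key, group) =>
      val nCols = group.head.length
      // For each column index after the key, join that column's values.
      val merged = (1 until nCols).map(i => group.map(_(i)).mkString(" "))
      key -> (key +: merged).mkString(",")
    }

val file1 = Seq(
  "a,India,12,11", "b,Australia,10,9", "a,England,12,10",
  "a,America,11,18", "b,India,16,12", "c,America,17,22")

rollUp(file1)("a")  // "a,India England America,12 12 11,11 10 18"
```

Scala's groupBy preserves the original row order within each group, which is why the concatenated values come out in file order, matching the expected output.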

  • I was trying to use groupBy on the "name" variable, but the output is RDD[(String, Iterable[String])]. – N.Mittal Jan 07 '16 at 12:36
  • val line = sc.textFile("file1.csv")
    line: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD
    val withoutheader = dropheader(line)
    withoutheader: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD
    val group = withoutheader.groupBy(l => { val parts = l.split(','); parts(0) })
    group: org.apache.spark.rdd.RDD[(String, Iterable[String])] = ShuffledRDD[16] at groupBy
    Output after group.collect:
    Array[(String, Iterable[String])] = Array((b,CompactBuffer(b,Australia,10,9, b,India,16,12)), (a,CompactBuffer(a,India,12,11, a,England,12,10, a,America,11,18)), (c,CompactBuffer(c,America,17,22))) – N.Mittal Jan 07 '16 at 13:06
  • But the expected output after rolling up should be Array[(String, String)] = Array((b,(b,Australia India,10 16,9 12)), (a,(a,India England America,12 12 11,11 10 18)), (c,(c,America,17,22))) – N.Mittal Jan 07 '16 at 13:11
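Once both files are rolled up into (name, record) pairs, the joins themselves are standard pair operations: on RDDs these would be join, leftOuterJoin, and rightOuterJoin. The sketch below mirrors their semantics with plain Scala maps so the results are checkable locally; the rolled values are copied from the expected output above.

```scala
// Rolled-up records keyed by name, as produced from each file.
val rolled1 = Map(
  "a" -> "India England America,12 12 11,11 10 18",
  "b" -> "Australia India,10 16,9 12",
  "c" -> "America,17,22")
val rolled2 = Map(
  "a" -> "Delhi Gurgaon Mumbai,we1 we1 we3",
  "b" -> "Bangalore,we2",
  "c" -> "Delhi,we4")

// Inner join: keys present in both sides (rdd1.join(rdd2) on pair RDDs).
val inner: Map[String, (String, String)] =
  rolled1.flatMap { case (k, v) => rolled2.get(k).map(v2 => k -> (v, v2)) }

// Left outer join: every key of rolled1, with an Option for the right
// side (rdd1.leftOuterJoin(rdd2) on pair RDDs).
val left: Map[String, (String, Option[String])] =
  rolled1.map { case (k, v) => k -> (v, rolled2.get(k)) }
```

Because each key now appears exactly once per side after the roll-up, the join produces one row per name with no repeated keys, which is what the question asks for.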
