
I'm a beginner with Spark, and I need to combine data stored across several files into a single one.

Note: I have already done this with Talend, and my goal is to do the same thing with Spark (Scala).

Example :

File 1:

id | attr1.1 | attr1.2 | attr1.3
1  |   aaa   |   aab   |   aac
2  |   aad   |   aae   |   aaf

File 2:

id | attr2.1 | attr2.2 | attr2.3
1  |   lll   |   llm   |   lln
2  |   llo   |   llp   |   llq

File 3:

id | attr3.1 | attr3.2 | attr3.3
1  |   sss   |   sst   |   ssu
2  |   ssv   |   ssw   |   ssx

Desired output:

id |attr1.1|attr1.2|attr1.3|attr2.1|attr2.2|attr2.3|attr3.1|attr3.2|attr3.3
1  |  aaa  |  aab  |  aac  |  lll  |  llm  |  lln  |  sss  |  sst  |  ssu
2  |  aad  |  aae  |  aaf  |  llo  |  llp  |  llq  |  ssv  |  ssw  |  ssx

I have 9 files about orders, customers, items, and so on, with several hundred thousand lines in total, which is why I have to use Spark. Fortunately, the data can be linked by ids.

The file format is .csv.

Final objective: build some visualizations from the file generated by Spark.

Question: can you give me some pointers on how to do this? I have seen several approaches with RDDs or DataFrames, but I am completely lost...

Thanks

Royce

1 Answer

You didn't specify anything about the original file formats, so assuming you already have them in DataFrames f1, f2, ..., you can create a unified DataFrame by joining them on the id column: val unified = f1.join(f2, f1("id") === f2("id")).join(f3, f1("id") === f3("id"))....
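Since the question mentions .csv files, here is a minimal sketch of how the join above might look end to end. It is only an illustration under stated assumptions: the files are assumed to be comma-separated with a header row, the file names and the column names (attr1_1 and so on, written with underscores because dots in column names need backtick quoting in Spark) are hypothetical, and the sample files are written to a temporary directory only so that the sketch is self-contained.

```scala
import java.nio.file.Files
import org.apache.spark.sql.{DataFrame, SparkSession}

object JoinCsvSketch {
  // Assumption: comma-separated files with a header row.
  def load(spark: SparkSession, path: String): DataFrame =
    spark.read.option("header", "true").csv(path)

  // Writes three tiny sample files (hypothetical names and columns),
  // loads them, and joins them on explicit key equality.
  def build(spark: SparkSession): DataFrame = {
    val dir = Files.createTempDirectory("join-csv-sketch")
    def writeSample(name: String, content: String): String = {
      val path = dir.resolve(name)
      Files.write(path, content.getBytes("UTF-8"))
      path.toString
    }

    val f1 = load(spark, writeSample("file1.csv", "id,attr1_1,attr1_2\n1,aaa,aab\n2,aad,aae\n"))
    val f2 = load(spark, writeSample("file2.csv", "id,attr2_1,attr2_2\n1,lll,llm\n2,llo,llp\n"))
    val f3 = load(spark, writeSample("file3.csv", "id,attr3_1,attr3_2\n1,sss,sst\n2,ssv,ssw\n"))

    // Inner joins on the key; note that each input's "id" column is kept,
    // so the result contains the key once per joined file.
    f1.join(f2, f1("id") === f2("id"))
      .join(f3, f1("id") === f3("id"))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("join-csv-sketch")
      .getOrCreate()

    build(spark).show()
    spark.stop()
  }
}
```

With real data you would replace the temporary sample files with the paths to your own files. Because these are inner joins, an id that is missing from any one file drops out of the result.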

Arnon Rotem-Gal-Oz
  • Indeed, I updated my question with the file format (`.csv`). So I will create 9 DataFrames and try to join them. Thanks. – Royce Dec 12 '18 at 16:45
  • It seems to work, but is there a way to make the columns "distinct"? The key column appears twice. – Royce Dec 12 '18 at 22:16
  • If the ID column has the same name in all files, use f1.join(f2, "id"). If the name differs, you can either rename it beforehand (using f2.withColumnRenamed("oldid", "id")) so that you can use the join above, or use select after the join to retain only the columns you need. – Arnon Rotem-Gal-Oz Dec 13 '18 at 02:46
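The suggestion in the last comment can be sketched as follows. This is an illustrative example, not a definitive implementation: the helper name joinOnKey, the sample values, and the column names (including "oldid") are hypothetical, and small in-memory DataFrames stand in for the loaded files. Joining on a sequence of column names, rather than on an equality expression, keeps a single copy of the key, and reduce extends the same join to any number of DataFrames.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object DedupKeySketch {
  // Hypothetical helper: joins all frames on a shared key column.
  // The Seq-of-names form of join keeps only one copy of the key.
  def joinOnKey(frames: Seq[DataFrame], key: String = "id"): DataFrame =
    frames.reduce((left, right) => left.join(right, Seq(key)))

  // Builds three small in-memory frames (stand-ins for the CSV files).
  def sample(spark: SparkSession): Seq[DataFrame] = {
    import spark.implicits._
    val f1 = Seq(("1", "aaa"), ("2", "aad")).toDF("id", "attr1_1")
    // Assumption: this file names its key "oldid", so it is renamed first.
    val f2 = Seq(("1", "lll"), ("2", "llo")).toDF("oldid", "attr2_1")
      .withColumnRenamed("oldid", "id")
    val f3 = Seq(("1", "sss"), ("2", "ssv")).toDF("id", "attr3_1")
    Seq(f1, f2, f3)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("dedup-key-sketch")
      .getOrCreate()

    val unified = joinOnKey(sample(spark))
    unified.show() // a single "id" column followed by the attribute columns
    spark.stop()
  }
}
```

The join here is an inner join by default. If some ids are missing from some files and you still want to keep them, the three-argument form left.join(right, Seq(key), "outer") is one option to consider.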