I'm a beginner with Spark, and I need to combine data stored in several files into a single one.
Note: I have already done this with Talend, and my goal is to do the same thing with Spark (Scala).
Example:
File 1:
id | attr1.1 | attr1.2 | attr1.3
1 | aaa | aab | aac
2 | aad | aae | aaf
File 2:
id | attr2.1 | attr2.2 | attr2.3
1 | lll | llm | lln
2 | llo | llp | llq
File 3:
id | attr3.1 | attr3.2 | attr3.3
1 | sss | sst | ssu
2 | ssv | ssw | ssx
Desired output:
id |attr1.1|attr1.2|attr1.3|attr2.1|attr2.2|attr2.3|attr3.1|attr3.2|attr3.3
1 | aaa | aab | aac | lll | llm | lln | sss | sst | ssu
2 | aad | aae | aaf | llo | llp | llq | ssv | ssw | ssx
I have 9 files (orders, customers, items, ...) with several hundred thousand lines, which is why I want to use Spark. Fortunately, the data can be joined on ids.
The file format is .csv.
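From what I have read, the DataFrame API seems like the natural fit. Here is a minimal sketch of what I imagine (assuming Spark 2.x with its built-in CSV reader; the file paths and the MergeFiles object name are placeholders I made up to match the example above):

```scala
import org.apache.spark.sql.SparkSession

object MergeFiles {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("MergeFiles")
      .getOrCreate()

    // Read one CSV file with a header row into a DataFrame
    def readCsv(path: String) =
      spark.read.option("header", "true").csv(path)

    val file1 = readCsv("file1.csv")
    val file2 = readCsv("file2.csv")
    val file3 = readCsv("file3.csv")

    // Joining on Seq("id") keeps a single "id" column in the result
    val merged = file1
      .join(file2, Seq("id"))
      .join(file3, Seq("id"))

    merged.show()
  }
}
```

For my real data I would just chain the join six more times to cover all 9 files, if that is the right approach.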
Final objective: to build some visualizations from the file generated by Spark.
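For that last step, I assume I would write the merged result back out as a single CSV (again only a sketch; coalesce(1) moves all rows through one task, which I hope is acceptable for a few hundred thousand lines, and the output path is made up):

```scala
// Write the merged DataFrame as one CSV file for the visualization tool.
// coalesce(1) produces a single part file inside "output/merged".
merged
  .coalesce(1)
  .write
  .option("header", "true")
  .mode("overwrite")
  .csv("output/merged")
```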
Question: Can you give me some clues on how to approach this task? I have seen several approaches using RDDs and DataFrames, but I am completely lost...
Thanks