I have 3 files in HDFS and would like the most efficient way to sort them, first on the 1st column and then on the 2nd column, and store the sorted result in a new file on HDFS, using Scala (or Python) in Spark 1.4.1:
hdfs:///test/2016/file.csv
hdfs:///test/2015/file.csv
hdfs:///test/2014/file.csv
The files look like this (no header):
hdfs:///test/2016/file.csv
127,56,abc
125,56,abc
121,56,abc
hdfs:///test/2015/file.csv
126,66,abc
122,56,abc
123,46,abc
hdfs:///test/2014/file.csv
122,66,abc
128,56,abc
123,16,abc
The sorted output I want to save to HDFS:
hdfs:///test/output/file.csv
121,56,abc
122,56,abc
122,66,abc
123,16,abc
123,46,abc
125,56,abc
126,66,abc
127,56,abc
128,56,abc
I am very new to Spark, and so far I only know how to load a file:
val textFile = sc.textFile("hdfs:///test/2016/file.csv")
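From the documentation it looks like sc.textFile also accepts comma-separated paths (and glob patterns), so I assume all three files can be loaded as a single RDD:

// load all three inputs as one RDD; a glob like hdfs:///test/201[4-6]/file.csv should also work
val lines = sc.textFile("hdfs:///test/2014/file.csv,hdfs:///test/2015/file.csv,hdfs:///test/2016/file.csv")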
I tried to read up on how to sort, but it is not clear which libraries work for this case (CSV files) and this version of Spark (1.4.1), or how to use them. Please help, Joe
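Edit: from snippets I found, the closest I could piece together is the sketch below (run in spark-shell, where sc is predefined): a numeric sort on the 1st column, then the 2nd, written back to HDFS. I am not sure this is correct for 1.4.1 or the most efficient way:

// sort numerically by the 1st column, then by the 2nd, keeping the whole row
val sorted = lines
  .map(_.split(","))
  .sortBy(cols => (cols(0).toInt, cols(1).toInt))
  .map(_.mkString(","))

// saveAsTextFile creates a directory of part files, not a single CSV;
// coalesce(1) pulls everything into one partition so there is a single part-00000,
// which could then be renamed to file.csv with hdfs dfs -mv
sorted.coalesce(1).saveAsTextFile("hdfs:///test/output")

Is this right, and does coalesce(1) hurt the parallelism of the sort?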