I have a list of filenames and paths stored in a .txt file in HDFS; each line points to a file that is also stored in HDFS. I would like to call a function that parses each of those files and writes the result back to HDFS, in parallel. This is what I have so far:
val files = sc.textFile("Filenames.txt")
val paths = files.map(line => line.split(" "))
paths.collect().foreach(p => parseAndWrite(p(1)))
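For context, each line of Filenames.txt is assumed to look something like the following (a hypothetical layout, inferred from the split on a space and the use of index 1 above; the second whitespace-separated field is the HDFS path):

somename /user/someuser/data/file1.csv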
However, Spark ends up parsing each file one at a time instead of in parallel. I've also tried skipping the collect with another map, and iterating over the collected paths with .par (a Scala parallel collection, ParVector), to no avail; a sketch of that attempt is below. How can I best approach this?
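The .par attempt looked roughly like this (a sketch; collectedPaths is an illustrative name, and .par on a Vector assumes Scala 2.12-style parallel collections):

// Collect the paths to the driver, then iterate with a parallel collection.
// parseAndWrite still runs on the driver, just on multiple threads.
val collectedPaths = paths.collect().map(p => p(1)).toVector
collectedPaths.par.foreach(filepath => parseAndWrite(filepath))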
Edit:
parseAndWrite would consist of something like this:
def parseAndWrite(filepath: String): Unit = {
  val df = spark.read.format("csv").load(filepath)
  // do some parsing logic on df here, producing dfParsed
  dfParsed.write.format("csv").save(anotherfilepath)
}