
I have a .txt file in HDFS containing a list of filenames/paths that point to other files also stored in HDFS. For each of these files, I would like to call, in parallel, a function that does some parsing and then writes the result back to HDFS. This is what I have so far:

val files = sc.textFile("Filenames.txt")
val paths = files.map( line => line.split(" ") )
paths.collect().foreach( path => parseAndWrite( path(1) ) )

However, Spark ends up parsing each file one at a time instead of in parallel. I've also tried skipping the collect and using another map instead, and using .par from Scala's parallel collections (ParVector), to no avail. How can I best approach this?
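For reference, the .par attempt looked roughly like this (a rough sketch, the variable name is just illustrative):

val localPaths = paths.map(_(1)).collect().par   // collect paths to the driver, then iterate with a parallel collection
localPaths.foreach( path => parseAndWrite(path) )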

Edit:

parseAndWrite would consist of something like this:

def parseAndWrite(filepath: String): Unit = {
    val df = spark.read.format("csv").load(filepath)

    // do some parsing logic on df here, producing dfParsed

    dfParsed.write.format("csv").save(anotherfilepath) // anotherfilepath is the output location in HDFS
}
L. Chu
  • It's not very clear what you want the input of parseAndWrite to be. Mentioning what logic is in parseAndWrite (simplified, if possible) would help. If you are OK with the contents of all the files being distributed and processed together, then the answer by @Steven can be followed. If not, then each input file's contents needs to be loaded into an individual RDD and processed, to achieve the parallelism you desire. – sujit Mar 21 '18 at 12:26
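A minimal sketch of that per-file approach, submitting one Spark job per file from a driver-side thread pool (the pool size and variable names here are assumptions, not from the original post):

import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

// Driver-side pool; each parseAndWrite call becomes its own Spark job,
// and several jobs can run at once when they are submitted concurrently.
implicit val ec: ExecutionContext =
  ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(8))

val filePaths = sc.textFile("Filenames.txt").map(_.split(" ")(1)).collect()
val jobs = filePaths.toSeq.map(p => Future { parseAndWrite(p) })
Await.result(Future.sequence(jobs), Duration.Inf)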

1 Answer


See
https://stackoverflow.com/a/24036343/5568528

You can use wildcards:
sc.textFile("/home/mydir/files/*")
or, possibly, explode the array. In Python this would be
sc.textFile(*paths)
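In Scala/Spark SQL terms, a minimal sketch of both variants might look like this (the CSV format is carried over from the question's parseAndWrite, the output path is illustrative, and all files are assumed to share a schema):

// 1) Wildcard: read every matching file into one distributed DataFrame.
val allFiles = spark.read.format("csv").load("/home/mydir/files/*")

// 2) Explicit paths: DataFrameReader.load also accepts a varargs list of paths.
val pathList = sc.textFile("Filenames.txt").map(_.split(" ")(1)).collect()
val allFiles2 = spark.read.format("csv").load(pathList: _*)

// ...apply the parsing logic once to the combined DataFrame, then write, e.g.:
// allFiles2.write.format("csv").save("/home/mydir/output")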

Steven Black