I'm just starting in Spark & Scala
I have a directory with multiple files in it, and I successfully load them using
sc.wholeTextFiles(directory)
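For reference, on a single directory this gives me exactly the shape I want (a toy sketch; the path is made up):
val files = sc.wholeTextFiles("hdfs://host/data/batch1")
// each element is (fullFilePath, fileContent)
files.take(2).foreach { case (name, content) => println(s"$name -> ${content.length} chars") }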
Now I want to go one level up: my actual directory contains subdirectories, and the files live inside those. My goal is still to get an RDD[(String,String)], where each pair holds a file's name and its content, so I can move forward.
I tried the following:
val listOfFolders = getListOfSubDirectories(rootFolder)
val input = listOfFolders.map(directory => sc.wholeTextFiles(directory))
but that gave me a Seq[RDD[(String,String)]].
How do I transform this Seq into a single RDD[(String,String)]?
Or am I not doing this right, and should I try a different approach?
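For instance, would a wildcard path let me skip the per-directory loop entirely? An untested sketch, assuming wholeTextFiles accepts Hadoop glob patterns and that the files sit directly inside the subdirectories:
// rootFolder/* expands to the subdirectories, whose files are then read
val input = sc.wholeTextFiles(rootFolder + "/*")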
Edit: added code
// HADOOP VERSION
val rootFolderHDFS = "hdfs://****/"
val hdfsURI = "hdfs://****/**/"
// returns a list of folders (currently about 800)
val listOfFoldersHDFS = ListDirectoryContents.list(hdfsURI, rootFolderHDFS)
// one RDD per folder => List[RDD[(String, String)]]
val inputHDFS = listOfFoldersHDFS.map(directory => sc.wholeTextFiles(directory))
// goal: collapse the list into a single RDD[(String, String)]
// val inputHDFS2 = inputHDFS.reduceRight((rdd1, rdd2) => rdd2 ++ rdd1)
val init = sc.parallelize(Array[(String, String)]())
// each ++ wraps the accumulator in yet another union
val inputHDFS2 = inputHDFS.foldRight(init)((rdd1, rdd2) => rdd2 ++ rdd1)
// throws org.apache.spark.SparkException: Job aborted due to stage failure:
// Task serialization failed: java.lang.StackOverflowError
println(inputHDFS2.count)
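Given the StackOverflowError, I suspect the ~800 chained ++ calls build a deeply nested union lineage that blows the stack during task serialization. Would the flat SparkContext.union avoid that? An untested sketch (flatInput is just a placeholder name):
// builds a single union over all the RDDs instead of nesting them pairwise
val flatInput = sc.union(inputHDFS)
println(flatInput.count)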