
I'm just starting out with Spark & Scala.

I have a directory with multiple files in it, and I successfully load them using

sc.wholeTextFiles(directory)

Now I want to go one level up. I actually have a directory that contains subdirectories, which in turn contain files. My goal is to get an RDD[(String,String)], where each pair holds a file's name and its content, so I can move forward.

I tried the following:

val listOfFolders = getListOfSubDirectories(rootFolder)
val input = listOfFolders.map(directory => sc.wholeTextFiles(directory))

but I got a Seq[RDD[(String,String)]]. How do I transform this Seq into a single RDD[(String,String)]?

Or maybe I'm not doing things right and I should try a different approach?

Edit: added code

// HADOOP VERSION
val rootFolderHDFS = "hdfs://****/"
val hdfsURI = "hdfs://****/**/"

// returns a list of folders (currently about 800)
val listOfFoldersHDFS = ListDirectoryContents.list(hdfsURI,rootFolderHDFS)
val inputHDFS = listOfFoldersHDFS.map(directory => sc.wholeTextFiles(directory))
// RDD[(String,String)]
//    val inputHDFS2 = inputHDFS.reduceRight((rdd1,rdd2) => rdd2 ++ rdd1)
val init = sc.parallelize(Array[(String, String)]())
val inputHDFS2 = inputHDFS.foldRight(init)((rdd1,rdd2) => rdd2 ++ rdd1)

// returns org.apache.spark.SparkException: Job aborted due to stage failure: Task serialization failed: java.lang.StackOverflowError
println(inputHDFS2.count)
m-ric
Stephane Maarek

3 Answers


You can reduce on the Seq like this (concatenating the RDDs with ++):

val reduced: RDD[(String, String)] = input.reduce((left, right) => left ++ right)

A few more details on why we can apply reduce here:

  • ++ is associative - it does not matter whether you group it as rdda ++ (rddb ++ rddc) or (rdda ++ rddb) ++ rddc
  • this assumes the Seq is non-empty (otherwise fold would be a better choice; it requires an empty RDD[(String, String)] as the initial accumulator, as sketched below).

Depending on the exact type of the Seq, you might get a stack overflow, so be careful and test with a larger collection, though for the standard library collections I think it is safe.
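
If the Seq might be empty, a rough sketch of the fold variant mentioned above (assuming `sc` is your SparkContext and `input` is the Seq[RDD[(String, String)]] from the question):

import org.apache.spark.rdd.RDD

// Start from an empty RDD so the fold is safe even for an empty Seq.
val init: RDD[(String, String)] = sc.parallelize(Seq.empty[(String, String)])
val combined: RDD[(String, String)] = input.foldLeft(init)((acc, rdd) => acc ++ rdd)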

Gábor Bakos
  • hi! I was using union just before your answer, and got a StackOverflow error... now I'm using ++ and I still get one... what's wrong? org.apache.spark.SparkException: Job aborted due to stage failure: Task serialization failed: java.lang.StackOverflowError – Stephane Maarek Dec 31 '14 at 15:54
  • Hmm. In that case try `reduceRight`/`foldRight`. It might avoid that. (Which version of Scala do you use?) – Gábor Bakos Dec 31 '14 at 15:56
  • I use 2.10.4. Regarding foldRight, how would you write the function? (I need to start with an element, but I don't know how to create an empty RDD) – Stephane Maarek Dec 31 '14 at 16:03
  • You can use `reduceRight` if you know it will never be empty. For `foldRight`: `val init = sc.parallelize(Array[(String, String)]())`, where `sc` is a `SparkContext`. (http://spark.apache.org/docs/latest/programming-guide.html) – Gábor Bakos Dec 31 '14 at 16:08
  • Can you tell what is the exact type of the returned `Seq`? – Gábor Bakos Dec 31 '14 at 16:14
  • I updated my post with the code. I'd be tempted to say Seq[RDD[(String,String)]] but I don't want to be mistaken – Stephane Maarek Dec 31 '14 at 16:17
  • I see you managed to solve it the other way. Anyway, I was asking about the `getClass` of the `Seq` you got. It seems I remembered wrong: the *Right versions stack overflow for `List`s and the *Left ones do not. So the non-stackoverflowing alternative might be `reduceLeft` or `foldLeft` if it also uses `scala.collection.immutable.::` (aka a non-empty `List`). – Gábor Bakos Dec 31 '14 at 17:23
  • A reduceRight can be more likely to stack overflow for long lists in Scala, but that's probably not what's happening here. Stack overflows can happen when unioning lots of RDDs because the RDDs maintain their full lineage. To shorten the stack, you need to periodically .cache() the result and let the intermediate RDDs go out of scope. – nairbv Jan 18 '16 at 17:06

You should use the union provided by SparkContext:

val rdds: Seq[RDD[Int]] = (1 to 100).map(i => sc.parallelize(Seq(i)))
val rdd_union: RDD[Int] = sc.union(rdds) 
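
Applied to the setup in the question, a rough sketch would look like this (assuming `listOfFolders` is the sequence of directory paths from the question and `sc` is the SparkContext):

import org.apache.spark.rdd.RDD

// One RDD per directory, then a single flat union over the whole sequence.
val perFolder: Seq[RDD[(String, String)]] = listOfFolders.map(directory => sc.wholeTextFiles(directory))
val all: RDD[(String, String)] = sc.union(perFolder)

A single union over the whole sequence avoids building a deeply nested chain of pairwise unions, which is what tends to trigger the StackOverflowError.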
raam86

Instead of loading each directory into a separate RDD, can you just use a path wildcard to load all directories into a single RDD?

Given the following directory tree...

$ tree test/spark/so
test/spark/so
├── a
│   ├── text1.txt
│   └── text2.txt
└── b
    ├── text1.txt
    └── text2.txt

Create the RDD with a wildcard for the directory.

scala> val rdd =  sc.wholeTextFiles("test/spark/so/*/*")
rdd: org.apache.spark.rdd.RDD[(String, String)] = test/spark/so/*/ WholeTextFileRDD[16] at wholeTextFiles at <console>:37

Count is 4 as you would expect.

scala> rdd.count
res9: Long = 4

scala> rdd.collect
res10: Array[(String, String)] =
Array((test/spark/so/a/text1.txt,a1
a2
a3), (test/spark/so/a/text2.txt,a3
a4
a5), (test/spark/so/b/text1.txt,b1
b2
b3), (test/spark/so/b/text2.txt,b3
b4
b5))
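
The same wildcard approach should carry over to the HDFS layout described in the question; since the real root path is elided there, the path below is only a placeholder:

scala> // "hdfs://namenode/root" is a placeholder; substitute your actual HDFS root
scala> val rddHDFS = sc.wholeTextFiles("hdfs://namenode/root/*/*")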
Mike Park
  • I'm finding this solution to be incredibly slow. It seems to be single-threaded or something. I can load the same data much faster by creating a list of RDDs, one per directory, though then I have the problem of stack overflows when unioning them. – nairbv Jan 18 '16 at 17:08
  • @Brian - I can't see why it would be slower without seeing your implementation and what you're trying to do. Make a new post and reference this one maybe? – Mike Park Jan 18 '16 at 19:48