
I'm currently trying to train a set of Word2Vec vectors on the UMBC WebBase corpus (around 30 GB of text in 400 files).

I often run into out-of-memory situations, even on machines with 100 GB of RAM or more. I run Spark embedded in the application itself. I tried to tweak it a little, but I am not able to perform this operation on more than 10 GB of textual data. The clear bottleneck of my implementation is the union of the previously computed RDDs; that is where the out-of-memory exception comes from.

Maybe one of you has the experience to come up with a more memory-efficient implementation than this:

import java.io.File
import java.util.Properties

import scala.collection.parallel.ParSeq
import scala.io.Source

import edu.stanford.nlp.pipeline.StanfordCoreNLP

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.feature.{Word2Vec, Word2VecModel}
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// TxtFileFilter, processWebBaseLine, ModelUtil and Config are defined elsewhere in my project.
object SparkJobs {
  val conf = new SparkConf()
    .setAppName("TestApp")
    .setMaster("local[*]")
    .set("spark.executor.memory", "100g")
    .set("spark.rdd.compress", "true")

  val sc = new SparkContext(conf)

  def trainBasedOnWebBaseFiles(path: String): Unit = {
    val folder: File = new File(path)

    val files: ParSeq[File] = folder.listFiles(new TxtFileFilter).toIndexedSeq.par

    var i = 0
    val props = new Properties()
    props.setProperty("annotators", "tokenize, ssplit")
    props.setProperty("nthreads", "2")
    val pipeline = new StanfordCoreNLP(props)

    // preprocess the files in parallel
    val training_data_raw: ParSeq[RDD[Seq[String]]] = files.map { file =>
      println(file.getName + " - " + file.length())
      // preprocess each line of the file
      val rdd_lines: Iterator[Option[Seq[String]]] = for (line <- Source.fromFile(file, "utf-8").getLines) yield {
        // performs some preprocessing like tokenization, stop word filtering etc.
        processWebBaseLine(pipeline, line)
      }
      val filtered_rdd_lines = rdd_lines.filter(_.isDefined).map(_.get).toList
      println(s"File $i done")
      i = i + 1
      sc.parallelize(filtered_rdd_lines).persist(StorageLevel.MEMORY_ONLY_SER)
    }

    val rdd_file = sc.union(training_data_raw.seq)

    val starttime = System.currentTimeMillis()
    println("Start Training")
    val word2vec = new Word2Vec()
    word2vec.setVectorSize(100)

    val model: Word2VecModel = word2vec.fit(rdd_file)

    println("Training time: " + (System.currentTimeMillis() - starttime))
    ModelUtil.storeWord2VecModel(model, Config.WORD2VEC_MODEL_PATH)
  }
}
dice89
    30GB data in files... will certainly result in more than 100GB of Java objects... Do it so that only one file is in memory at one time... process it... then load the next one. – sarveshseri Feb 05 '15 at 14:23
    Also... don't do this -> `StorageLevel.MEMORY_ONLY_SER` – sarveshseri Feb 05 '15 at 14:26
  • I need to process them at once because in the model fitting step all data needs to be present – dice89 Feb 05 '15 at 18:25
  • I believe you can train the model multiple times. Well... that depends on your implementation. – sarveshseri Feb 05 '15 at 18:32
    Try using `StorageLevel.MEMORY_AND_DISK` everywhere... maybe that will help. It will definitely be many times slower... but will work. – sarveshseri Feb 05 '15 at 18:34
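
A minimal sketch of the storage-level change suggested in that last comment, reusing the variable names from the question's code (only a sketch; the rest of the pipeline stays the same):

    // Spill cached partitions to disk when they don't fit in memory, instead of failing.
    sc.parallelize(filtered_rdd_lines).persist(StorageLevel.MEMORY_AND_DISK)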

2 Answers


Like Sarvesh points out in the comments, it is probably too much data for a single machine. Use more machines. We typically see the need for 20–30 GB of memory to work with a file of 1 GB. By this (extremely rough) estimate you'd need 600–800 GB of memory for the 30 GB input. (You can get a more accurate estimate by loading a part of the data.)
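
One way to get such an estimate (a hypothetical sketch, not part of the original answer; the file path and the tokenization are placeholders) is to cache a single file the same way the full job would and read its cached size off the Spark UI:

    import org.apache.spark.storage.StorageLevel

    // Load one representative file, apply a rough stand-in for the real preprocessing,
    // and cache it with the same storage level the full job uses.
    val sample = sc.textFile("/data/umbc/webbase_part_001.txt")   // placeholder path
      .map(_.split("\\s+").toSeq)
      .persist(StorageLevel.MEMORY_ONLY_SER)

    sample.count()   // force materialization
    // The "Storage" tab of the Spark UI now shows the in-memory size of this RDD;
    // multiply by the number of files for a rough total.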

As a more general comment, I'd suggest you avoid using rdd.union and sc.parallelize. Use instead sc.textFile with a wildcard to load all files into a single RDD.
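
A minimal sketch of that suggestion, assuming `sc` and `path` from the question and using a plain whitespace split as a placeholder for the question's processWebBaseLine preprocessing:

    import org.apache.spark.mllib.feature.Word2Vec

    // One RDD over all 400 files; no parallelize, no union.
    val corpus = sc.textFile(path + "/*.txt")
      .map(_.split("\\s+").toSeq)
      .filter(_.nonEmpty)

    val model = new Word2Vec().setVectorSize(100).fit(corpus)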

Daniel Darabos

Have you tried training word2vec vectors on a smaller corpus first? I mention this because I was running the word2vec Spark implementation on a much smaller corpus and still ran into problems with it, because of this issue: http://mail-archives.apache.org/mod_mbox/spark-issues/201412.mbox/%3CJIRA.12761684.1418621192000.36769.1418759475999@Atlassian.JIRA%3E

So for my use case that issue made the word2vec Spark implementation a bit useless. I therefore used Spark for massaging my corpus, but not for actually computing the vectors.

  • As others suggested, stay away from calling rdd.union.
  • Also, I think .toList will gather every line and collect it on your driver machine (the one used to submit the job), which is probably why you are getting out of memory. You should totally avoid turning the data into a list! See the sketch below.
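
A hedged sketch of what that could look like: read all the files as one RDD and keep the preprocessing inside the executors, so nothing is ever collected into a driver-side list (the lower-casing and splitting below are only placeholders for the question's CoreNLP pipeline):

    import org.apache.spark.mllib.feature.Word2Vec

    val corpus = sc.textFile(path + "/*.txt")
      .mapPartitions { lines =>
        // per-partition setup (e.g. building a tokenizer once) would go here;
        // the split is a placeholder for the real preprocessing
        lines.map(_.toLowerCase.split("\\s+").toSeq).filter(_.nonEmpty)
      }

    val model = new Word2Vec().setVectorSize(100).fit(corpus)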
David Przybilla