
When I run my Spark Python code below:

import pyspark
conf = (pyspark.SparkConf()
     .setMaster("local")
     .setAppName("My app")
     .set("spark.executor.memory", "512m"))
sc = pyspark.SparkContext(conf=conf)        # start the SparkContext
data = sc.textFile('/Users/tsangbosco/Downloads/transactions')
data = data.flatMap(lambda x: x.split()).take(all)

The file is about 20 GB and my computer has 8 GB of RAM. When I run the program in standalone mode, it raises an OutOfMemoryError:

Exception in thread "Local computation of job 12" java.lang.OutOfMemoryError: Java heap space
    at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:131)
    at org.apache.spark.api.python.PythonRDD$$anon$1.next(PythonRDD.scala:119)
    at org.apache.spark.api.python.PythonRDD$$anon$1.next(PythonRDD.scala:112)
    at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    at org.apache.spark.api.python.PythonRDD$$anon$1.foreach(PythonRDD.scala:112)
    at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
    at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
    at org.apache.spark.api.python.PythonRDD$$anon$1.to(PythonRDD.scala:112)
    at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
    at org.apache.spark.api.python.PythonRDD$$anon$1.toBuffer(PythonRDD.scala:112)
    at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
    at org.apache.spark.api.python.PythonRDD$$anon$1.toArray(PythonRDD.scala:112)
    at org.apache.spark.api.java.JavaRDDLike$$anonfun$1.apply(JavaRDDLike.scala:259)
    at org.apache.spark.api.java.JavaRDDLike$$anonfun$1.apply(JavaRDDLike.scala:259)
    at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:884)
    at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:884)
    at org.apache.spark.scheduler.DAGScheduler.runLocallyWithinThread(DAGScheduler.scala:681)
    at org.apache.spark.scheduler.DAGScheduler$$anon$1.run(DAGScheduler.scala:666)

Is Spark unable to deal with files larger than my RAM? Could you tell me how to fix it?

Sam
BoscoTsang
  • http://stackoverflow.com/questions/21138751/spark-java-lang-outofmemoryerror-java-heap-space/22742982#22742982 – samthebest Jun 12 '14 at 16:30

1 Answer


Spark can handle this case, but you are using take to force Spark to fetch all of the data into an array (in memory). In such a case, you should write the results out to files instead, for example with saveAsTextFile.
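For example, a rough sketch of the same pipeline that writes its output back to disk instead of collecting it into the driver (the output path is just a placeholder):

import pyspark

conf = pyspark.SparkConf().setMaster("local").setAppName("My app")
sc = pyspark.SparkContext(conf=conf)

# split each line into words lazily; nothing is pulled to the driver yet
words = sc.textFile('/Users/tsangbosco/Downloads/transactions').flatMap(lambda x: x.split())

# let the executors stream the result to files, so the driver never has
# to hold the whole 20 GB data set in memory the way take/collect would
words.saveAsTextFile('/Users/tsangbosco/Downloads/transactions_words')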

If you are only interested in looking at some of the data, you can use sample or takeSample.
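For instance, reusing the words RDD from the snippet above (the sample fraction and count are just illustrative values):

# grab a fixed-size random subset into the driver; memory use stays bounded
subset = words.takeSample(False, 100)

# or keep a sampled RDD (roughly 0.1% of the records) for further distributed processing
sampled = words.sample(False, 0.001)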

zsxwing
  • If I want to use MLlib to train on all the feature samples in the file, how can I send all the samples to training? Thanks – BoscoTsang May 12 '14 at 10:12
  • Actually, MLlib has many algorithms. Which one do you want to use? – zsxwing May 12 '14 at 13:47
  • For example, I want to train the data with Logistic Regression. The file has several columns of data and each column is a feature. How can I train all the samples of a feature with Logistic Regression? – BoscoTsang May 13 '14 at 00:58