
I have seen various threads on this issue but the solutions given are not working in my case.

The environment is pyspark 2.1.0 with Java 7, and it has enough memory and cores.

I am running a spark-submit job that processes JSON files. The job runs fine when the file size is < 200MB, but for anything larger it fails with Container exited with a non-zero exit code 143. When I checked the YARN logs, the error there was java.lang.OutOfMemoryError: Requested array size exceeds VM limit

Since the JSON file is not in a format that can be read directly with spark.read.json(), the first step in the application reads the file as a text file into an RDD, applies map and flatMap to convert it into the required format, and then uses spark.read.json(rdd) to create the DataFrame for further processing. The code is below:

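def read_json(self, spark_session, filepath):
    # Each line of the file becomes one RDD record
    raw_rdd = spark_session.sparkContext.textFile(filepath)
    # Double the braces at every '}{' boundary so the split keeps one on each side
    raw_rdd_add_sep = raw_rdd.map(lambda x: x.replace('}{', '}}{{'))
    # Split each line into one string per JSON object
    raw_rdd_split = raw_rdd_add_sep.flatMap(lambda x: x.split('}{'))
    required_df = spark_session.read.json(raw_rdd_split)
    return required_df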
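
To illustrate what the map and flatMap steps do to a single line, here is a plain-Python sketch with no Spark involved. The function name split_concatenated_json and the sample string are made up for illustration; the real files are assumed to look similar, with objects concatenated back to back.

```python
import json

# Plain-Python illustration of the two lambdas above (hypothetical helper name).
def split_concatenated_json(line):
    # Double the braces at each object boundary so that splitting on '}{'
    # leaves a closing brace on the left piece and an opening brace on the right.
    marked = line.replace('}{', '}}{{')
    return marked.split('}{')

sample = '{"a": 1}{"b": 2}{"c": 3}'  # made-up input, not from the real files
pieces = split_concatenated_json(sample)
print(pieces)

# Each piece should now parse as a standalone JSON object.
parsed = [json.loads(p) for p in pieces]
```

Note that the whole line is held as one string before and after the replace, so if a file has no newlines the entire file ends up in a single record.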

I have tried increasing the memory overhead for the executor and the driver using the options spark.driver.memoryOverhead and spark.executor.memoryOverhead, which didn't help.

I have also enabled off-heap memory with spark.memory.offHeap.enabled and set a value for spark.memory.offHeap.size.

I have tried setting the JVM memory option with spark.driver.extraJavaOptions=-Xms10g

None of the above options worked in this scenario. Some of the JSON files are nearly 1GB, and we need to process ~200 files a day.

Can someone please help resolve this issue?

Mahesh

1 Answer

  1. Regarding "Container exited with a non-zero exit code 143": this code means the container was terminated with SIGTERM (128 + 15), and here that is probably because of a memory problem.

  2. You need to check in the Spark UI whether the settings you set are actually taking effect.

  3. BTW, the ratio of executor memory to overhead memory should be about 4:1.

  4. I don't know why you changed the JVM setting directly with spark.driver.extraJavaOptions=-Xms10g. I recommend using --driver-memory 10g instead, e.g.: spark-submit --driver-memory 10G (as far as I remember, driver memory sometimes only takes effect when set through spark-submit).

  5. From my perspective, you just need to adjust these four settings to match your machine resources:

spark.driver.memoryOverhead,
spark.executor.memoryOverhead,
spark.driver.memory,
spark.executor.memory
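
As a rough sketch of how the four settings and the 4:1 ratio fit together, the helper below builds a spark-submit argument list. The function name build_submit_args, the script name job.py, and the 8g/10g sizes are made up for illustration; pick values that fit your cluster. It uses the property names listed above, but on older releases such as 2.1 running on YARN the overhead properties may instead carry a yarn prefix (spark.yarn.executor.memoryOverhead), so it is worth checking the configuration docs for your version.

```python
def build_submit_args(executor_memory_gb, driver_memory_gb, script="job.py"):
    """Return a spark-submit argument list with overhead set to 1/4 of the heap.

    Hypothetical helper; the script name and sizes are placeholders.
    """
    def overhead_mb(heap_gb):
        # 4:1 heap-to-overhead ratio; a value without a unit is read as MiB
        return heap_gb * 1024 // 4

    return [
        "spark-submit",
        "--driver-memory", f"{driver_memory_gb}g",
        "--executor-memory", f"{executor_memory_gb}g",
        "--conf", f"spark.driver.memoryOverhead={overhead_mb(driver_memory_gb)}",
        "--conf", f"spark.executor.memoryOverhead={overhead_mb(executor_memory_gb)}",
        script,
    ]

args = build_submit_args(executor_memory_gb=8, driver_memory_gb=10)
print(" ".join(args))
```

After submitting, the Environment tab of the Spark UI should show whether these values were actually picked up.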
DennisLi
  • Hi, thank you for the comment. I have tried various combinations of the four config options given in your answer, but it's not working. – Mahesh Apr 08 '20 at 09:41
  • Did you check the Spark UI for the memory? And how many Spark workers are you running? You can add your total cluster cores and memory to your question. – DennisLi Apr 08 '20 at 09:50