I have seen various threads on this issue, but the solutions given there are not working in my case.
The environment is PySpark 2.1.0 with Java 7, and the cluster has enough memory and cores.
I am running a spark-submit job that processes JSON files. The job runs perfectly fine when the file size is under 200 MB, but for anything larger it fails with `Container exited with a non-zero exit code 143`. I then checked the YARN logs, and the underlying error there is `java.lang.OutOfMemoryError: Requested array size exceeds VM limit`.
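For reference, this is roughly how I pulled the container logs (the application id is a placeholder, not the real one):

```shell
# Fetch aggregated container logs for the failed application
# (<application_id> is taken from the YARN ResourceManager UI)
yarn logs -applicationId <application_id> | grep -A 5 "OutOfMemoryError"
```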
Since the JSON files are not in a format that `spark.read.json()` can read directly (multiple objects concatenated with no separator), the first step in the application is to read the JSON as a text file into an RDD, apply `map` and `flatMap` to convert it into the required format, and then use `spark.read.json(rdd)` to create the DataFrame for further processing. The code is below:
    def read_json(self, spark_session, filepath):
        # Read the raw file as text, one record per line
        raw_rdd = spark_session.sparkContext.textFile(filepath)
        # Insert a marker between concatenated objects: '}{' -> '}}{{'
        raw_rdd_add_sep = raw_rdd.map(lambda x: x.replace('}{', '}}{{'))
        # Split on the marker so each element is one complete JSON object
        raw_rdd_split = raw_rdd_add_sep.flatMap(lambda x: x.split('}{'))
        # Let Spark infer the schema from the repaired records
        required_df = spark_session.read.json(raw_rdd_split)
        return required_df
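To show what the replace/split step does, here is a minimal local sketch on a plain string, without Spark (the sample objects are made up for illustration):

```python
import json

# Two JSON objects concatenated with no separator, as in the input files
line = '{"a": 1}{"b": 2}'

# '}{' -> '}}{{' duplicates the boundary braces, so splitting on '}{'
# leaves each piece as a complete, parseable JSON object
parts = line.replace('}{', '}}{{').split('}{')

print(parts)                      # ['{"a": 1}', '{"b": 2}']
print(json.loads(parts[0]))       # {'a': 1}
```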
I have tried increasing the memory overhead for the executor and driver via `spark.driver.memoryOverhead` and `spark.executor.memoryOverhead`, which didn't help.
I have also enabled off-heap memory with `spark.memory.offHeap.enabled` and set `spark.memory.offHeap.size`.
Finally, I tried setting the JVM heap directly with `spark.driver.extraJavaOptions=-Xms10g`.
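Putting it together, the submit command I tried looked roughly like this (the script name and overhead/off-heap sizes shown are illustrative; only the conf keys are the ones I actually set):

```shell
# Sketch of the spark-submit invocation with the settings I tried
spark-submit \
  --conf spark.driver.memoryOverhead=2g \
  --conf spark.executor.memoryOverhead=2g \
  --conf spark.memory.offHeap.enabled=true \
  --conf spark.memory.offHeap.size=4g \
  --conf "spark.driver.extraJavaOptions=-Xms10g" \
  my_json_job.py
```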
None of the above options worked in this scenario. Some of the JSON files are nearly 1 GB, and we need to process ~200 files a day.
Can someone help me resolve this issue, please?