
I am trying to iterate over a JavaPairRDD, apply a transformation on the value (a Java model class; the key is a String), and return the same key/value pairs as a JavaPairRDD.

Before the OutOfMemoryError is thrown, the log says: Marking Stage 5 (saveAsTextFile at AppDaoImpl.java:219) as failed due to a fetch failure from Stage 1 (mapToPair at AppDataUtil.java:221)

Is there any way to optimise the code below? It looks very simple to me, but when I process a huge file I run into this OutOfMemoryError.

I am also passing the following parameters:

--num-executors 20  --executor-memory 12288M --executor-cores 5 --driver-memory 6G --conf spark.yarn.executor.memoryOverhead=1332

Sample code:

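// Re-key each record: tuple._2() is the SimpleGroup read from the Parquet file,
// which is mapped to a Model and keyed by one of its String fields.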
return parquetFileContent.mapToPair(tuple -> {
    SimpleGroup simpleGroup = tuple._2();
    Model inputData = applyTransformationLogic(simpleGroup);
    return new Tuple2<String, Model>(inputData.getSomeStringField(), inputData);
});

Before calling saveAsTextFile(), I combine three RDDs with union and call it as follows:

javaSparkCtx.union(rdd1, rdd2, rdd3).saveAsTextFile("hdfs filepath");

I wanted to write all the RDDs to the same location, which is why I am using union. Is it possible to save each RDD separately to the same location instead?

The log trace is:

15/12/18 15:47:39 INFO scheduler.DAGScheduler: Marking Stage 5 (saveAsTextFile at AppDaoImpl.java:219) as failed due to a fetch failure from Stage 1 (mapToPair at AppDataUtil.java:221)
15/12/18 15:47:39 INFO scheduler.DAGScheduler: Stage 5 (saveAsTextFile at AppDaoImpl.java:219) failed in 78.951 s
15/12/18 15:47:39 INFO scheduler.DAGScheduler: Resubmitting Stage 1 (mapToPair at AppDataUtil.java:221) and Stage 5 (saveAsTextFile at AppDaoImpl.java:219) due to fetch failure
15/12/18 15:47:39 INFO scheduler.DAGScheduler: Executor lost: 2 (epoch 5)
15/12/18 15:47:39 INFO storage.BlockManagerMasterActor: Trying to remove executor 2 from BlockManagerMaster.
15/12/18 15:47:39 INFO storage.BlockManagerMasterActor: Removing block manager BlockManagerId(2, lpdn0185.com, 37626)
15/12/18 15:47:39 INFO storage.BlockManagerMaster: Removed 2 successfully in removeExecutor
15/12/18 15:47:39 INFO scheduler.Stage: Stage 1 is now unavailable on executor 2 (26/56, false)
15/12/18 15:47:39 INFO scheduler.Stage: Stage 2 is now unavailable on executor 2 (25/56, false)
15/12/18 15:47:39 WARN scheduler.TaskSetManager: Lost task 2.1 in stage 5.0 (TID 119, lpdn0185.com): FetchFailed(BlockManagerId(2, lpdn0185.com, 37626), shuffleId=4, mapId=0, reduceId=2, message=
org.apache.spark.shuffle.FetchFailedException: Error in opening FileSegmentManagedBuffer{file=/hdfs1/yarn/nm/usercache/phdpentcustcdibtch/appcache/application_1449986083135_60217/blockmgr-34a2e882-6b36-42c6-bcff-03d9bc5ef80b/2c/shuffle_4_0_0.data, offset=3038022, length=2959077}
    at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.org$apache$spark$shuffle$hash$BlockStoreShuffleFetcher$$unpackBlock$1(BlockStoreShuffleFetcher.scala:67)
  • Where did you save the Spark RDD `parquetFileContent`? – Anil Dec 19 '15 at 09:27
  • @Anil, I am not using cache; it's stored on disk. – Shankar Dec 19 '15 at 10:10
  • Could you share more information about the _applyTransformationLogic_ function (on its resource consumption)? If you're not already using it, give [Kryo serialization](http://spark.apache.org/docs/latest/tuning.html#data-serialization) a try! – Varadharajan Mukundan Dec 19 '15 at 18:30
  • @VaradharajanMukundan: Thanks for the Kryo serialization tip, I will definitely try it. My model class has around 40 properties; the applyTransformationLogic method reads the data from the Parquet file (the SimpleGroup class) and sets it on the model class properties. – Shankar Dec 20 '15 at 16:21
  • @VaradharajanMukundan: Can I use Kryo serialization even if I am not using cache? Will there be any benefit in that case? – Shankar Dec 20 '15 at 16:35
  • Are you positive that your memoryOverhead is set high enough? Otherwise YARN will kill your executors. Have a look at the ResourceManager log file (located on the node running the application manager). You want to look for something along the lines of "running beyond physical memory limits" and "killing container". – Glennie Helles Sindholt Dec 21 '15 at 08:00
  • @GlennieHellesSindholt: Thanks, how do I set the correct memoryOverhead size? – Shankar Dec 21 '15 at 08:45
  • I'm afraid all you can currently do is trial and error :-( My understanding is that Spark 1.6 will address at least some of these issues, but until then... – Glennie Helles Sindholt Dec 21 '15 at 08:52
  • @GlennieHellesSindholt: Thanks, I will try that. – Shankar Dec 21 '15 at 09:40
  • @Ramesh IIRC serialization comes into the picture whenever an object needs to be passed between stages, where it could potentially be transferred to another node. I'm not sure how much benefit it will add, but I think it's very straightforward to give it a try. – Varadharajan Mukundan Dec 22 '15 at 06:21
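
Following up on the Kryo serialization suggestion in the comments, here is a minimal sketch (not the actual job code) of how it could be enabled where the JavaSparkContext is created. The application name is a placeholder, Model is the model class from the question, and whether registering it noticeably shrinks the shuffle data here is an assumption to verify.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class KryoSetupSketch {
    public static void main(String[] args) {
        // Sketch: switch from the default Java serialization to Kryo and register
        // the classes that flow through the shuffle. Registration is optional, but
        // it keeps serialized records smaller because Kryo does not have to write
        // the full class name with every object.
        SparkConf conf = new SparkConf()
                .setAppName("AppDataJob") // placeholder application name
                .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
                .registerKryoClasses(new Class<?>[] { Model.class }); // Model is the question's model class

        JavaSparkContext javaSparkCtx = new JavaSparkContext(conf);
        // ... build the JavaPairRDDs, union them and call saveAsTextFile() as before ...
        javaSparkCtx.stop();
    }
}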
