I have a long-running EMR step that executes spark-submit in EMR client mode. Between job executions, I manually restart the Spark context before the next execution if any configuration changes, such as --executor-memory.
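At a high level, the step's driver looks roughly like this (heavily simplified sketch; hasMoreJobs(), buildConfigForNextJob(), configChanged(), and runNextJob() are placeholders for my actual scheduling and job logic):
import org.apache.spark.SparkConf;
import org.apache.spark.sql.SparkSession;

// Driver started once via spark-submit in client mode; it stays up across job executions.
SparkSession session = SparkSession.builder().getOrCreate();
while (hasMoreJobs()) {                              // placeholder for my scheduling loop
    SparkConf newConfig = buildConfigForNextJob();   // placeholder; may change e.g. spark.executor.memory
    if (configChanged(newConfig)) {                  // placeholder comparison against the current config
        session.close();                             // stop the old context
        session = SparkSession.builder().config(newConfig).getOrCreate();
    }
    runNextJob(session);                             // placeholder for the actual job
}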
I'm running into the following exception when I try to restart the context with the new configuration using:
currentSparkSession.close();
return SparkSession.builder().config(newConfig).getOrCreate();
19/05/23 15:52:35 ERROR SparkContext: Error initializing SparkContext.
java.lang.IllegalStateException: Spark context stopped while waiting for backend
at org.apache.spark.scheduler.TaskSchedulerImpl.waitBackendReady(TaskSchedulerImpl.scala:689)
at org.apache.spark.scheduler.TaskSchedulerImpl.postStartHook(TaskSchedulerImpl.scala:186)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:567)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2516)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:923)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:915)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:915)
...
19/05/23 15:52:35 INFO SparkContext: SparkContext already stopped.
19/05/23 15:52:35 WARN TransportChannelHandler: Exception in connection from /172.31.0.165:42556
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:192)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:221)
at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:899)
at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:275)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:643)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:566)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:480)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:442)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
at java.lang.Thread.run(Thread.java:748)
I tried making the thread sleep for a bit, in case some time is needed between the stop and the start:
currentSparkSession.close();
Thread.sleep(5000); // Sleep 5 seconds
return SparkSession.builder().config(newConfig).getOrCreate();
but that doesn't work either. I looked at the Spark source code, and it looks like currentSparkSession.close()
won't return until the context is actually stopped anyway, so making the thread sleep doesn't accomplish anything.
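For reference, here is a simplified sketch of the restart method (the method name and newConfig are placeholders; the clearActiveSession()/clearDefaultSession() calls are only my guess at what might keep getOrCreate() from handing back the stopped session, and I haven't confirmed they help):
import org.apache.spark.SparkConf;
import org.apache.spark.sql.SparkSession;

// Simplified restart helper; newConfig carries the updated settings
// (e.g. spark.executor.memory) for the next execution.
private SparkSession restartSession(SparkSession currentSparkSession, SparkConf newConfig) {
    currentSparkSession.close();        // blocks until the underlying SparkContext is stopped
    SparkSession.clearActiveSession();  // drop the thread-local reference to the old session
    SparkSession.clearDefaultSession(); // drop the global default session
    return SparkSession.builder()
            .config(newConfig)
            .getOrCreate();             // intended to build a fresh context with newConfig
}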
I also see this in the container logs:
Error occurred during initialization of VM
Initial heap size set to a larger value than the maximum heap size
End of LogType:stdout
which confuses me, because the only configuration I changed between executions was --executor-memory, and I actually DECREASED it rather than increasing it.
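Concretely, the only thing that differs in newConfig between the two executions is something like this (the memory values here are made up for illustration; they're not my real numbers):
// Execution 1 (illustrative value)
newConfig.set("spark.executor.memory", "8g");
// Execution 2: the only change, and it is a decrease (illustrative value)
newConfig.set("spark.executor.memory", "4g");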
I've found similar questions on this site, like Apache Spark running spark-shell on YARN error, but the suggestions there appear to boil down to turning off some resource manager validation checks, which doesn't look very safe to me. Any suggestions?