1

I would appreciate it if you could help me.

During implementation of spark streaming from kafka to hbase (code is attached) we have faced an issue “java.io.IOException: Connection reset by peer” (full log is attached).

This issue comes up if we work with hbase and dynamic allocation option is on in spark settings. In case we write data in hdfs (hive table) instead of hbase or if dynamic allocation option is off there are no errors found.

We have tried to change zookeeper connections, spark executor idle timeout, network timeout. We have tried to change shuffle block transfer service (NIO) but the error is still there. If we set min/max executers (less then 80) amount for dynamic allocation there are no problems too.

What may the problem be? There are a lot of almost the same problems in Jira and stack overflow, but nothing helps.

Versions:

HBase 1.2.0-cdh5.14.0
Kafka  3.0.0-1.3.0.0.p0.40
SPARK2 2.2.0.cloudera2-1.cdh5.12.0.p0.232957
hbase-client/hbase-spark(org.apache.hbase) 1.2.0-cdh5.11.1

Spark settings:

--num-executors=80
--conf spark.sql.shuffle.partitions=200
--conf spark.driver.memory=32g
--conf spark.executor.memory=32g
--conf spark.executor.cores=4

Cluster: 1+8 nodes, 70 CPU, 755Gb RAM, x10 HDD,

Log:

    18/04/09 13:51:56 INFO cluster.YarnClusterScheduler: Executor 717 on lang32.ca.sbrf.ru killed by driver.
18/04/09 13:51:56 INFO storage.BlockManagerMaster: Removed 717 successfully in removeExecutor
18/04/09 13:51:56 INFO spark.ExecutorAllocationManager: Existing executor 717 has been removed (new total is 26)
18/04/09 13:51:56 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 705.
18/04/09 13:51:56 INFO scheduler.DAGScheduler: Executor lost: 705 (epoch 45)
18/04/09 13:51:56 INFO storage.BlockManagerMasterEndpoint: Trying to remove executor 705 from BlockManagerMaster.
18/04/09 13:51:56 INFO cluster.YarnClusterScheduler: Executor 705 on lang32.ca.sbrf.ru killed by driver.
18/04/09 13:51:56 INFO storage.BlockManagerMasterEndpoint: Removing block manager BlockManagerId(705, lang32.ca.sbrf.ru, 22805, None)
18/04/09 13:51:56 INFO spark.ExecutorAllocationManager: Existing executor 705 has been removed (new total is 25)
18/04/09 13:51:56 INFO storage.BlockManagerMaster: Removed 705 successfully in removeExecutor
18/04/09 13:51:56 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 716.
18/04/09 13:51:56 INFO scheduler.DAGScheduler: Executor lost: 716 (epoch 45)
18/04/09 13:51:56 INFO storage.BlockManagerMasterEndpoint: Trying to remove executor 716 from BlockManagerMaster.
18/04/09 13:51:56 INFO cluster.YarnClusterScheduler: Executor 716 on lang32.ca.sbrf.ru killed by driver.
18/04/09 13:51:56 INFO storage.BlockManagerMasterEndpoint: Removing block manager BlockManagerId(716, lang32.ca.sbrf.ru, 28678, None)
18/04/09 13:51:56 INFO spark.ExecutorAllocationManager: Existing executor 716 has been removed (new total is 24)
18/04/09 13:51:56 INFO storage.BlockManagerMaster: Removed 716 successfully in removeExecutor
18/04/09 13:51:56 WARN server.TransportChannelHandler: Exception in connection from /10.116.173.65:57542
java.io.IOException: Connection reset by peer
        at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
        at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
        at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
        at sun.nio.ch.IOUtil.read(IOUtil.java:192)
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
        at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:221)
        at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:899)
        at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:275)
        at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:643)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:566)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:480)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:442)
        at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
        at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
        at java.lang.Thread.run(Thread.java:748)
18/04/09 13:51:56 ERROR client.TransportResponseHandler: Still have 1 requests outstanding when connection from /10.116.173.65:57542 is closed
18/04/09 13:51:56 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 548.
feus.tigris
  • 23
  • 1
  • 4
  • Is that really an issue? In other words, is your streaming job failing, or are you losing data? Otherwise it would just be a warning about "clumsy" executor de-allocation that does not shut down connections in order and/or does not clean-up resources gracefully... – Samson Scharfrichter Apr 28 '18 at 11:10
  • Does that error appear in the _executors_ log (e.g. in the log of an executor being de-allocated, because the ZK/HBase clients are not disconnected gracefully before shutdown) then show in driver log; or does it appear only in the _driver_ log (e.g. executor stopped responding before the end of the shutdown sequence)? – Samson Scharfrichter Apr 28 '18 at 11:17
  • Did you eventually work it out? – Ion Freeman Aug 20 '19 at 15:01

2 Answers2

1

Try setting these two parameters. Also try caching the Dataframe before writing to HBase.

spark.network.timeout

spark.executor.heartbeatInterval
wandermonk
  • 6,856
  • 6
  • 43
  • 93
0

Please see my related answer here: What are possible reasons for receiving TimeoutException: Futures timed out after [n seconds] when working with Spark

It also took me a while to understand why Cloudera is stating following:

Dynamic allocation and Spark Streaming

If you are using Spark Streaming, Cloudera recommends that you disable dynamic allocation by setting spark.dynamicAllocation.enabled to false when running streaming applications.

Reference: https://www.cloudera.com/documentation/spark2/latest/topics/spark2_known_issues.html#ki_dynamic_allocation_streaming

nemeth.io
  • 11
  • 1
  • Don't think it is good idea to not use DA because you use streams, but I have tried connector from Hortonworks, I can see the same error in the log, but it works... – feus.tigris May 30 '18 at 14:14
  • The defects you reference there are about the job blowing up because the executor is killed too early. That's not going to show up as 'connection reset by peer.' Are you sure it's relevant? – Ion Freeman Aug 20 '19 at 15:05