We are using Apache Cassandra-v3.0.9
with com.datastax.cassandra:cassandra-driver-core:3.1.3
. Our application works good all the time, but once in a week we start getting the following exception from our applications:
com.datastax.driver.core.exceptions.OperationTimedOutException: [xxx.xx.xx.xx/xxx.xx.xx.xx:9042] Timed out waiting for server response
at com.datastax.driver.core.exceptions.OperationTimedOutException.copy(OperationTimedOutException.java:44)
at com.datastax.driver.core.exceptions.OperationTimedOutException.copy(OperationTimedOutException.java:26)
at com.datastax.driver.core.DriverThrowables.propagateCause(DriverThrowables.java:37)
at com.datastax.driver.core.ChainedResultSetFuture.getUninterruptibly(ChainedResultSetFuture.java:62)
at com.datastax.driver.core.NewRelicChainedResultSetFuture.getUninterruptibly(NewRelicChainedResultSetFuture.java:11)
at com.datastax.driver.core.AbstractSession.execute(AbstractSession.java:68)
at com.til.cms.graphdao.cassandra.dao.generic.CassandraICMSGenericDaoImpl.getCmsEntityMapForLimitedSize(CassandraICMSGenericDaoImpl.java:2824)
.....
Caused by: com.datastax.driver.core.exceptions.OperationTimedOutException: [xxx.xx.xx.xx/xxx.xx.xx.xx:9042] Timed out waiting for server response
at com.datastax.driver.core.RequestHandler$SpeculativeExecution.onTimeout(RequestHandler.java:770)
at com.datastax.driver.core.Connection$ResponseHandler$1.run(Connection.java:1374)
at io.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:581)
at io.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:655)
at io.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:367)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:145)
These applications are hitting Cassandra datacenter for read requests. The data-center consists of 5 physical servers each with 2 disks, 64 GB RAM, 40 cores, 16GB heap with G1 GC.
There was no problem with Cassandra servers as per our investigation like there was no load average/iowait increase, gc pauses or nodetool/cqlsh connectivity etc. We just started getting these exceptions in our application logs until we restarted Cassandra servers. This exception was reported randomly for different Cassandra servers in the datacenter and we had to restart each of them. In normal time each of these Cassandra server servers 10K read requests/seconds and hardly 10 write requests/seconds. When we encounter this problem read requests are dramatically affected to 2-3 K/seconds.
The replication factor of our cassandra datacenter is 3 and following is way we are making connections
Cluster.builder()
.addContactPoints(nodes)
.withQueryOptions(new QueryOptions().setConsistencyLevel(ConsistencyLevel.LOCAL_ONE))
.withLoadBalancingPolicy(DCAwareRoundRobinPolicy.builder().withLocalDc(localDatacenter).build())
.withSpeculativeExecutionPolicy(PerHostPercentileTracker.builder(13000).build())
.build()
EDIT:
We have observed before we start getting these exceptions, we getting the following WARN
level exceptions in our java application.
2018-04-03 23:40:06,456 WARN [cluster1-timeouter-0] com.datastax.driver.core.RequestHandler [RequestHandler.java:805] Not retrying statement because it is not idempotent (this message will be logged only once). Note that this version of the driver changes the default retry behavior for non-idempotent statements: they won't be automatically retried anymore. The driver marks statements non-idempotent by default, so you should explicitly call setIdempotent(true) if your statements are safe to retry. See https://docs.datastax.com/en/developer/java-driver/3.1/manual/retries/ for more details.
2018-04-04 00:04:24,856 WARN [cluster1-nio-worker-2] com.datastax.driver.core.PercentileTracker [PercentileTracker.java:108] Got request with latency of 16632 ms, which exceeds the configured maximum trackable value 13000
2018-04-04 00:04:24,858 WARN [cluster1-timeouter-0] com.datastax.driver.core.PercentileTracker [PercentileTracker.java:108] Got request with latency of 16712 ms, which exceeds the configured maximum trackable value 13000