I have the a timeout issue showing up with few partition keys on the same table. All queries were working fine until yesterday on the same partition keys. All other queries on the same table are just fine as expected.
I tried to query the same partition key on cqlsh but it is only successful only a few times.
I have 3 nodes in GCP and the keyspace has replication factor of 3. All nodes are showing up. I even tried to increase the read-timeout in cassandra.yaml to 20 sec.
What could be going on? Same query is returning different timeout errors.
cqlsh:us> select address, kind, id from devices_mac where address=0x6854f5fffbbf; OperationTimedOut: errors={'10.142.70.4': 'Client request timeout. See Session.execute_async'}, last_host=10.142.70.4
cqlsh:us> select address, kind, id from devices_mac where address=0x6854f5fffbbf; ReadFailure: Error from server: code=1300 [Replica(s) failed to execute read] message="Operation failed - received 0 responses and 2 failures" info={'failures': 2, 'received_responses': 0, 'required_responses': 1, 'consistency': 'ONE'}
cqlsh:us> select address, kind, id from devices_mac where address=0x6854f5fffbbf; ReadFailure: Error from server: code=1300 [Replica(s) failed to execute read] message="Operation failed - received 0 responses and 2 failures" info={'failures': 2, 'received_responses': 0, 'required_responses': 1, 'consistency': 'ONE'} cqlsh:us> select address, kind, id from devices_mac where address=0x6854f5fffbbf; ReadFailure: Error from server: code=1300 [Replica(s) failed to execute read] message="Operation failed - received 0 responses and 2 failures" info={'failures': 2, 'received_responses': 0, 'required_responses': 1, 'consistency': 'ONE'}
cqlsh:us> select address, kind, id from devices_mac where address=0x6854f5fffbbf;
address | kind | id ----------------+---------+-------------------------------------- 0x6854f5fffbbf | gateway | d5a807e0-b389-11e7-9c90-3bf17563ce39
(1 rows)
cqlsh:us> select address, kind, id from devices_mac where address=0x6854f5fffbbf;
address | kind | id ----------------+---------+-------------------------------------- 0x6854f5fffbbf | gateway | d5a807e0-b389-11e7-9c90-3bf17563ce39
(1 rows)
cqlsh:us> select address, kind, id from devices_mac where address=0x6854f5fffbbf; OperationTimedOut: errors={'10.142.70.4': 'Client request timeout. See Session.execute_async'}, last_host=10.142.70.4
cqlsh:us> tracing on;
Now Tracing is enabled
cqlsh:us> select address, kind, id from devices_mac where address=0x6854f5fffbbf;
address | kind | id
----------------+---------+--------------------------------------
0x6854f5fffbbf | gateway | d5a807e0-b389-11e7-9c90-3bf17563ce39
(1 rows)
Tracing session: 2d7a7510-c056-11e7-b420-1f591a32c1ad
activity | timestamp | source | source_elapsed | client
---------------------------------------------------------------------------------------------------------------+----------------------------+-------------+----------------+-------------
Execute CQL3 query | 2017-11-03 05:16:42.082000 | 10.142.70.4 | 0 | 10.142.70.4
Parsing select address, kind, id from devices_mac where address=0x6854f5fffbbf; [Native-Transport-Requests-1] | 2017-11-03 05:16:42.082000 | 10.142.70.4 | 427 | 10.142.70.4
Preparing statement [Native-Transport-Requests-1] | 2017-11-03 05:16:42.082000 | 10.142.70.4 | 671 | 10.142.70.4
reading data from /10.142.70.3 [Native-Transport-Requests-1] | 2017-11-03 05:16:42.083000 | 10.142.70.4 | 1135 | 10.142.70.4
Sending READ message to /10.142.70.3 [MessagingService-Outgoing-/10.142.70.3] | 2017-11-03 05:16:42.083000 | 10.142.70.4 | 1390 | 10.142.70.4
speculating read retry on /10.142.70.2 [Native-Transport-Requests-1] | 2017-11-03 05:16:42.083000 | 10.142.70.4 | 1619 | 10.142.70.4
Sending READ message to /10.142.70.2 [MessagingService-Outgoing-/10.142.70.2] | 2017-11-03 05:16:42.083000 | 10.142.70.4 | 1863 | 10.142.70.4
INTERNAL_RESPONSE message received from /10.142.70.3 [MessagingService-Incoming-/10.142.70.3] | 2017-11-03 05:16:42.085000 | 10.142.70.4 | 3035 | 10.142.70.4
Processing response from /10.142.70.3 [InternalResponseStage:252] | 2017-11-03 05:16:42.085000 | 10.142.70.4 | 3688 | 10.142.70.4
READ message received from /10.142.70.4 [MessagingService-Incoming-/10.142.70.4] | 2017-11-03 05:16:42.088000 | 10.142.70.3 | 36 | 10.142.70.4
Executing single-partition query on devices_mac [ReadStage-4] | 2017-11-03 05:16:42.088000 | 10.142.70.3 | 292 | 10.142.70.4
Acquiring sstable references [ReadStage-4] | 2017-11-03 05:16:42.089000 | 10.142.70.3 | 478 | 10.142.70.4
Bloom filter allows skipping sstable 89 [ReadStage-4] | 2017-11-03 05:16:42.089000 | 10.142.70.3 | 551 | 10.142.70.4
Key cache hit for sstable 88 [ReadStage-4] | 2017-11-03 05:16:42.089000 | 10.142.70.3 | 580 | 10.142.70.4
Skipped 0/2 non-slice-intersecting sstables, included 0 due to tombstones [ReadStage-4] | 2017-11-03 05:16:42.089000 | 10.142.70.3 | 651 | 10.142.70.4
Sending INTERNAL_RESPONSE message to /10.142.70.4 [MessagingService-Outgoing-/10.142.70.4] | 2017-11-03 05:16:42.089000 | 10.142.70.3 | 1043 | 10.142.70.4
READ message received from /10.142.70.4 [MessagingService-Incoming-/10.142.70.4] | 2017-11-03 05:16:42.137000 | 10.142.70.2 | 16431 | 10.142.70.4
Executing single-partition query on devices_mac [ReadStage-2] | 2017-11-03 05:16:42.140000 | 10.142.70.2 | 19140 | 10.142.70.4
Acquiring sstable references [ReadStage-2] | 2017-11-03 05:16:42.140000 | 10.142.70.2 | 19641 | 10.142.70.4
Bloom filter allows skipping sstable 79 [ReadStage-2] | 2017-11-03 05:16:42.141000 | 10.142.70.2 | 20340 | 10.142.70.4
Key cache hit for sstable 78 [ReadStage-2] | 2017-11-03 05:16:42.141000 | 10.142.70.2 | 21034 | 10.142.70.4
Key cache hit for sstable 77 [ReadStage-2] | 2017-11-03 05:16:42.142000 | 10.142.70.2 | 21281 | 10.142.70.4
Skipped 0/3 non-slice-intersecting sstables, included 0 due to tombstones [ReadStage-2] | 2017-11-03 05:16:42.142000 | 10.142.70.2 | 21759 | 10.142.70.4
REQUEST_RESPONSE message received from /10.142.70.2 [MessagingService-Incoming-/10.142.70.2] | 2017-11-03 05:16:42.151000 | 10.142.70.4 | 69022 | 10.142.70.4
Processing response from /10.142.70.2 [RequestResponseStage-2] | 2017-11-03 05:16:42.151000 | 10.142.70.4 | 69211 | 10.142.70.4
Merged data from memtables and 3 sstables [ReadStage-2] | 2017-11-03 05:16:42.153000 | 10.142.70.2 | 32443 | 10.142.70.4
Read 1 live and 1 tombstone cells [ReadStage-2] | 2017-11-03 05:16:42.153000 | 10.142.70.2 | 32792 | 10.142.70.4
Enqueuing response to /10.142.70.4 [ReadStage-2] | 2017-11-03 05:16:42.153000 | 10.142.70.2 | 33109 | 10.142.70.4
Sending REQUEST_RESPONSE message to /10.142.70.4 [MessagingService-Outgoing-/10.142.70.4] | 2017-11-03 05:16:42.154000 | 10.142.70.2 | 33905 | 10.142.70.4
Request complete | 2017-11-03 05:16:42.151744 | 10.142.70.4 | 69744 | 10.142.70.4
cqlsh:us> select address, kind, id from devices_mac where address=0x6854f5fffbbf;
OperationTimedOut: errors={'10.142.70.4': 'Client request timeout. See Session.execute[_async](timeout)'}, last_host=10.142.70.4
cqlsh:us> select address, kind, id from devices_mac where address=0x6854f5fffbbf;
OperationTimedOut: errors={'10.142.70.4': 'Client request timeout. See Session.execute[_async](timeout)'}, last_host=10.142.70.4
GC LOGS ----
2017-11-03T16:27:53.965+0000: 142177.812: Total time for which application threads were stopped: 0.0109614 seconds, Stopping threads took: 0.0000757 seconds
{Heap before GC invocations=4748 (full 33):
par new generation total 184320K, used 120338K [0x000000008b200000, 0x0000000097a00000, 0x0000000097a00000)
eden space 163840K, 73% used [0x000000008b200000, 0x0000000092764dc0, 0x0000000095200000)
from space 20480K, 0% used [0x0000000095200000, 0x000000009521fe20, 0x0000000096600000)
to space 20480K, 0% used [0x0000000096600000, 0x0000000096600000, 0x0000000097a00000)
concurrent mark-sweep generation total 1710080K, used 697432K [0x0000000097a00000, 0x0000000100000000, 0x0000000100000000)
Metaspace used 46344K, capacity 47706K, committed 47892K, reserved 1091584K
class space used 5827K, capacity 6111K, committed 6164K, reserved 1048576K
2017-11-03T16:27:54.857+0000: 142178.704: [GC (Allocation Failure) 2017-11-03T16:27:54.857+0000: 142178.704: [ParNew
Desired survivor size 10485760 bytes, new threshold 1 (max 1)
- age 1: 51336 bytes, 51336 total
: 120338K->80K(184320K), 0.0105980 secs] 817771K->697549K(1894400K), 0.0107087 secs] [Times: user=0.02 sys=0.00, real=0.01 secs]
Heap after GC invocations=4749 (full 33):
par new generation total 184320K, used 80K [0x000000008b200000, 0x0000000097a00000, 0x0000000097a00000)
eden space 163840K, 0% used [0x000000008b200000, 0x000000008b200000, 0x0000000095200000)
from space 20480K, 0% used [0x0000000096600000, 0x0000000096614070, 0x0000000097a00000)
to space 20480K, 0% used [0x0000000095200000, 0x0000000095200000, 0x0000000096600000)
concurrent mark-sweep generation total 1710080K, used 697469K [0x0000000097a00000, 0x0000000100000000, 0x0000000100000000)
Metaspace used 46344K, capacity 47706K, committed 47892K, reserved 1091584K
class space used 5827K, capacity 6111K, committed 6164K, reserved 1048576K
}
2017-11-03T16:27:54.868+0000: 142178.715: Total time for which application threads were stopped: 0.0115291 seconds, Stopping threads took: 0.0000898 seconds
{Heap before GC invocations=4749 (full 33):
par new generation total 184320K, used 163920K [0x000000008b200000, 0x0000000097a00000, 0x0000000097a00000)
eden space 163840K, 100% used [0x000000008b200000, 0x0000000095200000, 0x0000000095200000)
from space 20480K, 0% used [0x0000000096600000, 0x0000000096614070, 0x0000000097a00000)
to space 20480K, 0% used [0x0000000095200000, 0x0000000095200000, 0x0000000096600000)
concurrent mark-sweep generation total 1710080K, used 697469K [0x0000000097a00000, 0x0000000100000000, 0x0000000100000000)
Metaspace used 46344K, capacity 47706K, committed 47892K, reserved 1091584K
class space used 5827K, capacity 6111K, committed 6164K, reserved 1048576K
2017-11-03T16:28:05.336+0000: 142189.183: [GC (Allocation Failure) 2017-11-03T16:28:05.336+0000: 142189.183: [ParNew
Desired survivor size 10485760 bytes, new threshold 1 (max 1)
- age 1: 94096 bytes, 94096 total
: 163920K->115K(184320K), 0.0389448 secs] 861389K->771393K(1894400K), 0.0390719 secs] [Times: user=0.08 sys=0.00, real=0.04 secs]
CFSTATS for my table ---
Keyspace : us
Read Count: 26141
Read Latency: 0.12058234191499943 ms.
Write Count: 50140
Write Latency: 0.052054906262465096 ms.
Pending Flushes: 0
Table: devices_mac
SSTable count: 2
Space used (live): 223080
Space used (total): 223080
Space used by snapshots (total): 0
Off heap memory used (total): 1068
SSTable Compression Ratio: 0.21666133520615916
Number of keys (estimate): 615
Memtable cell count: 2
Memtable data size: 8719
Memtable off heap memory used: 0
Memtable switch count: 3
Local read count: 186
Local read latency: NaN ms
Local write count: 978
Local write latency: 0.208 ms
Pending flushes: 0
Percent repaired: 0.0
Bloom filter false positives: 0
Bloom filter false ratio: 0.00000
Bloom filter space used: 848
Bloom filter off heap memory used: 832
Index summary off heap memory used: 108
Compression metadata off heap memory used: 128
Compacted partition minimum bytes: 643
Compacted partition maximum bytes: 14237
Compacted partition mean bytes: 1372
Average live cells per slice (last five minutes): NaN
Maximum live cells per slice (last five minutes): 0
Average tombstones per slice (last five minutes): NaN
Maximum tombstones per slice (last five minutes): 0
Dropped Mutations: 0
CFHISTOGRAMS for the table
us/devices_mac histograms
Percentile SSTables Write Latency Read Latency Partition Size Cell Count
(micros) (micros) (bytes)
50% 0.00 0.00 0.00 1331 60
75% 0.00 0.00 0.00 1597 60
95% 0.00 0.00 0.00 1597 60
98% 0.00 0.00 0.00 2759 124
99% 0.00 0.00 0.00 3311 124
Min 0.00 0.00 0.00 643 21
Max 0.00 0.00 0.00 14237 149