
We are running a six-node Cassandra cluster across two AWS regions: ap-southeast-1 (Singapore) and ap-southeast-2 (Sydney).

After running happily for several months, the cluster was given a rolling restart to fix a hung repair, and now each region thinks the other is down. 'nodetool describecluster' shows a schema disagreement, with each side reporting the other's nodes as UNREACHABLE (the first output below is from a Singapore node, the second from a Sydney node):

Cluster Information:
    Name: MegaportGlobal
    Snitch: org.apache.cassandra.locator.DynamicEndpointSnitch
    Partitioner: org.apache.cassandra.dht.Murmur3Partitioner
    Schema versions:
        220727fa-88d2-366f-9473-777e32744c37: [10.5.13.117, 10.5.12.245, 10.5.13.93]

        UNREACHABLE: [10.4.0.112, 10.4.0.169, 10.4.2.186]

Cluster Information:
    Name: MegaportGlobal
    Snitch: org.apache.cassandra.locator.DynamicEndpointSnitch
    Partitioner: org.apache.cassandra.dht.Murmur3Partitioner
    Schema versions:
        3932d237-b907-3ef8-95bc-4276dc7f32e6: [10.4.0.112, 10.4.0.169, 10.4.2.186]

        UNREACHABLE: [10.5.13.117, 10.5.12.245, 10.5.13.93]

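For reference, the schema disagreement can also be cross-checked from cqlsh on each node; a quick sketch, using one of the Singapore nodes as an example:

    cqlsh 10.5.13.117
    cqlsh> SELECT schema_version FROM system.local;
    cqlsh> SELECT peer, schema_version FROM system.peers;
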
From Sydney, 'nodetool status' reports that two of the three Singapore nodes are down:

Datacenter: ap-southeast-2
==========================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address      Load       Tokens  Owns    Host ID                               Rack
UN  10.4.0.112   9.04 GB    256     ?       b9c19de4-4939-4112-bf07-d136d8a57b57  2a
UN  10.4.0.169   9.34 GB    256     ?       2d7c3ac4-ae94-43d6-9afe-7d421c06b951  2a
UN  10.4.2.186   10.72 GB   256     ?       4dc8b155-8f9a-4532-86ec-d958ac207f40  2b
Datacenter: ap-southeast-1
==========================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address      Load       Tokens  Owns    Host ID                               Rack
UN  10.5.13.117  9.45 GB    256     ?       324ee189-3e72-465f-987f-cbc9f7bf740b  1a
DN  10.5.12.245  10.25 GB   256     ?       bee281c9-715b-4134-a033-00479a390f1e  1b
DN  10.5.13.93   12.29 GB   256     ?       a8262244-91bb-458f-9603-f8c8fe455924  1a

But from Singapore, all Sydney nodes are reported as down:

Datacenter: ap-southeast-2
==========================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address      Load       Tokens  Owns    Host ID                               Rack
DN  10.4.0.112   8.91 GB    256     ?       b9c19de4-4939-4112-bf07-d136d8a57b57  2a
DN  10.4.0.169   ?          256     ?       2d7c3ac4-ae94-43d6-9afe-7d421c06b951  2a
DN  10.4.2.186   ?          256     ?       4dc8b155-8f9a-4532-86ec-d958ac207f40  2b
Datacenter: ap-southeast-1
==========================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address      Load       Tokens  Owns    Host ID                               Rack
UN  10.5.13.117  9.45 GB    256     ?       324ee189-3e72-465f-987f-cbc9f7bf740b  1a
UN  10.5.12.245  10.25 GB   256     ?       bee281c9-715b-4134-a033-00479a390f1e  1b
UN  10.5.13.93   12.29 GB   256     ?       a8262244-91bb-458f-9603-f8c8fe455924  1a

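Basic TCP connectivity on the storage port looks fine from both sides; a sketch of the kind of check (assumes the default storage port 7000 and that nc is installed on the hosts):

    # from a Sydney node to a Singapore node
    nc -vz 10.5.13.117 7000
    # and from a Singapore node back to a Sydney node
    nc -vz 10.4.0.112 7000
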
Even more confusingly, 'nodetool gossipinfo' executed in Sydney reports a STATUS of NORMAL for every node:

/10.5.13.117
  generation:1440735653
  heartbeat:724504
  SEVERITY:0.0
  DC:ap-southeast-1
  LOAD:1.0149565738E10
  SCHEMA:7bf335ee-61ae-36c6-a902-c70d785ec7a3
  RACK:1a
  STATUS:NORMAL,-1059943672916788858
  RELEASE_VERSION:2.1.6
  NET_VERSION:8
  RPC_ADDRESS:10.5.13.117
  INTERNAL_IP:10.5.13.117
  HOST_ID:324ee189-3e72-465f-987f-cbc9f7bf740b
/10.5.12.245
  generation:1440734497
  heartbeat:728014
  SEVERITY:0.0
  DC:ap-southeast-1
  LOAD:1.100647505E10
  SCHEMA:7bf335ee-61ae-36c6-a902-c70d785ec7a3
  RACK:1b
  STATUS:NORMAL,-1029869455226513030
  RELEASE_VERSION:2.1.6
  NET_VERSION:8
  RPC_ADDRESS:10.5.12.245
  INTERNAL_IP:10.5.12.245
  HOST_ID:bee281c9-715b-4134-a033-00479a390f1e
/10.4.0.112
  generation:1440973751
  heartbeat:4135
  SEVERITY:0.0
  DC:ap-southeast-2
  LOAD:9.70297176E9
  SCHEMA:7bf335ee-61ae-36c6-a902-c70d785ec7a3
  RACK:2a
  RELEASE_VERSION:2.1.6
  STATUS:NORMAL,-1016623069114845926
  NET_VERSION:8
  RPC_ADDRESS:10.4.0.112
  INTERNAL_IP:10.4.0.112
  HOST_ID:b9c19de4-4939-4112-bf07-d136d8a57b57
/10.5.13.93
  generation:1440734532
  heartbeat:727909
  SEVERITY:0.0
  DC:ap-southeast-1
  LOAD:1.3197536002E10
  SCHEMA:7bf335ee-61ae-36c6-a902-c70d785ec7a3
  RACK:1a
  STATUS:NORMAL,-1021689296016263011
  RELEASE_VERSION:2.1.6
  NET_VERSION:8
  RPC_ADDRESS:10.5.13.93
  INTERNAL_IP:10.5.13.93
  HOST_ID:a8262244-91bb-458f-9603-f8c8fe455924
/10.4.0.169
  generation:1440974511
  heartbeat:1832
  SEVERITY:0.0
  DC:ap-southeast-2
  LOAD:1.0023502338E10
  SCHEMA:7bf335ee-61ae-36c6-a902-c70d785ec7a3
  RACK:2a
  RELEASE_VERSION:2.1.6
  STATUS:NORMAL,-1004223692762353764
  NET_VERSION:8
  RPC_ADDRESS:10.4.0.169
  INTERNAL_IP:10.4.0.169
  HOST_ID:2d7c3ac4-ae94-43d6-9afe-7d421c06b951
/10.4.2.186
  generation:1440734382
  heartbeat:730171
  SEVERITY:0.0
  DC:ap-southeast-2
  LOAD:1.1507595081E10
  SCHEMA:7bf335ee-61ae-36c6-a902-c70d785ec7a3
  RACK:2b
  STATUS:NORMAL,-10099894685483463
  RELEASE_VERSION:2.1.6
  NET_VERSION:8
  RPC_ADDRESS:10.4.2.186
  INTERNAL_IP:10.4.2.186
  HOST_ID:4dc8b155-8f9a-4532-86ec-d958ac207f40

The same command executed in Singapore does not include a STATUS entry for any of the Sydney nodes:

/10.5.12.245
  generation:1440974710
  heartbeat:1372
  SEVERITY:0.0
  LOAD:1.100835806E10
  RPC_ADDRESS:10.5.12.245
  NET_VERSION:8
  SCHEMA:7bf335ee-61ae-36c6-a902-c70d785ec7a3
  RELEASE_VERSION:2.1.6
  STATUS:NORMAL,-1029869455226513030
  DC:ap-southeast-1
  RACK:1b
  INTERNAL_IP:10.5.12.245
  HOST_ID:bee281c9-715b-4134-a033-00479a390f1e
/10.5.13.117
  generation:1440974648
  heartbeat:1561
  SEVERITY:0.0
  LOAD:1.0149992022E10
  RPC_ADDRESS:10.5.13.117
  NET_VERSION:8
  SCHEMA:7bf335ee-61ae-36c6-a902-c70d785ec7a3
  RELEASE_VERSION:2.1.6
  STATUS:NORMAL,-1059943672916788858
  DC:ap-southeast-1
  RACK:1a
  HOST_ID:324ee189-3e72-465f-987f-cbc9f7bf740b
  INTERNAL_IP:10.5.13.117
/10.4.0.112
  generation:1440735420
  heartbeat:23
  SEVERITY:0.0
  LOAD:9.570546197E9
  RPC_ADDRESS:10.4.0.112
  NET_VERSION:8
  SCHEMA:7bf335ee-61ae-36c6-a902-c70d785ec7a3
  RELEASE_VERSION:2.1.6
  DC:ap-southeast-2
  RACK:2a
  INTERNAL_IP:10.4.0.112
  HOST_ID:b9c19de4-4939-4112-bf07-d136d8a57b57
/10.5.13.93
  generation:1440734532
  heartbeat:729862
  SEVERITY:0.0
  LOAD:1.3197536002E10
  RPC_ADDRESS:10.5.13.93
  NET_VERSION:8
  SCHEMA:7bf335ee-61ae-36c6-a902-c70d785ec7a3
  RELEASE_VERSION:2.1.6
  STATUS:NORMAL,-1021689296016263011
  DC:ap-southeast-1
  RACK:1a
  INTERNAL_IP:10.5.13.93
  HOST_ID:a8262244-91bb-458f-9603-f8c8fe455924
/10.4.0.169
  generation:1440974511
  heartbeat:15
  SEVERITY:0.5076141953468323
  RPC_ADDRESS:10.4.0.169
  NET_VERSION:8
  SCHEMA:7bf335ee-61ae-36c6-a902-c70d785ec7a3
  RELEASE_VERSION:2.1.6
  DC:ap-southeast-2
  RACK:2a
  INTERNAL_IP:10.4.0.169
  HOST_ID:2d7c3ac4-ae94-43d6-9afe-7d421c06b951
/10.4.2.186
  generation:1440734382
  heartbeat:15
  SEVERITY:0.0
  RPC_ADDRESS:10.4.2.186
  NET_VERSION:8
  SCHEMA:7bf335ee-61ae-36c6-a902-c70d785ec7a3
  RELEASE_VERSION:2.1.6
  DC:ap-southeast-2
  RACK:2b
  INTERNAL_IP:10.4.2.186
  HOST_ID:4dc8b155-8f9a-4532-86ec-d958ac207f40

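A quick way to spot the missing entries is to filter the gossipinfo output down to the per-node STATUS lines (sketch):

    nodetool gossipinfo | grep -E '^/|STATUS'
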
During a restart, each node can see the remote DC for a little while:

INFO  [GossipStage:1] 2015-08-31 10:53:07,638 OutboundTcpConnection.java:97 - OutboundTcpConnection using coalescing strategy DISABLED
INFO  [HANDSHAKE-/10.4.2.186] 2015-08-31 10:53:08,267 OutboundTcpConnection.java:485 - Handshaking version with /10.4.2.186
INFO  [HANDSHAKE-/10.4.0.169] 2015-08-31 10:53:08,287 OutboundTcpConnection.java:485 - Handshaking version with /10.4.0.169
INFO  [HANDSHAKE-/10.5.12.245] 2015-08-31 10:53:08,391 OutboundTcpConnection.java:485 - Handshaking version with /10.5.12.245
INFO  [HANDSHAKE-/10.5.13.93] 2015-08-31 10:53:08,498 OutboundTcpConnection.java:485 - Handshaking version with /10.5.13.93
INFO  [GossipStage:1] 2015-08-31 10:53:08,537 Gossiper.java:987 - Node /10.5.12.245 has restarted, now UP
INFO  [HANDSHAKE-/10.5.13.117] 2015-08-31 10:53:08,537 OutboundTcpConnection.java:485 - Handshaking version with /10.5.13.117
INFO  [GossipStage:1] 2015-08-31 10:53:08,656 StorageService.java:1642 - Node /10.5.12.245 state jump to normal
INFO  [GossipStage:1] 2015-08-31 10:53:08,820 Gossiper.java:987 - Node /10.5.13.117 has restarted, now UP
INFO  [GossipStage:1] 2015-08-31 10:53:08,852 Gossiper.java:987 - Node /10.5.13.93 has restarted, now UP
INFO  [SharedPool-Worker-33] 2015-08-31 10:53:08,907 Gossiper.java:954 - InetAddress /10.5.12.245 is now UP
INFO  [GossipStage:1] 2015-08-31 10:53:08,947 StorageService.java:1642 - Node /10.5.13.93 state jump to normal
INFO  [GossipStage:1] 2015-08-31 10:53:09,007 Gossiper.java:987 - Node /10.4.0.169 has restarted, now UP
WARN  [GossipTasks:1] 2015-08-31 10:53:09,123 FailureDetector.java:251 - Not marking nodes down due to local pause of 7948322997 > 5000000000
INFO  [GossipStage:1] 2015-08-31 10:53:09,192 StorageService.java:1642 - Node /10.4.0.169 state jump to normal
INFO  [HANDSHAKE-/10.5.12.245] 2015-08-31 10:53:09,199 OutboundTcpConnection.java:485 - Handshaking version with /10.5.12.245
INFO  [GossipStage:1] 2015-08-31 10:53:09,203 Gossiper.java:987 - Node /10.4.2.186 has restarted, now UP
INFO  [GossipStage:1] 2015-08-31 10:53:09,206 StorageService.java:1642 - Node /10.4.2.186 state jump to normal
INFO  [SharedPool-Worker-34] 2015-08-31 10:53:09,215 Gossiper.java:954 - InetAddress /10.5.13.93 is now UP
INFO  [SharedPool-Worker-33] 2015-08-31 10:53:09,259 Gossiper.java:954 - InetAddress /10.5.13.117 is now UP
INFO  [SharedPool-Worker-33] 2015-08-31 10:53:09,259 Gossiper.java:954 - InetAddress /10.4.0.169 is now UP
INFO  [SharedPool-Worker-33] 2015-08-31 10:53:09,259 Gossiper.java:954 - InetAddress /10.4.2.186 is now UP
INFO  [GossipStage:1] 2015-08-31 10:53:09,296 StorageService.java:1642 - Node /10.4.0.169 state jump to normal
INFO  [GossipStage:1] 2015-08-31 10:53:09,491 StorageService.java:1642 - Node /10.5.12.245 state jump to normal
INFO  [HANDSHAKE-/10.5.13.117] 2015-08-31 10:53:09,509 OutboundTcpConnection.java:485 - Handshaking version with /10.5.13.117
INFO  [GossipStage:1] 2015-08-31 10:53:09,511 StorageService.java:1642 - Node /10.5.13.93 state jump to normal
INFO  [HANDSHAKE-/10.5.13.93] 2015-08-31 10:53:09,538 OutboundTcpConnection.java:485 - Handshaking version with /10.5.13.93

Then, about 25 seconds later and without any errors being logged, the remote nodes are marked as down:

INFO  [GossipTasks:1] 2015-08-31 10:53:34,410 Gossiper.java:968 - InetAddress /10.5.13.117 is now DOWN
INFO  [GossipTasks:1] 2015-08-31 10:53:34,411 Gossiper.java:968 - InetAddress /10.5.12.245 is now DOWN
INFO  [GossipTasks:1] 2015-08-31 10:53:34,411 Gossiper.java:968 - InetAddress /10.5.13.93 is now DOWN

We have tried multiple restarts, but the behaviour remains the same.

Edit:

It looks like it is related to the gossip protocol. Turning on extra debug (TRACE) logging shows that the PHI values for the remote nodes increase steadily:

TRACE [GossipTasks:1] 2015-08-31 16:46:44,706 FailureDetector.java:262 - PHI for /10.4.0.112 : 2.9395029255
TRACE [GossipTasks:1] 2015-08-31 16:46:45,727 FailureDetector.java:262 - PHI for /10.4.0.112 : 3.449690761
TRACE [GossipTasks:1] 2015-08-31 16:46:46,728 FailureDetector.java:262 - PHI for /10.4.0.112 : 3.95049114
TRACE [GossipTasks:1] 2015-08-31 16:46:47,730 FailureDetector.java:262 - PHI for /10.4.0.112 : 4.451317456
TRACE [GossipTasks:1] 2015-08-31 16:46:48,732 FailureDetector.java:262 - PHI for /10.4.0.112 : 4.952114357
TRACE [GossipTasks:1] 2015-08-31 16:46:49,733 FailureDetector.java:262 - PHI for /10.4.0.112 : 5.4529339645
TRACE [GossipTasks:1] 2015-08-31 16:46:50,735 FailureDetector.java:262 - PHI for /10.4.0.112 : 5.953951289
TRACE [GossipTasks:1] 2015-08-31 16:46:51,737 FailureDetector.java:262 - PHI for /10.4.0.112 : 6.4547808165
TRACE [GossipTasks:1] 2015-08-31 16:46:52,738 FailureDetector.java:262 - PHI for /10.4.0.112 : 6.955600038
TRACE [GossipTasks:1] 2015-08-31 16:46:53,740 FailureDetector.java:262 - PHI for /10.4.0.112 : 7.456422601
TRACE [GossipTasks:1] 2015-08-31 16:46:54,742 FailureDetector.java:262 - PHI for /10.4.0.112 : 7.957303284
TRACE [GossipTasks:1] 2015-08-31 16:46:55,751 FailureDetector.java:262 - PHI for /10.4.0.112 : 8.461658576
TRACE [GossipTasks:1] 2015-08-31 16:46:56,755 FailureDetector.java:262 - PHI for /10.4.0.112 : 8.9636610545
TRACE [GossipTasks:1] 2015-08-31 16:46:57,763 FailureDetector.java:262 - PHI for /10.4.0.112 : 9.4676926445

The PHI value for each remote node climbs steadily after a restart until it exceeds the failure detection threshold (phi_convict_threshold), at which point the node is marked DOWN.
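
In case it is useful to others, the TRACE output above can be enabled at runtime without a restart; a sketch (assumes 2.1's nodetool setlogginglevel and the failure detector's class name):

    nodetool setlogginglevel org.apache.cassandra.gms.FailureDetector TRACE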

Any suggestions on how to proceed?

Adam Wells
  • The status shown by `nodetool gossipinfo` only indicates the bootstrapping/normal/leaving life-cycle status announced by each node via gossip. It corresponds to the "State=Normal/Leaving/Joining/Moving" field in `nodetool status`, but it does not indicate whether the node is reachable by all other nodes ("Status=Up/Down"). In your case there seem to be connectivity issues between nodes that could be caused by network problems. You should run `tcpdump` on two nodes trying to gossip with each other to see if that's the case. – Stefan Podkowinski Aug 31 '15 at 08:54

2 Answers


For a laggy network, raise the phi failure detection threshold to 12 or 15. This is commonly required in AWS, especially across regions.
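
The setting lives in cassandra.yaml; a sketch of what that might look like (it has to be applied on every node, and a restart is needed for it to take effect):

    # cassandra.yaml
    phi_convict_threshold: 12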

Jeff Jirsa
  • We have increased it to 12 already, as per the docs. The PHI value starts out low, but grows rapidly and linearly over time and exceeds the threshold in about 30 seconds. It gets into the 1000s very quickly, so even a larger value is not going to help. Current thinking is to purge the gossip state and do a whole cluster restart... – Adam Wells Sep 01 '15 at 04:46
  • Have you done a tcpdump to ensure that they're really communicating? The PHI value should keep increasing each time a heartbeat is missed - if it climbs indefinitely, they're not communicating properly. Presumably the security groups are correct, but tcpdump between nodes on port 7000 to see if it's dropping TCP streams? – Jeff Jirsa Sep 01 '15 at 05:01
  • We've run tcpdump, and we can see communication. In the C* logs (TRACE) I can see GossipDigestSynMessage, followed by GossipDigestAckMessage, but no GossipDigestAck2Message... Perhaps an MTU issue? Will need to spend more time on tcpdump (see the capture sketch below)... – Adam Wells Sep 01 '15 at 11:02
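
Capture sketch referenced in the comments above (the interface name and remote IP are only examples):

    tcpdump -i eth0 -nn -s0 'tcp port 7000 and host 10.4.0.112'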

The problem turned out to be the network links into AWS, specifically the network MTU. Due to a subtle problem with our routing configuration, the data path between the Sydney and Singapore regions had become asymmetrical.

I guess the lesson learned is that if the nodes are happy within a DC but not between DCs, it is most likely the network, and things like MTU matter, even if ping and telnet look OK.
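
A couple of checks along these lines would have surfaced it much sooner (sketch; assumes Linux hosts and that ICMP is allowed between the regions):

    # don't-fragment probe sized for a 1500-byte MTU (1472 + 28 bytes of IP/ICMP headers)
    ping -M do -s 1472 -c 3 10.4.0.112
    # or let tracepath report where the path MTU drops
    tracepath 10.4.0.112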

Thanks to Jeff and Stefan for your input - I'll buy you a beer if you ever find yourself in Brisbane!

Adam Wells