We have a two-node Cassandra cluster, and we stopped and restarted one of the nodes. During that interval, the graphs in OpsCenter looked something like this:

[Screenshot: OpsCenter graph]

We restarted the node shown by the orange line. Why is there a break in this graph? I ask because the app was working fine and data was still being written to the node shown by the blue line while the other one was being restarted.

Ankush92

1 Answer

There are two likely explanations for this.

  1. By default, OpsCenter stores some of its information on the cluster being monitored. Depending on the replication strategy and replication factor of the OpsCenter keyspace, the data for the time range in question may have been in a partition managed by the down node.
  2. Something about the restart may have temporarily disrupted the agent component that monitors and stores the information, so that the information was never captured.

Explanation 1 seems most likely, given that the blue node's metrics resume while the orange node's metrics suggest the orange node is still down. If explanation 1 is the case, the data will be delivered to the orange node (via hinted handoffs) and become available once that node finishes rebooting. The graph should show the updated values after that, although refreshing the UI may be required. More nodes would make this kind of gap less likely, and a higher replication factor (RF) would make it very unlikely (practically impossible).

If time and refreshing the UI do not resolve the gap, then explanation 2 is the most likely culprit, and it could indicate a bug in the metric recording mechanism. It would be worth reporting it as such.
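
To make the point about replication factor under explanation 1 more concrete, here is a rough Python sketch. It is a toy model, not Cassandra's actual partitioner or OpsCenter's storage code: nodes get evenly spaced tokens on a ring, keys are hashed with MD5, and replicas are picked by walking clockwise, loosely in the spirit of a simple replication strategy. The names `build_ring`, `replicas`, `unavailable_fraction` and the `metric-N` keys are made up for illustration, and hinted handoff is not modelled.

```python
import hashlib
from bisect import bisect_right

RING_SIZE = 2 ** 128  # size of the toy token space (MD5 output range)


def build_ring(nodes):
    """Give each node one evenly spaced token (toy model, no vnodes)."""
    step = RING_SIZE // len(nodes)
    return [(i * step, name) for i, name in enumerate(nodes)]


def replicas(ring, key, rf):
    """Return up to rf distinct nodes, walking clockwise from the key's token."""
    tokens = [token for token, _ in ring]
    key_token = int(hashlib.md5(key.encode()).hexdigest(), 16)
    start = bisect_right(tokens, key_token) % len(ring)
    count = min(rf, len(ring))
    return [ring[(start + i) % len(ring)][1] for i in range(count)]


def unavailable_fraction(nodes, down, rf, n_keys=10_000):
    """Fraction of hypothetical keys whose replicas are all on down nodes."""
    ring = build_ring(nodes)
    lost = sum(
        all(node in down for node in replicas(ring, f"metric-{i}", rf))
        for i in range(n_keys)
    )
    return lost / n_keys


if __name__ == "__main__":
    nodes = ["blue", "orange"]
    for rf in (1, 2):
        frac = unavailable_fraction(nodes, down={"orange"}, rf=rf)
        print(f"RF={rf}: {frac:.1%} of keys have no live replica")
```

In this model, a two-node cluster with RF=1 leaves roughly half of the keys with no live replica while one node is down, whereas RF=2 leaves none. Adding nodes at RF=1 shrinks the affected share but does not eliminate it, which is consistent with the reasoning above.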

mildewey