The Sysdig Monitor Cloud service experienced multiple, simultaneous failures within our Cassandra cluster starting 10:53am UTC. While our Cassandra clusters are architected for high availability, in this particular case we experienced simultaneous failures of 2 Cassandra nodes that caused the cluster to become unreliable.
This in turn cascaded into intermittent issues with the rest of our data pipeline. We removed the instances experiencing issues from the cluster. The system began to return to healthy status at around 12:43pm UTC.
We are in the process of a deeper, root cause analysis to determine the precise cause of the failure of the Cassandra nodes in question. No data previously stored was lost or impacted.
Posted almost 2 years ago. Aug 09, 2017 - 15:22 UTC