1

Chaos engineering practices are becoming very widely used. One common example is Netflix' own Chaos Monkey. However, Chaos Monkey is often run ad-hoc against random targets. I'm curious how chaos experiments might work in a typical CI/CD pipeline to enhance a specific service's resiliency.

  • Since chaos experiments (usually) require a fully functional environment, when would they run? Would it run parallel to testing, or downstream?
  • Would you run a chaos experiment with every commit, or just some?
  • How long would allow the chaos experiments to run? A 60 minute CPU spike might interfere with a "fail fast" approach, for example.
  • Would a chaos experiment ever fail the pipeline? What would constitute a 'failure'?
Jay Spang
  • 2,013
  • 7
  • 27
  • 32

2 Answers2

1

We are just getting started with our chaos engineering efforts, but I'll offer some thoughts regarding your questions.

There are at least three distinct classes of experiment:

  • Instance/container kills that we expect the underlying infrastructure to handle automatically.
  • Higher-level but fairly localized failures like slow or unavailable dependencies.
  • Large-scale failures like data center or region down.

For a build pipeline the sweet spot would be in the middle there (i.e. higher-level but localized failures), because usually the software itself plays a role in responding to the failure. For example the software might include a circuit breaker that trips, throttling, automated failover, etc. If those are software functions, then they can either work or not work, and the build should uncover that.

To the extent that resiliency to failure is a system requirement, then yeah, a failed experiment would fail the pipeline. Suppose for instance that build 392 has a correctly working circuit breaker, and that build 393 doesn't. That would be a failure since the build goes from meeting the requirement to not.

0

We usually have some chaos experiments, like large-scale failures outside the pipeline.

During the build pipeline, we usually combine chaos experiments with a short performance test to simulate activity and then kill some instances/container to check the resilience of the system. And fail if the system is not able to recover.

  • Your answer could be improved with additional supporting information. Please [edit] to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers [in the help center](/help/how-to-answer). – Community Aug 05 '22 at 01:07