Chaos engineering practices are becoming widely adopted; a common example is Netflix's Chaos Monkey. However, Chaos Monkey is often run ad hoc against random targets. I'm curious how chaos experiments might fit into a typical CI/CD pipeline to improve a specific service's resiliency.
- Since chaos experiments usually require a fully functional environment, when would they run? In parallel with other testing, or downstream of it?
- Would you run a chaos experiment on every commit, or only certain ones?
- How long would you allow the chaos experiments to run? A 60-minute CPU spike might conflict with a "fail fast" approach, for example.
- Would a chaos experiment ever fail the pipeline? What would constitute a "failure"?
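To make the questions concrete, here's a rough sketch of where I imagine such a stage might sit, using GitLab CI syntax. The stage names, the `run-chaos-experiment.sh` wrapper, and its flags are all hypothetical placeholders, not a real tool:

```yaml
stages:
  - build
  - test
  - deploy-staging
  - chaos            # hypothetical downstream stage, after a full staging deploy

chaos-experiment:
  stage: chaos
  # One option from the questions above: run only on the main branch,
  # not on every commit.
  rules:
    - if: '$CI_COMMIT_BRANCH == "main"'
  timeout: 15m       # bound the experiment so it doesn't undermine "fail fast"
  script:
    # Hypothetical wrapper: injects a fault into staging, watches the
    # service's health checks, and exits non-zero (failing the pipeline)
    # if the service doesn't stay healthy under the fault.
    - ./run-chaos-experiment.sh --fault cpu-spike --duration 10m
```

This sketch assumes the answers I'm unsure about (downstream of deploy, main-branch only, bounded duration, pipeline-failing), so I'd welcome corrections to any of those placements.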