Chaos Engineering

Chaos engineering is a form of experimentation that uncovers unknown behaviors of a distributed system by subjecting it to different kinds of failure.

It isn't testing. Testing verifies that a system behaves as expected under known conditions, i.e. the situations for which the system was designed.

Chaos is about introducing increased load, added latency, communication issues, node failures, hardware failures, network unavailability and similar faults between the components of a distributed system. Chaos experiments create situations for which the system wasn't explicitly designed, but in which it should still perform to expectations.
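As a minimal sketch of what injecting latency and failures between two components could look like, the Python below wraps a call to a dependency so that it is delayed and sometimes fails. Every name here (with_chaos, InjectedFault, fetch_price, price_with_fallback) is hypothetical and chosen only for illustration; real chaos tooling usually injects faults at the network or infrastructure level rather than in application code.

```python
import random
import time


class InjectedFault(Exception):
    """Raised in place of a real dependency failure (hypothetical type)."""


def with_chaos(call, failure_rate=0.0, extra_latency_s=0.0, rng=None):
    """Wrap `call` so it is delayed and fails with probability `failure_rate`."""
    rng = rng or random.Random()

    def wrapped(*args, **kwargs):
        if extra_latency_s > 0:
            time.sleep(extra_latency_s)  # simulated network delay
        if rng.random() < failure_rate:
            raise InjectedFault("simulated dependency failure")
        return call(*args, **kwargs)

    return wrapped


def fetch_price(item):
    # Hypothetical downstream call; stands in for a real remote service.
    return {"item": item, "price": 10}


def price_with_fallback(fetch, item):
    # The caller under experiment: does it degrade gracefully when the
    # dependency fails, or does the failure propagate?
    try:
        return fetch(item)
    except InjectedFault:
        return {"item": item, "price": None}


flaky_fetch = with_chaos(fetch_price, failure_rate=0.3, extra_latency_s=0.01)
print(price_with_fallback(flaky_fetch, "book"))
```

The point of the experiment is the caller, not the wrapper: the question is whether price_with_fallback still meets expectations when its dependency misbehaves.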

Distributed systems, where components query each other through spaghetti call graphs, become too complex for an architect to hold in their head. Neural networks, machine intelligence algorithms and similar fields are good examples of such systems. No single person can possibly comprehend the effects of a change in one component on the entire system. This is where chaos engineering allows us to uncover new properties of the system.

Injecting a failure is expected to cause dependent services to fail. Often, however, unexpected services fail as well, which tells us that the failure domain was larger than, or different from, what the designers expected. This brings out a better understanding of the system design.
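The comparison between the expected and the observed failure domain can be sketched in Python as follows. The call graph, the service names and the "observed" set are all made up for illustration; in practice the expected set would come from the documented dependencies and the observed set from the experiment itself.

```python
from collections import deque


def failure_domain(depends_on, failed):
    """Return every service that transitively depends on `failed`."""
    # Invert the edges: for each service, record who calls it.
    dependents = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(service)

    # Breadth-first walk from the failed service up through its callers.
    affected, queue = set(), deque([failed])
    while queue:
        current = queue.popleft()
        for caller in dependents.get(current, ()):
            if caller not in affected:
                affected.add(caller)
                queue.append(caller)
    affected.discard(failed)  # only relevant if the graph has a cycle
    return affected


# Hypothetical documented call graph: service -> services it calls.
CALLS = {
    "checkout": ["payments", "inventory"],
    "payments": ["auth"],
    "inventory": [],
    "search": ["inventory"],
    "auth": [],
}

expected = failure_domain(CALLS, "auth")
observed = {"checkout", "payments", "search"}  # hypothetical experiment result
surprises = observed - expected
print(sorted(expected), sorted(surprises))
```

Anything in surprises points at a dependency the designers did not know about, here an undocumented link between search and auth.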

Often systems are perceived to be too critical to undergo chaos experiments, for example self-driving car systems or defense missile systems. Here the experiments need to be controlled, with a smaller blast radius, and self-aborting: they measure deviation from the steady state and abort when a threshold is reached. Obviously, before such systems are subjected to chaos experiments, they need to be tested and a high level of confidence established in their resiliency.
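A self-aborting experiment of the kind described above might be structured like this minimal Python sketch. The function run_experiment, its parameters and the simulated metric readings are assumptions made for the example; a real setup would read the steady-state metric from monitoring and pause between samples.

```python
def run_experiment(inject, rollback, read_metric, baseline, tolerance, max_steps):
    """Inject a fault, watch a steady-state metric, abort on excess deviation."""
    deviation = 0.0
    inject()
    try:
        for step in range(max_steps):
            deviation = abs(read_metric() - baseline)
            if deviation > tolerance:
                # Threshold reached: stop the experiment early.
                return {"aborted": True, "step": step, "deviation": deviation}
        return {"aborted": False, "step": max_steps, "deviation": deviation}
    finally:
        rollback()  # always undo the fault, even on abort or error


# Simulated success-rate readings (hypothetical): steady at 0.99, then a dip.
readings = iter([0.99, 0.98, 0.90, 0.99])
events = []

result = run_experiment(
    inject=lambda: events.append("inject"),
    rollback=lambda: events.append("rollback"),
    read_metric=lambda: next(readings),
    baseline=0.99,
    tolerance=0.05,
    max_steps=4,
)
print(result["aborted"], result["step"], events)
```

Putting the rollback in a finally block is the design choice that matters: the fault is removed whether the experiment completes, aborts, or crashes.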

Certain industries, like healthcare and defense, can argue that chaos experimentation could cause loss of life. However, there are always ways for them to think beyond this objection and strengthen their systems. One way is to take the backup or DR systems, have them behave like production, and run the experiments there. Another school of thought is to put a few users or customers at slight inconvenience in order to ensure that the system performs well enough for the majority of users.
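One way to limit an experiment to a few users is to assign each user to a stable bucket and expose only a small fraction of buckets. The Python below is a sketch under that assumption; the function in_experiment, the salt and the user id format are hypothetical.

```python
import hashlib


def in_experiment(user_id, fraction, salt="chaos-exp-1"):
    """Deterministically place roughly `fraction` of users in the experiment."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) % 10_000  # stable bucket in [0, 9999]
    return bucket < fraction * 10_000


# Expose roughly 1% of hypothetical users to the injected fault.
users = [f"user-{n}" for n in range(10_000)]
exposed = [u for u in users if in_experiment(u, 0.01)]
print(len(exposed))
```

Because the assignment depends only on the salt and the user id, the same users stay in the cohort for the whole experiment, and changing the salt picks a different cohort for the next one.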