Developed by the software engineers at Netflix, chaos engineering is an increasingly popular methodology for testing distributed software systems, designed to unearth failures before they become outages. How is this achieved? By deliberately – yet carefully – injecting failures into software systems in production to test how they perform under stress. In short, chaos engineers break things on purpose in order to learn how to make them more resilient.
This is accomplished by conducting chaos experiments, in which faults or high-stress scenarios are injected and the system’s reaction is observed. These experiments can involve introducing latency or errors, simulating a server or datacenter failure, a network outage, or a large traffic spike, or recreating any other unpredictable circumstance that could lead to a service outage. The ultimate goal is to uncover weaknesses, learn how your system behaves in the event of disaster, and generate new information about how systems as a whole react when individual components fail.
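To make the idea of injecting latency or errors concrete, here is a minimal Python sketch of a fault-injecting wrapper. It is an illustration only, not how Chaos Monkey or any other real tool works: the names `inject_faults`, `InjectedFault`, `fetch_profile`, and `run_experiment` are all hypothetical, and a real experiment would target a live service rather than a local function.

```python
import random
import time
from functools import wraps


class InjectedFault(Exception):
    """Raised in place of a real failure when the experiment injects an error."""


def inject_faults(error_rate=0.0, latency_rate=0.0, latency_s=0.0,
                  rng=None, sleep=time.sleep):
    """Wrap a function so that some calls are delayed or fail on purpose.

    error_rate and latency_rate are probabilities between 0.0 and 1.0.
    rng and sleep are injectable so the behaviour can be made repeatable.
    """
    rng = rng or random.Random()

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            # Simulate a slow dependency for a fraction of calls.
            if rng.random() < latency_rate:
                sleep(latency_s)
            # Simulate a failing dependency for a fraction of calls.
            if rng.random() < error_rate:
                raise InjectedFault(f"injected failure in {func.__name__}")
            return func(*args, **kwargs)
        return wrapper

    return decorator


def fetch_profile(user_id):
    """Hypothetical stand-in for a call to a downstream service."""
    return {"id": user_id}


def run_experiment(call, n_requests):
    """Send n_requests through `call` and count successes and injected failures."""
    ok = failed = 0
    for user_id in range(n_requests):
        try:
            call(user_id)
            ok += 1
        except InjectedFault:
            failed += 1
    return ok, failed


if __name__ == "__main__":
    # Fail roughly 20% of calls; a seeded RNG keeps the run repeatable.
    flaky = inject_faults(error_rate=0.2, rng=random.Random(42))(fetch_profile)
    ok, failed = run_experiment(flaky, 100)
    print(f"succeeded: {ok}, failed: {failed}")
```

The point of the experiment is what happens around the wrapper: whether callers retry, fall back, or alert when the injected failures appear.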
A Brief History
Chaos engineering is a relatively new approach to software quality assurance (QA) and software testing. It was pioneered by the team at Netflix about a decade ago when the subscription streaming service began transitioning from its own data centers to the public cloud. The team quickly identified a need to create services with higher resiliency in this new cloud architecture. To that end, in 2010, they created Chaos Monkey – an automated testing tool that randomly chooses a server and disables it during its usual hours of activity – and subsequently released it as open-source software two years later. As former VP of Product Engineering at Netflix John Ciancutti put it on the Netflix Tech Blog, “The Chaos Monkey’s job is to randomly kill instances and services within our architecture. If we aren’t constantly testing our ability to succeed despite failure, then it isn’t likely to work when it matters most – in the event of an unexpected outage.”
Since then, Netflix has developed a whole Simian Army, consisting of additional software testing tools on top of Chaos Monkey – Chaos Gorilla, Latency Monkey, Security Monkey, Doctor Monkey, etc. – which together enable testing of further failure states in order to build resilience to those as well.
Today, chaos engineering is on the rise, with many large tech companies – including Twilio, Facebook, Google, Microsoft, Amazon, and LinkedIn – adopting the practice to better understand their distributed systems and architectures.
So – that’s the history. But is chaos engineering right for your organization? To answer this question, you will need to gain a comprehensive understanding of this groundbreaking approach to software testing. To help you do that, we’ve penned our own guide – ‘A Complete Introduction to Chaos Engineering’ – but today, we thought we’d introduce you to some crucial further reading, as well.
Below, we’ve put together a list of five great chaos engineering blogs and resources that are essential reading if you want to learn more about getting started with chaos experiments. We’ve chosen articles that are insightful, well written, and easy to digest, and that will pave your way to a better understanding of how and where chaos engineering will be a boon to your organization.
Let’s dive in…
- Principles of Chaos Engineering
It’s always best to begin at the beginning. The Principles of Chaos Engineering is a community-maintained document outlining the fundamentals of chaos – what it is, why we need it, and how more resilient and secure systems are built with it.
From chaos in practice to advanced techniques, chaos is defined succinctly and clearly, and the document lists all the core principles agreed upon by the chaos community.
“An empirical, systems-based approach addresses the chaos in distributed systems at scale and builds confidence in the ability of those systems to withstand realistic conditions. We learn about the behavior of a distributed system by observing it during a controlled experiment. We call this Chaos Engineering.”
- Testing in Production: Yes, You Can (and Should)
Chaos engineering, of course, means testing in production (TiP) – i.e. performing various tests in a production state or live environment. This can seem a little scary in the context of chaos, where engineers are deliberately injecting failure into a live environment, and thereby impacting the end-user experience of an application in one way or another.
But, as Charity Majors, CEO at monitoring tool vendor Honeycomb, puts it in her blog post for opensource.com – ‘Testing in Production: Yes, You Can (and Should)’ – failure is not your enemy, but your friend.
“Embrace failure. Chaos and failure are your friends. The issue is not if you will fail, it is when you will fail, and whether you will notice. It’s between whether it will annoy all of your users because the entire site is down, or if it will annoy only a few users until you fix it at your leisure the next morning. Once upon a time, these were optional skills, even specialties. Not anymore. These are table stakes in your new career as a distributed systems engineer.”
- Chaos Engineering – Withstanding Turbulent Conditions in Production
Testing in production is a little like playing with fire. You’re introducing failure to a real-world setting, which means that things can go wrong. Naturally, in a chaos experiment, your aim will be to keep your experiment contained by isolating an experimental group, and exposing only that group to a simulated real-world failure event, such as a server crash, or a large spike in traffic.
However, chaos engineering is not about creating chaos, but rather about preventing it, and the whole purpose of testing in production is to uncover weaknesses to be fixed, and thereby build up the resilience of your system. In this excellent post – ‘Withstanding Turbulent Conditions in Production’ – Codecentric goes into great detail about designing robust chaos experiments that are at once safe and insightful.
“The aim of Chaos Engineering is to run it in production, always being in control of the situation and keeping customers unaffected.”
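As a rough illustration of keeping an experiment contained, the Python sketch below assigns a small, stable slice of users to an experimental group and decides when to halt. It is not taken from the Codecentric post: the function names, the hashing scheme, and the 5% threshold are all assumptions made for the example.

```python
import hashlib


def in_experiment_group(user_id: str, percent: float) -> bool:
    """Deterministically place roughly `percent` (0 to 100) of users in the group.

    Hashing the ID keeps each user in the same group across requests,
    so the failure is confined to a stable, small set of users.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return bucket < percent * 100


def should_abort(control_errors: int, control_total: int,
                 experiment_errors: int, experiment_total: int,
                 max_extra_error_rate: float = 0.05) -> bool:
    """Halt when the experimental group fails noticeably more than the control.

    The 0.05 default is an arbitrary illustrative threshold, not a recommendation.
    """
    if control_total == 0 or experiment_total == 0:
        return False  # Not enough data to compare yet.
    control_rate = control_errors / control_total
    experiment_rate = experiment_errors / experiment_total
    return experiment_rate - control_rate > max_extra_error_rate


if __name__ == "__main__":
    users = [f"user-{n}" for n in range(10_000)]
    exposed = sum(in_experiment_group(u, percent=1.0) for u in users)
    print(f"{exposed} of {len(users)} users would see the injected failure")
    print("abort?", should_abort(1, 100, 10, 100))
```

Comparing the experimental group against a control group, rather than against a fixed number, helps separate the effect of the injected failure from whatever else is happening in production at the time.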
- Chaos Engineering Is Not Just Tools – It’s Culture
Failure as a Service provider Gremlin knows a thing or two about chaos. Perhaps most importantly, it knows that chaos tools are just tools – in order to find success with chaos, organizations need to embrace failure as a culture.
In the company’s excellent post ‘Chaos Engineering Is Not Just Tools – It’s Culture’, Gremlin illustrates the importance of chaos being embraced by the whole organization – not just isolated teams.
“Make sure everyone has full access to all chaos and monitoring tools so anyone can see what experiments are running and halt them at any time. At no point should anyone run risky experiments without warning others. Though if someone breaks that rule – and some day, they will – everyone needs access to the kill switch. This is the practice of Chaos Engineering. Yes, each service team may curate its own documentation and architecture diagrams from the comfort of its own silo, yet these documents – as current as anyone can keep them – are no substitute for collaborative practice.”
- The Limitations of Chaos Engineering
While chaos is undoubtedly a great tool for improving the resilience of your system, it’s not a panacea. This is something that freelance solutions architect Mathias Lafeldt explains passionately in his Medium post ‘The Limitations of Chaos Engineering’.
The truth is, chaos may not always be the right fit for your company, and if you’re looking for a balanced view on the practice’s limitations, you should definitely check out Lafeldt’s wise words of warning before jumping in.
To be clear, Lafeldt is a chaos advocate – but, as he eloquently puts it, “Being an advocate of something doesn’t mean you should close your eyes to its downsides and limitations. In fact, the most skilled engineers are well aware of the pros and cons of their favorite tool or method, and consider them carefully. […] Chaos Engineering is a means to an end, not an end in itself. Experimenting on a distributed system is of great worth, but what matters, in the end, is the production service you aim to improve in the first place. Breaking things is a ton of fun, I can attest to that, but as long as you don’t feed results back – by fixing flaws, tweaking runbooks, training people – your chaos experiments are rarely more than a time killer.”
Chaos engineering isn’t just a buzzword – it’s here to stay, and every software engineer should, at the very least, be familiar with the basics of proactive failure testing as a means to create better, more robust, and more resilient systems. If you want to improve your knowledge, then you need to start reading. The blog posts and resources highlighted above are a great place to start your chaos journey, but you’ll need to keep exploring, keep reading, and eventually start seriously considering whether chaos can help you build the one thing your system needs most – resilience.