Pioneered by software engineers at Netflix, chaos engineering is an increasingly popular methodology for testing distributed software systems, designed to unearth failures before they become outages. How is this achieved? By deliberately, yet carefully, injecting failures into software systems in production to test how they perform under stress. In short, chaos engineers break things on purpose in order to learn how to make them more resilient.
This is accomplished by conducting chaos experiments, in which faults or high-stress scenarios are injected to see how the system reacts. An experiment might introduce latency or errors, or simulate a server or data center failure, a network outage, a large traffic spike, or any other unpredictable circumstance that could lead to a service outage. The ultimate goal is to uncover weaknesses, learn how your system behaves in the event of disaster, and generate new information about how the system as a whole reacts when individual components fail.
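To make the idea of injecting latency or errors concrete, here is a minimal sketch in Python. It is not taken from any particular chaos tool: the names `with_chaos`, `InjectedFault`, `error_rate`, and `max_latency_s` are hypothetical and chosen only for illustration.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class InjectedFault(RuntimeError):
    """Raised when the experiment deliberately fails a call (hypothetical name)."""


def with_chaos(
    call: Callable[[], T],
    *,
    error_rate: float = 0.0,
    max_latency_s: float = 0.0,
    rng: Optional[random.Random] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `call`, optionally delaying it and optionally failing it first.

    error_rate: probability in [0, 1] that the call is replaced by a fault.
    max_latency_s: upper bound, in seconds, of the random delay that is added.
    rng and sleep are injectable so the behavior can be tested deterministically.
    """
    rng = rng or random.Random()
    if max_latency_s > 0:
        # Simulate a slow dependency by waiting a random amount of time.
        sleep(rng.uniform(0.0, max_latency_s))
    if rng.random() < error_rate:
        # Simulate a failing dependency instead of calling the real one.
        raise InjectedFault("injected by chaos experiment")
    return call()
```

Wrapping a dependency call as `with_chaos(fetch_profile, error_rate=0.05, max_latency_s=0.2)` (where `fetch_profile` stands in for any function of your own) would fail roughly one call in twenty and slow the rest, which is enough to observe whether timeouts, retries, and fallbacks behave as expected.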
A Brief History
Chaos engineering is a relatively new approach to software quality assurance (QA) and software testing. It was pioneered by the team at Netflix about a decade ago, when the subscription streaming service began transitioning from its own data centers to the public cloud. The team quickly identified a need to create services with higher resiliency in this new cloud architecture. To that end, in 2010, they created Chaos Monkey, an automated testing tool that randomly chooses a server and disables it during its usual hours of activity, and released it as open-source software two years later. As John Ciancutti, former VP of Product Engineering at Netflix, put it on the Netflix Tech Blog, “The Chaos Monkey’s job is to randomly kill instances and services within our architecture. If we aren’t constantly testing our ability to succeed despite failure, then it isn’t likely to work when it matters most – in the event of an unexpected outage.”
Since then, Netflix has developed a whole Simian Army of additional software testing tools on top of Chaos Monkey (Chaos Gorilla, Latency Monkey, Security Monkey, Doctor Monkey, and others), which together enable testing of further failure states in order to build resilience to those as well.
Today, chaos engineering is on the rise, with many large tech companies (including Twilio, Facebook, Google, Microsoft, Amazon, and LinkedIn) adopting the practice to better understand their distributed systems and architectures.
So that’s the history. But is chaos engineering right for your organization? To answer this question, you will need a solid understanding of this approach to software testing. To help you get there, we’ve penned our own guide, “A Complete Introduction to Chaos Engineering”, but today we thought we’d introduce you to some crucial further reading as well.
Below, we’ve put together a list of five great chaos engineering blogs and resources that are essential reading if you want to learn more about getting started with chaos experiments. We’ve chosen articles that are insightful, well written, and easy to digest, and that will pave your way to a better understanding of how and where chaos engineering can be a boon to your organization.
Let’s dive in…
- Principles of Chaos Engineering
It’s always best to begin at the beginning. The Principles of Chaos Engineering is a community-maintained document outlining the fundamentals of chaos: what it is, why we need it, and how more resilient and secure systems are built with it.
The document defines chaos engineering succinctly and clearly, moves from chaos in practice to more advanced techniques, and lists the core principles agreed upon by the chaos community.
“An empirical, systems-based approach addresses the chaos in distributed systems at scale and builds confidence in the ability of those systems to withstand realistic conditions. We learn about the behavior of a distributed system by observing it during a controlled experiment. We call this Chaos Engineering.”
- Testing in Production: Yes, You Can (and Should)
Chaos engineering means testing in production (TiP), i.e. performing various tests in a production state or live environment. This can seem a little scary in the context of chaos, where engineers are deliberately injecting failure into a live environment and thereby impacting the end-user experience of an application in one way or another.
But, as Charity Majors, CEO at monitoring tool vendor Honeycomb, puts it in her blog post for opensource.com, “Testing in Production: Yes, You Can (and Should)”, failure is not your enemy, but your friend.
“Embrace failure. Chaos and failure are your friends. The issue is not if you will fail, it is when you will fail, and whether you will notice. It’s between whether it will annoy all of your users because the entire site is down, or if it will annoy only a few users until you fix it at your leisure the next morning. Once upon a time, these were optional skills, even specialties. Not anymore. These are table stakes in your new career as a distributed systems engineer.”
- Chaos Engineering – Withstanding Turbulent Conditions in Production
Of course, testing in production is a little like playing with fire. You’re introducing failure to a real-world setting, which means that things can go wrong. Naturally, in a chaos experiment, your aim will be to keep the experiment contained by isolating an experimental group and exposing only that group to a simulated real-world failure event, such as a server crash or a large spike in traffic.
However, chaos engineering is not about creating chaos but about preventing it. The whole purpose of testing in production is to uncover weaknesses to be fixed, and thereby build up the resilience of your system. In its excellent post, “Withstanding Turbulent Conditions in Production”, Codecentric goes into great detail about designing robust chaos experiments that are at once safe and insightful.
“The aim of Chaos Engineering is to run it in production, always being in control of the situation and keeping customers unaffected.”
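One common way to isolate an experimental group is to hash each user into a stable bucket and expose only a small percentage of buckets to the fault. The sketch below shows that general technique under our own assumptions; it is not Codecentric’s method, and the function name `in_experiment_group` and its parameters are hypothetical.

```python
import hashlib


def in_experiment_group(user_id: str, experiment: str, percent: float) -> bool:
    """Deterministically place roughly `percent`% of users in the experiment.

    Hashing the experiment name together with the user ID gives every user
    a stable bucket from 0 to 9999, so the same user always lands in the
    same group for a given experiment, and different experiments select
    different subsets of users.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    # percent=5 selects buckets 0..499, i.e. about 5% of all users.
    return bucket < percent * 100
```

Because the assignment is deterministic, the affected group stays the same for the whole experiment, which makes it easier to compare its metrics against the unaffected control group and to limit the impact if something goes wrong.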
- Chaos Engineering Is Not Just Tools – It’s Culture
Failure-as-a-Service provider Gremlin knows a thing or two about chaos, perhaps most importantly that chaos tools are just tools: to find success with chaos, organizations need to embrace failure as part of their culture.
In the company’s excellent post “Chaos Engineering Is Not Just Tools – It’s Culture”, Gremlin illustrates the importance of chaos being embraced by the whole organization, not just isolated teams.
“Make sure everyone has full access to all chaos and monitoring tools so anyone can see what experiments are running and halt them at any time. At no point should anyone run risky experiments without warning others. Though if someone breaks that rule – and some day, they will – everyone needs access to the kill switch. This is the practice of Chaos Engineering. Yes, each service team may curate its own documentation and architecture diagrams from the comfort of its own silo, yet these documents – as current as anyone can keep them – are no substitute for collaborative practice.”
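The kill switch mentioned in the quote can be pictured as a shared flag that the experiment checks before each step. The sketch below is a simplified illustration and not Gremlin’s API; the class `ChaosExperiment` and its methods are hypothetical names.

```python
import threading
from typing import Callable, List, Tuple

# A step is a label plus a zero-argument action that injects one fault.
Step = Tuple[str, Callable[[], None]]


class ChaosExperiment:
    """Runs fault-injection steps in order until someone halts the experiment."""

    def __init__(self, name: str, steps: List[Step]) -> None:
        self.name = name
        self.steps = steps
        # threading.Event is safe to set from any thread, so any operator
        # holding a reference to the experiment can stop it.
        self._halted = threading.Event()

    def halt(self) -> None:
        """The kill switch: no further steps start after this is called."""
        self._halted.set()

    @property
    def halted(self) -> bool:
        return self._halted.is_set()

    def run(self) -> List[str]:
        """Execute steps in order and return the names of those that ran."""
        completed: List[str] = []
        for step_name, action in self.steps:
            if self._halted.is_set():
                break
            action()
            completed.append(step_name)
        return completed
```

In this sketch the switch only prevents later steps from starting; a real setup would also need to roll back any fault that is already active, and to expose `halt` somewhere everyone in the organization can reach it.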
- The Limitations of Chaos Engineering
While chaos is undoubtedly a great tool for improving the resilience of your system, it’s not a panacea. This is something that freelance solutions architect Mathias Lafeldt explains passionately in his Medium post “The Limitations of Chaos Engineering”.
The truth is, chaos may not always be the right fit for your company, and if you’re looking for a balanced view of the practice’s limitations, you should definitely check out Lafeldt’s wise words of warning before jumping in.
To be clear, Lafeldt is a chaos advocate, but, as he eloquently puts it, “Being an advocate of something doesn’t mean you should close your eyes to its downsides and limitations. In fact, the most skilled engineers are well aware of the pros and cons of their favorite tool or method, and consider them carefully. […] Chaos Engineering is a means to an end, not an end in itself. Experimenting on a distributed system is of great worth, but what matters, in the end, is the production service you aim to improve in the first place. Breaking things is a ton of fun, I can attest to that, but as long as you don’t feed results back – by fixing flaws, tweaking runbooks, training people – your chaos experiments are rarely more than a time killer.”
Final Thoughts
Chaos engineering isn’t just a buzzword. It’s here to stay, and every software engineer should, at the very least, be familiar with the basics of proactive failure testing as a means to create better, more robust, and more resilient systems. If you want to improve your knowledge, then you need to start reading. The blog posts and resources highlighted above are a great place to start your chaos journey, but you’ll need to keep exploring, keep reading, and eventually start seriously considering whether chaos can help you build the one thing your system needs most: resilience.
Top Chaos Engineering Blogs
Top chaos engineering blogs: 1. Principles of Chaos Engineering. 2. Testing in Production: Yes, You Can (and Should). 3. Chaos Engineering – Withstanding Turbulent Conditions in Production. 4. Chaos Engineering Is Not Just Tools – It’s Culture. 5. The Limitations of Chaos Engineering.