Welcome to chaos engineering, where failure is instructive. Mistakes have the power to turn a creation into something better than it was before. Experience is simply the name we give to our past errors. We all know the sayings, and we’ve all heard the clichés – but clichés only become clichés when they are true. So, if we agree that we learn more from failure than we do from first-time successes, shouldn’t we be trying to fail more often?
The Need for Resilience
Of course, nobody likes to fail, and no software engineer would ever strive towards failure as the ultimate outcome of a project. Quite the opposite, in fact. Developers want their software systems to be operationally successful – which means that these systems need to be resilient against failures.
Infrastructure failures, network failures, application failures – there are countless things that can go wrong when running large-scale distributed software systems, any of which could lead to outages and cause customer harm. Hard disks can fail, a sudden surge in traffic can overload the system, a network outage can cut off connectivity – you name it. What’s more, as systems scale, they become more complex. And in complex systems, even when all individual services and components are functioning correctly, interactions and unforeseen dependencies between those services and components can produce unpredictable outcomes – outcomes that can trigger outages, poor performance, and other unwanted and unacceptable consequences (more failures, in other words).
Successful software systems are those that are resilient in the face of failure. The problem, however, is that failures are hard to predict – yet, when they happen, they can be extremely costly for the business. Outages, of course, impair customer journeys. Depending on the application, customers may be trying to shop, perform business transactions, or simply get work done – but when outages occur and the service goes down, it’s not only customer satisfaction that’s affected, but the company’s bottom line, too.
The Costs of Failure
Even brief outages can impact a company’s revenue stream and profits. As such, the cost of downtime is becoming one of the most important KPIs (key performance indicators) for many development teams. Many studies have tried to put a figure on how much downtime costs a business. According to Gartner, the average cost of downtime is $5,600 per minute. That adds up to an eye-watering $336,000 per hour. In 2017, ITIC ran an independent survey to measure downtime costs. It found that 98% of organizations say that a single hour of downtime costs over $100,000, with 81% putting the figure at over $300,000. For 33% of businesses, 60 minutes of downtime would cost their firms between $1 million and $5 million.
The truth is, of course, that the exact cost of downtime will depend on the business model and size of the organization. If you run an ecommerce store, the cost of downtime will be the number of lost sales multiplied by the average sale amount. If you make your money by running ads, the cost of downtime will be the lost ad revenue during the outage. If you’re running a ride-hailing service, you’d be looking at the number of rides that were lost multiplied by the expected average fare during the time of the outage.
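The back-of-the-envelope arithmetic above can be sketched in a few lines. The figures below are purely hypothetical, and each function is simply the multiplication described in the text:

```python
def downtime_cost_ecommerce(lost_sales: int, avg_sale: float) -> float:
    """Ecommerce store: number of lost sales x average sale amount."""
    return lost_sales * avg_sale

def downtime_cost_ads(ad_revenue_per_minute: float, outage_minutes: float) -> float:
    """Ad-supported site: ad revenue forgone during the outage."""
    return ad_revenue_per_minute * outage_minutes

def downtime_cost_ride_hailing(lost_rides: int, avg_fare: float) -> float:
    """Ride-hailing service: lost rides x expected average fare."""
    return lost_rides * avg_fare

# Hypothetical example: 500 lost sales at $80 each during an outage
print(downtime_cost_ecommerce(500, 80.0))  # → 40000.0
```

Simple as these formulas are, agreeing on them up front is what lets a team turn "the site was down for an hour" into a concrete dollar figure.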
Then, on top of all that, you’ve got to factor in the cost of lost employee productivity – which can, indeed, be substantial. In 2016, IHS Markit surveyed 400 companies and found downtime was costing them a collective $700 billion per year – 78% of which was from lost employee productivity during outages. “Our research found that the cost of ICT downtime is substantial, from $1 million a year for a typical mid-size company to over $60 million for a large enterprise,” said Matthias Machowinski, Directing Analyst for Enterprise Networks and Video at IHS. “The main cost of downtime is lost productivity and revenue. Fixing the problem is a minor cost factor, which means a small investment in increasing the reliability of ICT systems will provide an outsized return by reducing productivity and revenue losses.”
In sum, a single outage can potentially cost an organization hundreds of thousands if not millions of dollars. Companies need a workable solution to this challenge – waiting around for the next costly outage to happen is simply not an option. As such, more and more organizations are turning to Chaos Engineering to meet the challenge head on.
What Is Chaos Engineering?
Chaos engineering is the discipline of experimenting on a distributed software system in the form of deliberate failure injection. Failure is instructive, after all, and so the purpose of injecting failure into a system is to test the system’s ability to respond to it. In other words, chaos engineering is the disciplined approach of proactively forcing applications and services to fail in order to learn more about how to keep them running.
Organizations need to identify weaknesses before they manifest in system-wide aberrant behaviors. Chaos engineering is therefore about testing how a system responds under stress, so that engineers can identify and fix problems before they make headlines and cost the company millions.
With chaos engineering, developers quite literally break things on purpose – not to leave them broken, but rather to compare what they think will happen in the face of failure against what actually happens. In this way, the engineer learns precisely how to build and maintain systems that are resilient against infrastructure failures, network failures and application failures.
Despite its name, chaos engineering is anything but chaotic. In reality, chaos engineering involves careful, thoughtful, and meticulously planned experiments.
In practice, these experiments typically follow four steps. First, engineers define the “steady state” of the system – i.e. the measurable output that indicates normal behavior. Second, two groups are created – a control group and an experimental group – and engineers hypothesize about the expected outcome of an injected failure before running it live against the experimental group. Third, variables (i.e. failures) that reflect real-world events are introduced to the experimental group – for example, network connection failures, hard drive failures, server failures, etc. Fourth, engineers test their hypothesis by looking for differences in steady state between the control group and the experimental group. If the steady state is impacted in the experimental group, engineers have identified a weakness, and can move to address that weakness before it manifests in the system at large.
Importantly, experiments are contained so that the “blast radius” – i.e. the potential real-world impact – of the injected failure is kept to a minimum. Experimenting in production – or as close to the production environment as possible – of course has the potential to cause customer harm. As such, it is down to the careful and disciplined planning of the chaos engineer to design the smallest possible experiments to test in the system, measuring the impact of failure at each step. If an issue is uncovered, the experiment can be halted – otherwise, the blast radius can be carefully increased, always bearing in mind that it may be necessary to abort the experiment in order to prevent any unacceptable impact to the end user. While there must be allowance for some short-term negative impact on any experimental group, it is the responsibility and obligation of the chaos engineer to ensure any fallout from an experiment is minimized and – crucially – contained.
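The ramp-and-abort discipline described above can be expressed as a short loop. This is only a sketch: the widening blast-radius schedule, the error budget, and the monitoring stand-in are all invented for illustration:

```python
def measure_error_rate(blast_radius_pct: float) -> float:
    """Stand-in for real monitoring. Here impact simply grows with the
    share of traffic exposed to the injected failure (an assumption)."""
    return blast_radius_pct * 0.0001

def run_contained_experiment(error_budget: float = 0.004) -> int:
    """Widen the blast radius step by step, aborting the moment the
    measured impact exceeds the agreed error budget."""
    for blast_radius_pct in (1, 5, 25, 50, 100):  # % of traffic exposed
        error_rate = measure_error_rate(blast_radius_pct)
        if error_rate > error_budget:
            print(f"aborting at {blast_radius_pct}%: "
                  f"error rate {error_rate:.4f} exceeds budget")
            return blast_radius_pct
        print(f"{blast_radius_pct}% exposed: error rate {error_rate:.4f} ok")
    return 100

run_contained_experiment()
```

The key property is that the abort check runs before each widening step, so an uncovered issue halts the experiment while its real-world impact is still small.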
Ultimately, the goal of these experiments is to continuously introduce random and unpredictable behavior – “what if” scenarios – into a system in order to discover its weaknesses. To give an example: a distributed software system is designed to handle a certain number of transactions per second. But what if that limit is approached, reached, or exceeded to the point where performance suffers or the system crashes? Chaos engineering seeks to discover how the software will respond when it experiences such a lack of resources or reaches the point of failure. An experiment is conducted to simulate the scenario. If the system fails under the test conditions, engineers can make design changes that adequately accommodate the scenario. Once the changes have been made, the test is repeated to ensure the solution is solid.
History of Chaos Engineering – The Netflix Story
Chaos engineering is a relatively new approach to software quality assurance (QA) and software testing. One of the concept’s first notable pioneers was Netflix. Netflix first launched its streaming service in 2007 with a library of around 1,000 titles. Its popularity quickly rose, however, and by 2009, this number had grown to around 12,000 titles that subscribers could access on demand.
In 2010, Netflix moved from physical infrastructure to cloud infrastructure provided by Amazon Web Services (AWS). However, this major shift presented a great deal of additional complexity – the level of intricacy and interconnectedness in the distributed system created something that was extremely difficult to manage, and required a new approach to deal with all possible failure scenarios. For example, Netflix needed to be sure that a loss of an AWS instance wouldn’t impact the Netflix streaming experience.
In 2011, the Netflix team decided to address the lack of resilience testing head on by creating a tool to deliberately throw a monkey wrench into the works of the production environment – i.e. the environment used by Netflix customers. This tool was aptly named Chaos Monkey. The overall intent was to move away from a development model that assumed no breakdowns, and towards a model where breakdowns were considered to be inevitable.
“We have found that the best defense against major unexpected failures is to fail often,” wrote the engineering team in the Netflix Tech Blog. “By frequently causing failures, we force our services to be built in a way that is more resilient. […] We have created Chaos Monkey, a program that randomly chooses a server and disables it during its usual hours of activity. Some will find that crazy, but we could not depend on the random occurrence of an event to test our behavior in the face of the very consequences of this event. Knowing that this would happen frequently has created a strong alignment among engineers to build redundancy and process automation to survive such incidents, without impacting the millions of Netflix users. Chaos Monkey is one of our most effective tools to improve the quality of our services.”
The Simian Army
Knowing that one monkey alone doesn’t make a troop, Netflix soon expanded its suite of software testing tools, and the Simian Army was born. The Simian Army added additional failure injection modes on top of Chaos Monkey, enabling testing of further failure states in order to build resilience to those as well.
“The cloud is all about redundancy and fault-tolerance,” wrote Netflix in 2011. “Since no single component can guarantee 100% uptime (and even the most expensive hardware eventually fails), we have to design a cloud architecture where individual components can fail without affecting the availability of the entire system. In effect, we have to be stronger than our weakest link.”
Key combatants in the Simian Army include:
- Latency Monkey – Introduces communication delays to simulate degradation or outages in a network.
- Doctor Monkey – Performs health checks to detect and ultimately remove unhealthy instances.
- Janitor Monkey – Searches for unused resources and disposes of them.
- Security Monkey – Finds security violations and vulnerabilities and terminates offending instances.
- Chaos Gorilla – Similar to Chaos Monkey, but simulates an outage of an entire Amazon availability zone (i.e. one or more entire data centers servicing a geographical location).
Following the introduction of the Simian Army – which continues to grow to this day – Netflix shared the source code for Chaos Monkey on GitHub in 2012.
In 2014, Netflix officially created a new role – the Chaos Engineer. That same year, Netflix announced Failure Injection Testing (FIT), a new tool that built on the concepts of the Simian Army by giving engineers greater control over the blast radius of failure injections. In many ways, the Simian Army had been too effective – in some instances it produced large-scale outages, causing many Netflix developers to grow wary of the tools. FIT gave developers more granular control over the scope of failure injections, meaning they could gain all the crucial insights revealed through chaos engineering while mitigating the potential downsides.
Chaos Engineering Today
Today, chaos engineering is on the rise, with many companies running chaos engineering programs – and offering them as a service for enterprises to use. LinkedIn, for example, uses an open source failure-inducing program called Simoorg. Gremlin is another chaos engineering platform, co-founded by former Netflix employee Kolton Andrus. Gremlin offers Failure as a Service, in which chaos engineers run proactive chaos experiments to verify that an organization’s system can withstand failure – and help fix it if it can’t.
Many large tech companies – including Twilio, Facebook, Google, Microsoft and Amazon, as well as Netflix and LinkedIn – are practicing chaos engineering today to better understand their distributed systems and architectures, and the list is growing.
The reason is that the practice brings many customer, business, and technical benefits. For customers, chaos engineering ensures increased availability and durability of the services they use, meaning disruptive outages are kept to an absolute minimum. For businesses, chaos engineering helps prevent large losses stemming from maintenance costs, lost employee productivity, and, again, service outages and downtime. On the technical side, the insights gleaned from chaos experiments mean fewer incidents to deal with, improved system design, and an increased understanding of system failure modes.
Chaos engineering is a powerful practice that is changing the way that software is designed and developed at some of the largest companies around the globe. There is now an official Principles of Chaos Engineering page, an active online community, and dedicated meetups and events taking place all over the world. While the practice is still very young, and the techniques and tools are still evolving, chaos engineering is gaining momentum. Any organization that builds, operates and relies on a distributed software system that wishes to achieve a high rate of development velocity should be investigating the possibilities that chaos engineering offers, for the approach is one of the most effective for improving resiliency. By introducing a bit more chaos in the short-term, a lot more long-term software stability can ultimately be achieved.
The last word goes to Patrick Higgins, UI Engineer at Gremlin. “One of the interesting things, or the important things, about Chaos Engineering is that it’s a practice. It’s continual. Doing it once is not really an effective mechanism. So it needs to be something that is practiced on a regular basis. Perhaps like a gym membership or musical instrument. You can’t just play a trumpet for 36 hours and be really good at it. What I’m trying to encourage is this idea of thinking about failure from an organizational perspective and creating a culture around it.”