Giving the Benefit of Hanlon’s Razor

There’s usually a good reason for things to be the way they are

The first reaction when something goes wrong is often to cry: “Look at what those idiots did!” When British Airways experienced their recent power outage, everyone rushed to bike-shed what they had done wrong. Was it cost-cutting, was it out-sourcing and off-shoring, or was it perhaps a combination? Lack of any concrete information rarely even slows down the rush to opine on what the key mistake was. However, some simple maths has led me to believe that no more than half of our professional colleagues can possibly be below average – so if you see something that looks incomprehensibly wrong, the odds are good that there is a valid reason for it that you’re not aware of.

The principle of Hanlon’s razor goes something like this: “Never attribute to malice that which is adequately explained by stupidity.”

Lessons learned

I learned this lesson early in my career when, as a cub sysadmin, I took part in a datacenter move. Most of the systems came back up fine at the new location – except for two HP-UX boxes. Upon investigating the problem, we found that they were trying to cross-mount filesystems from each other, and were therefore doing the same dance as two overly polite people, each refusing to be the first through the door, and holding up everyone else behind them.

Pretty stupid, huh?

But there was a reason for it: the datacenter move occurred right in the middle of a system consolidation and migration from HP-UX 10.20 to 11, so those two machines were each doing double-duty, with aliases allowing them to act as a total of four servers. The original four servers had a reasonably sane setup with no cross-mounting, but at some point during the migration, the DNS aliases had been switched, and the filesystems got all crossed up.

Many individual steps had been taken over the course of the migration project, all of which made perfect sense at the time – and the result was so hilariously stupid that we’re still laughing about it nearly two decades later.
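For what it’s worth, the cross-mount standoff is the kind of thing a simple dependency check can flag before a move. The following is only a minimal sketch, not what we actually ran back then: the host names and the `mounts` map are hypothetical, and it assumes you can already list which host mounts filesystems from which other host.

```python
def find_mount_cycle(mounts):
    """Return a list of hosts forming a mount dependency cycle, or None.

    `mounts` maps each host to the hosts it mounts filesystems from.
    This is an illustrative structure, not the output of any real tool.
    """
    visiting, done = set(), set()
    path = []

    def visit(host):
        if host in done:
            return None
        if host in visiting:
            # We have come back to a host we are still exploring: a cycle.
            return path[path.index(host):] + [host]
        visiting.add(host)
        path.append(host)
        for dependency in mounts.get(host, []):
            cycle = visit(dependency)
            if cycle:
                return cycle
        path.pop()
        visiting.discard(host)
        done.add(host)
        return None

    for host in mounts:
        cycle = visit(host)
        if cycle:
            return cycle
    return None


# Hypothetical example: two boxes that each mount from the other.
crossed = {"hpux-a": ["hpux-b"], "hpux-b": ["hpux-a"]}
print(find_mount_cycle(crossed))
```

A check like this would not have told us why the aliases got switched, only that the end state could not boot cleanly.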

A series of unfortunate events

Most IT catastrophes boil down to a similar series of unfortunate events. It is very satisfying to imagine that there is One Root Cause, and if only we can identify it and avoid it in the future, all will be well. This very human tendency can often make us too quick to blame either malice or stupidity during the post-mortem analysis, especially from the outside. But what if it all made perfect sense at the time?

If we were starting from scratch today, very few large IT environments would look like they do. Most line-of-business apps would probably be much better off in some sort of PaaS, for instance. Old, unmaintainable systems loaded down with technical debt would not exist by definition. Everything possible would be software-defined.

Unfortunately, that’s not the world we live in today – but the reasons for things being the way they are were perfectly valid at the time the relevant decisions were made. Calling people names now, today, is not going to help anyone.

Better ways of managing your IT environment

These days, I no longer manage servers by tiling half-a-dozen xterms across my 20” CRT (a big deal back in those days, let me tell you!). Instead, I have spent the past decade-plus trying to help people find better ways of managing their changing IT environment. Unfortunately, many people in my position give advice that is at best naïve and simplistic, and generally boils down to copying the latest idea from the FANGs (Facebook, Amazon, Netflix, and Google). The question that never gets addressed is whether such slavish flattery is a good fit for companies that are not FANGs and have a bit more history and legacy to deal with. It’s all very well to say that we should “treat infrastructure as code”, but it’s hard to do that with physical infrastructure and applications that assume persistent hardware to run on. Even Amazon doesn’t run 100% on AWS…

This is not to say that there is no need to change our approach to IT management. Too many commonly-held assumptions are rooted in a worldview that is no longer valid:

  • Once deployed, hardware no longer changes (except for failures)
  • End to end deployment time for new hardware can be measured in days if not weeks
  • New hardware will only ever show up in the datacenter through official processes (usually a purchase order from a vendor)
  • Hardware is hardly ever destroyed unpredictably
  • Adjusting capacity in response to user demand is a process that takes many meetings over months if not years
  • The distinction between hardware and software is always clear-cut and obvious
  • Software deployment is an infrequent occurrence and can be prepared for in depth, including detailed contingency planning

New technology is not static

All of these assumptions boil down to an expectation of a relatively static and unchanging IT environment. The result is a set of equally static management approaches.

Conversely, new enterprise IT technologies are anything but static. On the compute side, the virtualisation wave started around fifteen years ago, but these days pretty much anything can be software-defined, meaning easily and rapidly changeable, including through automated unsupervised processes.

Significant friction (meaning: much heat but very little light) is created when these two worldviews come into contact, if not outright conflict.

We are in the middle of a transition from imperative approaches to IT management (create infrastructure components directly; when a problem occurs, debug & fix it manually) to a declarative approach (describe what is wanted, run tools to reconcile infrastructure to requirements; when anything breaks, simply run the tools again).
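To make the declarative idea a little more concrete, here is a minimal sketch of a reconcile step, assuming infrastructure can be represented as a plain dictionary of named components. The names `reconcile`, `converge`, `desired`, and `actual` are made up for illustration and do not come from any particular tool; real tools would call provisioning APIs where this sketch just edits a dictionary.

```python
def reconcile(desired, actual):
    """Return the actions needed to make `actual` match `desired`."""
    actions = []
    for name, spec in desired.items():
        if name not in actual:
            actions.append(("create", name))
        elif actual[name] != spec:
            actions.append(("update", name))
    for name in actual:
        if name not in desired:
            actions.append(("delete", name))
    return actions


def converge(desired, actual):
    """Apply the reconcile actions in place (a stand-in for real provisioning)."""
    for action, name in reconcile(desired, actual):
        if action == "delete":
            del actual[name]
        else:
            actual[name] = dict(desired[name])
    return actual


# Hypothetical environment: one drifted server, one missing, one left over.
desired = {"web": {"cpus": 2}, "db": {"cpus": 4}}
actual = {"web": {"cpus": 1}, "old-batch": {"cpus": 1}}

print(reconcile(desired, actual))   # the plan
converge(desired, actual)
print(reconcile(desired, actual))   # nothing left to do after converging
```

The point of the design is that running it a second time is harmless: once the environment matches the description, the plan is empty, which is what makes “simply run the tools again” a reasonable response to breakage.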

This is where we come full circle: both sides are 100% right. There is no One True Way to do things, and no easy excuse for calling people idiots if they do anything different. Legacy systems still have value; one definition of “legacy” is “it’s how you got where you are”. The benefit of upgrading those legacy applications or porting them to more modern architectures is often dwarfed by the risks involved. Conversely, new approaches really do have significant advantages, but they also need to be managed in new ways.

Bad things can happen

What is needed is, firstly, recognition of the validity of the other position, and the reasons for its coming into existence. Secondly, new generations of tools are required that can support the new processes which can span these two worlds. Everyone wants to be more agile, but sometimes you really do need to wait for the Change Advisory Board to meet, because if you mess up due to moving too fast and breaking the wrong thing, Something Bad could happen.

Note that Something Bad may well not be a technical failure. In the usual way of technology, promising new approaches that were not ready for prime time have been maturing rapidly, but other aspects, such as licensing, have not kept up with that rapid evolution. For instance, what about the potentially unlimited license exposure that might be created through seemingly trivial container provisioning? https://medium.com/@JonHall_/dockers-keynote-story-is-great-until-the-software-auditors-knock-on-the-door-19c5292e18f9

There are three legs to the stool: people, process, and technology. As usual, the technology is the easy part, once the people have decided what their process should be. Once you get beyond trivial whiteboard exercises and start to look at rolling out at scale, where functioning approaches already exist, things get more complicated. Distrust anyone who shows up with the One True Solution, especially if it’s a one-size-fits-all approach. What that usually means is that the organisation has to twist and contort itself into the shape of the One True Solution, when it should be the other way around.

Dominic Wellington

Dominic Wellington is Director of Marketing for Europe at Moogsoft, helping companies adopt AIOps to streamline their IT Operations and become more agile and responsive to ever-changing demands. He has been involved in IT operations for a number of years, working in fields as diverse as SecOps, cloud computing, and data center automation.