Large organisations are like dinosaurs; it takes time for a message to make it all the way from the tail, where pain is felt, to the head, where decisions can be made, and back down to the tail. In IT Operations, which must react in real time to an environment that is evolving constantly, this can become a major problem.
With Big Data Come Big Problems
Most companies above a certain scale have some sort of “Big Data” initiative. For better or worse, it is one of the hottest buzzwords around, which makes it a must-have for companies in every sector. IT Operations is no more immune to fads than other groups; some might even say that it is more susceptible to the sudden enthusiasms and fashions that sweep the sector. Accordingly, many IT departments approach gathering data from the infrastructure they are responsible for with a policy that translates to “MONITOR ALL THE THINGS!”.
Gathering data used to be a relatively complex affair, but these days, it’s trivially easy. There are a ton of good monitoring frameworks out there, and more modern infrastructure generally includes its own instrumentation. This means that anyone wanting to set up monitoring can easily gather all the data that anyone could ever want, and more. The difficulty is in analysing that stupendous flood of data, making sense of it, and generating actionable insights for operations teams to follow up on.
Static Rules Can’t Govern Dynamic Infrastructure
Part of the problem is due to changes in the infrastructure components themselves, and in how logical application components are deployed to them. IT used to be made up of static building blocks that did not change much over their operating lifetimes. Releases were leisurely affairs, run on an annual schedule if not longer.
This meant that a change was a pretty major event that could be planned for some time in advance. This in turn meant that failed changes and their consequences could be identified quite easily. When outages did occur, a big part of the troubleshooting effort involved looking for a single change that might have caused the entire problem. The ideal was of course to figure out a way to roll that change back, returning the environment to a known good state, but the sheer size and complexity of such a change meant that this was rarely possible. Most IT teams ended up having to make their fixes in production and hope for the best.
These days, infrastructure is largely virtual, constantly changing, and perhaps even self-modifying in response to component failures or changes in demand. Releases too have accelerated enormously, often to a daily frequency, sometimes even beyond that breakneck pace. IT departments are no longer operating in a steady-state equilibrium, infrequently punctuated by monolithic changes that were relatively simple (if resource-intensive) to predict and manage. Instead, the new normal is defined by almost continuous change, both of application services and of the infrastructure components that those application services rely on. These constant micro-changes cannot be predicted and managed according to the old models. Outages and performance issues can rarely be tied to a single root cause. So many changes have occurred, and the scope of each one is so small, that it is far more probable that what caused the problem is an unforeseen interaction between many different, seemingly unrelated, changes.
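To make the contrast with the old "find the one change" approach concrete, here is a minimal sketch that collects every change that landed shortly before an incident, rather than searching for a single culprit. The `Change` record, the six-hour window, and all of the sample data are assumptions made up for illustration; they do not come from any particular tool.

```python
from datetime import datetime, timedelta
from typing import NamedTuple


class Change(NamedTuple):
    """Hypothetical change record; real change feeds will look different."""
    when: datetime
    component: str
    summary: str


def changes_before(incident_time, changes, window=timedelta(hours=6)):
    """Return every change inside the window before the incident, newest first."""
    start = incident_time - window
    candidates = [c for c in changes if start <= c.when <= incident_time]
    return sorted(candidates, key=lambda c: c.when, reverse=True)


# Invented sample data: several small, seemingly unrelated changes.
changes = [
    Change(datetime(2024, 1, 9, 9, 0), "search-service", "deploy new build"),
    Change(datetime(2024, 1, 10, 9, 15), "load-balancer", "rotate TLS certificate"),
    Change(datetime(2024, 1, 10, 11, 30), "checkout-service", "deploy new build"),
    Change(datetime(2024, 1, 10, 13, 45), "db-pool", "raise max connections"),
]
incident = datetime(2024, 1, 10, 14, 0)

suspects = changes_before(incident, changes)
for change in suspects:
    print(change.when.isoformat(), change.component, change.summary)
```

With this sample data, three changes fall inside the window, and none of them is obviously "the" cause. A list like this is only a starting point for working out which combination of changes interacted badly.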
Micro-Changes Can Have Huge Impacts
Slowing down the rate of change is not an option, because that rate is an external input to the system. IT agility – meaning its ability to accept and deliver changes – translates directly to business agility. Any business that is not able to evolve rapidly in response to changing market conditions – whether systemic changes, moves by competitors, or new opportunities – is at a substantial disadvantage. Conversely, businesses that can take friction out of that process can accumulate an advantage in the field, by moving faster than their peers and seizing new opportunities as quickly as they appear.
The only way forward for organisations that intend to continue participating in this highly dynamic environment is therefore to move away from outdated IT Operations methodologies (and the tooling that implements them) that are rigid and slow to react, and instead to consider new approaches that are dynamic enough to accommodate the new regime of constant micro-changes.
AIOps and the Future of IT Operations
The emerging field of Algorithmic IT Operations (AIOps) proposes to do just that. AIOps avoids reliance on exhaustive documentation of physical assets, their relationships, and the infrequent changes to both. Instead, by using data science approaches to evaluate the huge volumes of data generated by modern IT infrastructures, it helps people and machines work together to find clarity in chaos and accelerate innovation. What drives this shift is not (just) technical factors, but the recognition of a fundamental change in the nature of IT and business, and of the importance of IT agility.
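The article does not name a specific technique, so the following is only one illustrative example of what a data-driven alternative to a static rule could look like: instead of a fixed threshold, each new measurement is compared against a baseline computed from the most recent values. The function name, window size, threshold, and latency figures are all assumptions chosen for the sketch.

```python
from collections import deque
from statistics import mean, stdev


def rolling_anomalies(values, window=5, threshold=3.0):
    """Return (index, value) pairs that deviate strongly from the rolling baseline.

    A point is flagged when it lies more than `threshold` sample standard
    deviations away from the mean of the previous `window` points.
    """
    recent = deque(maxlen=window)
    found = []
    for i, v in enumerate(values):
        if len(recent) == window:
            mu = mean(recent)
            sigma = stdev(recent)
            if sigma > 0 and abs(v - mu) > threshold * sigma:
                found.append((i, v))
        recent.append(v)
    return found


# Invented response times in milliseconds, with one obvious spike.
latencies = [100, 102, 98, 101, 99, 100, 250, 101, 99, 100]
spikes = rolling_anomalies(latencies)
print(spikes)
```

Because the baseline follows the data, the same code keeps working if normal latency drifts from 100 ms to 300 ms after a release, which a hard-coded threshold would not. Production systems would need something considerably more robust than this.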
The benefit of adopting this new, more dynamic approach to IT Operations will be that IT can move from being a drag on the organisation’s overall agility to being its enabler. Businesses whose IT departments are rigid and inflexible tend to think of them as cost centres and class them in the same general category as Facilities. This is not great for either party. By contrast, agile and proactive IT departments that make a real contribution to the success of the business become an asset and even a competitive differentiator.