Finding Excellence: Looking at Problem Management

This is the second part of a two-part blog series. The first instalment looked at Change Management.

Unlike change management, a solid problem management process is a solution rather than a cause, mainly because it addresses issues after the fact. When you assess problem management, you’re looking at several factors: the ability to determine which problems to investigate, the collection of data for root cause analysis, and the speed of permanent resolution. Problem management also requires good prioritization so that the right problems are addressed first.

Where do you start?

The first place to look when seeking to improve problem management is the incident management process and its operational practices. The data collected while an incident is being resolved is the key to success in problem management. Special attention should be paid to the following areas:

  • CI Selection: The affected CI should be checked at resolution to be sure it reflects the CI that was found to be the culprit. Often, the best the Service Desk can do is record the service or application CI, as they can tell which service is impacted but not which component caused the failure. At resolution, this is usually clear, as the technician fixing the problem knows which CI was touched to restore service.
  • Classification: if categories are used for trend reporting, the category (and subcategory) should be reconfirmed to ensure the incident is reported in the right place. This will also help with problem identification.
  • Cause codes and resolution codes: these items help you understand the type of problem you have. From “user trained” to “hardware repaired”, there’s valuable information for call trending if the codes are configured to provide this level of detail.
  • Resolution description: detailed information on how service was restored will speed up root cause analysis when an incident is escalated to a problem, particularly if the same restoration steps are taken during multiple recurrences (said a different way: “reboot is NOT a resolution!”).
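
The four areas above can be turned into a simple closure check. The following is a minimal sketch, not a feature of any particular service management tool: the Incident fields, the list of vague phrases and the five-word minimum are all assumptions chosen for illustration.

```python
from dataclasses import dataclass

# Hypothetical record layout; field names are illustrative only.
@dataclass
class Incident:
    affected_ci: str
    category: str
    cause_code: str
    resolution_code: str
    resolution_description: str

# Assumed phrases that name an action without explaining the fix.
VAGUE_RESOLUTIONS = {"reboot", "rebooted", "restarted", "fixed", "resolved"}

def closure_gaps(incident: Incident, min_words: int = 5) -> list:
    """Return the reasons this incident's closure data is too thin for trending."""
    gaps = []
    for field_name in ("affected_ci", "category", "cause_code", "resolution_code"):
        if not getattr(incident, field_name).strip():
            gaps.append(f"{field_name} is empty")
    text = incident.resolution_description.strip().lower().rstrip(".!")
    # Flag descriptions that are a known vague phrase or simply too short.
    if text in VAGUE_RESOLUTIONS or len(text.split()) < min_words:
        gaps.append("resolution_description lacks detail")
    return gaps
```

A check like this could run when a technician closes the ticket, so gaps are fixed while the details are still fresh.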

While data collection helps with repetitive incidents, practices around escalation of major incidents to problem management are also important. I’ve seen everything from all major incidents being escalated to problem management, even when the root cause is known, to the major incident review occurring as part of the problem management process, and finally to only a small number of incidents being escalated. Escalating an incident whose cause is already clear and avoidable can be a waste of time, but beyond that, any of these approaches is OK, depending on organizational maturity. The key is to find the balance: you want to escalate those issues that have a significant impact on the business and keep the scale appropriate to your resources. To say it another way, if you escalate more problems than you can investigate and resolve, all you have is a large problem backlog. So start small and keep increasing scope until you reach the volume that puts resources at an appropriate level of capacity. This is also an area where a Kanban board can be of assistance (especially if you have similar functionality in your service management platform). This tool could help by enabling a larger number of problems to be raised, while controlling how many are being addressed at a time, increasing your overall throughput.
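
To make the Kanban idea concrete, here is a toy sketch of a board with an unlimited backlog and a capped in-progress column. The class name, method names and the first-in, first-out pull order are assumptions for illustration; a real board would more likely pull by priority.

```python
from collections import deque

class ProblemBoard:
    """Toy Kanban board: unlimited backlog, capped in-progress column."""

    def __init__(self, wip_limit: int):
        self.wip_limit = wip_limit
        self.backlog = deque()   # raised but not yet under investigation
        self.in_progress = []    # currently under investigation
        self.done = []

    def raise_problem(self, problem_id: str) -> None:
        # Raising is always allowed; starting work depends on capacity.
        self.backlog.append(problem_id)
        self._pull()

    def resolve(self, problem_id: str) -> None:
        self.in_progress.remove(problem_id)
        self.done.append(problem_id)
        self._pull()

    def _pull(self) -> None:
        # Start the oldest backlog item only while capacity remains.
        while self.backlog and len(self.in_progress) < self.wip_limit:
            self.in_progress.append(self.backlog.popleft())
```

With a limit of two, a third problem can still be raised, but it waits in the backlog until one of the first two is resolved.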

For new organizations, starting small and growing Problem Management can work very well. When I implemented Problem Management in an organization with a small IT staff, we escalated only major incidents that recurred (three strikes and you’re out: three occurrences in one month caused a problem to be raised) and enterprise-level outages of public-facing or vital business functions. As the team’s bandwidth and throughput increased, problems could also be raised if the technician resolving the major incident did not know what caused it. Eventually, the top ten repetitive user-level incidents were added. The slow growth had a positive effect and helped everyone get on board.
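
The “three strikes” rule is easy to express in code. This sketch assumes that recurrence is judged per CI and that “one month” means any 30-day window; both are assumptions, as is the function name.

```python
from collections import defaultdict
from datetime import date, timedelta

def three_strikes(incidents, threshold=3, window_days=30):
    """incidents: iterable of (ci_name, date_opened) pairs.

    Returns the sorted CI names with at least `threshold` incidents
    inside any `window_days` span.
    """
    by_ci = defaultdict(list)
    for ci, opened in incidents:
        by_ci[ci].append(opened)

    flagged = []
    for ci, dates in by_ci.items():
        dates.sort()
        # Compare each incident with the one `threshold - 1` places later.
        for i in range(len(dates) - threshold + 1):
            if dates[i + threshold - 1] - dates[i] <= timedelta(days=window_days):
                flagged.append(ci)
                break
    return sorted(flagged)
```

Raising the threshold or shortening the window is one way to keep the number of problems raised in line with the team’s capacity.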

Problem resolution techniques

The last area I’d like to address is problem resolution techniques. The organization in which I managed Problem Management didn’t use any specific techniques, but I can see where these would have sped up resolution. Many of these techniques were still being developed and there were no real training courses like there are today, for Kepner-Tregoe in particular. We did benefit from a solid change management practice that often let us identify the change introduced shortly before an issue started occurring, which in turn helped pinpoint the root cause. We also had a few really good troubleshooters on the team.
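
Looking for changes introduced shortly before an issue started can be sketched as a simple time-window filter over the change log. The tuple layout, the seven-day lookback and the optional CI filter are assumptions for illustration, not a description of how any change management tool works.

```python
from datetime import datetime, timedelta

def candidate_changes(changes, issue_start, lookback_days=7, related_cis=None):
    """changes: iterable of (change_id, ci_name, implemented_at) tuples.

    Returns the IDs of changes implemented in the lookback window before
    issue_start, most recent first. If related_cis is given, only changes
    to those CIs are considered.
    """
    earliest = issue_start - timedelta(days=lookback_days)
    hits = [
        (change_id, implemented_at)
        for change_id, ci_name, implemented_at in changes
        if earliest <= implemented_at <= issue_start
        and (related_cis is None or ci_name in related_cis)
    ]
    hits.sort(key=lambda hit: hit[1], reverse=True)
    return [change_id for change_id, _ in hits]
```

The output is only a list of suspects to investigate; a change that falls inside the window is not proof of cause.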

My favorite root cause analysis story is about the issue that took over a year and Microsoft’s engineering team to pinpoint. The problem manager told me the issue was likely a hardware problem in the new domain controller set up the week before (based on the issue and timing). The server team investigated and found no issues with the server. Over a year later (with users suffering from 20-minute log-ins) the problem was found to be a faulty network card in that server, which had two network cards. The problem was intermittent and users only experienced the long log-ins when routing through that card and when the card was failing. Imagine the cost to the business and anger with IT that could have been avoided if only the team had taken the server off-line for a few days!

So where do you stack up? Ask yourself a few questions:

  • What is our average turnaround time for resolving a problem?
  • Is our backlog too large for us to manage effectively?
  • What percentage of issues are permanently resolved with problem management?
  • What percentage of known errors are logged and researched?
  • Do we have a way of communicating work-arounds for known errors to speed up restoration of service?
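
Some of these questions can be answered directly from problem records. The sketch below assumes each record is a dictionary with hypothetical keys 'opened', 'resolved' (a date, or None while still open) and 'permanent_fix'; it reads “permanently resolved” as the share of closed problems that ended with a permanent fix, which is one possible interpretation.

```python
from datetime import date

def problem_metrics(problems):
    """Summarize turnaround, backlog and permanent-fix rate for problem records."""
    closed = [p for p in problems if p["resolved"] is not None]
    backlog = len(problems) - len(closed)
    if not closed:
        # Nothing closed yet, so averages and percentages are undefined.
        return {"avg_turnaround_days": None, "backlog": backlog, "pct_permanent": None}
    total_days = sum((p["resolved"] - p["opened"]).days for p in closed)
    permanent = sum(1 for p in closed if p["permanent_fix"])
    return {
        "avg_turnaround_days": total_days / len(closed),
        "backlog": backlog,
        "pct_permanent": 100 * permanent / len(closed),
    }
```

Tracking these figures month over month shows whether the scope of problem management is growing faster than the team can absorb.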

Obviously, these pointers are only the tip of the iceberg to consider when looking at these two critical processes. I’d love to hear your experiences and wisdom, so feel free to comment.

Phyllis Drucker is a business process consultant at Linium. ITIL expert certified with over 20 years' experience in the disciplines and frameworks of IT Service Management as both a practitioner and consultant, she has also served the itSMF since 2004 in a variety of capacities including volunteer, board member and operations director of the US Chapter. She is a frequent contributor of knowledge to the ITSM profession, through numerous presentations, whitepapers and articles. Since 1997, her goal has been to advance the profession of ITSM leaders and practitioners worldwide by providing insight from her experiences on a wide variety of Service Management topics.