How do you reduce incident resolution time? (MTTR)

A while back I spoke about what MTTR and MTTD are and why they matter in software delivery and across an organization. Today I’ll go a bit more in depth and talk about how to reduce incident resolution time at your company. I’ll start with some general basics, but this article assumes you already understand MTTR and are able to measure it. If you aren’t there yet, read my previous article to get an idea, or let me know and I can write another one. Let’s get into it!

How do you reduce MTTR? (mean time to resolution)

Mean Time to Resolution (MTTR) is a key metric in incident management that measures the average time from when an incident is reported until it’s resolved. Here’s an example. Say I have two incidents: one takes 1000 minutes to resolve and the other takes 500 minutes. My MTTR is then (1000 + 500) / 2 = 750 minutes. Whether that’s good or bad will depend on your organization, but you can benchmark yourself against the DORA metrics to figure out what elite or good might look like. It’s important not to overly optimize for process, but to really understand why we’re doing specific things and to think of your customers.
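
If you want to sanity check the arithmetic, here is a minimal Python sketch of the same calculation. The function name `mean_time_to_resolution` is just something I made up for illustration; in practice the durations would come out of whatever incident tool you use.

```python
from statistics import mean

# Resolution times in minutes for the two incidents in the example above.
resolution_minutes = [1000, 500]


def mean_time_to_resolution(durations):
    """Return the average resolution time for a list of durations."""
    if not durations:
        raise ValueError("need at least one resolved incident")
    return mean(durations)


# Evaluates to 750, matching (1000 + 500) / 2.
mttr = mean_time_to_resolution(resolution_minutes)
```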

Here are some general best practices and tips on reducing MTTR:

  1. Incident Logging and Categorization: Properly log every incident with sufficient detail and categorize it appropriately. This helps in identifying which kinds of issues are most prevalent and which need immediate attention. For example, when an incident happens, ask someone who responded to it to fill out some documentation that answers some basic questions:
    • Name of incident
    • Area of impact
    • Financial impact?
    • Why did it happen / root cause
    • Action items
    • Timeline
  2. Automation: Automating repetitive tasks can significantly reduce resolution time. Consider implementing automated processes to take care of a lot of the administrative work, e.g. automatically creating Zoom calls, Slack channels, and PagerDuty or Confluence documentation. Sometimes a lot of the response time is eaten up by “where do I go and who is doing what?”
  3. Prioritization: Not all incidents are of equal importance. Implement a system of prioritization based on the business impact of incidents. Critical incidents that affect key functionalities or a significant number of users should be prioritized for resolution. For example, a low-priority, low-impact incident is still important but may not get the same full attention as production being down. Treat both as important, but with different levels of urgency.
  4. Effective Escalation: Establish an efficient escalation process for incidents that cannot be resolved immediately. This will help ensure that they are quickly brought to the attention of higher-level teams or individuals who have the necessary skills or authority to deal with them. For example, do you have on-call rotations for different parts of your company? Is it a dedicated team? What is the rotation schedule? Are we ensuring multiple people are exposed to incidents so there’s not a single point of failure? Tools like PagerDuty and Grafana OnCall are good for this.
  5. Knowledge Management: Maintain a knowledge base of past incidents and their resolutions. This can be an invaluable resource for solving similar incidents in the future. Make sure this information is easy to find and use. For example, PagerDuty (I don’t find PagerDuty the best for this, but you can use it), Confluence, or some other central repository with date stamps works well here.
  6. Training and Skill Development: Invest in training and skill development for your incident management team. This will enable them to resolve incidents more effectively and efficiently. Having a single point of failure or a few people who have tribal knowledge of the incident response process isn’t the best. Documenting how to respond to different incidents and training others across the organization will make you more resilient.
  7. Regular Review and Analysis: Regularly review and analyze your incident management process to identify bottlenecks and areas for improvement. Look at the metrics and understand where the time is being spent during the incident lifecycle.
  8. Post-Mortem Analysis: After resolving an incident, conduct a post-mortem analysis to understand the root cause of the issue. This can help prevent similar incidents in the future, reducing the overall number of incidents that need to be resolved. More on this later, but this is one of the most important sources of insight. Ensure you build a culture where everyone from junior engineers all the way to staff participates in this meeting. Sharing insights and where we went wrong so we can improve is absolutely key.
  9. Collaboration and Communication: Effective collaboration and communication are key to quick resolution. Use collaborative tools to streamline communication among the teams involved in incident resolution. For example, do you have a status page or some other communication when an incident happens? What is your communication strategy here? If stakeholders are left in the dark, you’re going to get a lot of escalations from executives.
  10. Proactive Monitoring: Implement proactive monitoring and alerting of your systems to catch and resolve incidents before they affect your users. This not only reduces MTTR but also improves user satisfaction. Monitoring is tricky but absolutely vital. You can set up monitoring channels or guilds that monitor specific parts of the organization. Sharing monitoring and creating automated alerts are key. It’s important that when an alert does fire, there is either a response or a fine-tuning of the alert. If there’s monitoring but no one is looking at it, it’s not very useful.
  11. Service Level Agreement (SLA) Management: Understand your SLAs and manage them effectively. A well-defined SLA can help prioritize incident resolution based on the business impact. Some organizations have SLAs with external customers. Understanding them and knowing what their impacts are is crucial during an incident.
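
To make the incident logging tip (item 1) a bit more concrete, here is a rough sketch of what a structured incident record could look like in Python. The class name, field names, and sample values are all assumptions of mine for illustration; adapt them to whatever template your organization actually uses.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional, Tuple


@dataclass
class IncidentRecord:
    """One logged incident, mirroring the basic questions from item 1."""

    name: str
    area_of_impact: str
    root_cause: str
    reported_at: datetime
    resolved_at: datetime
    # Financial impact is often unknown at first, so it is optional.
    financial_impact_usd: Optional[float] = None
    action_items: List[str] = field(default_factory=list)
    # Timeline entries are (timestamp, what happened) pairs.
    timeline: List[Tuple[datetime, str]] = field(default_factory=list)

    def resolution_minutes(self) -> float:
        """Minutes between the incident being reported and resolved."""
        return (self.resolved_at - self.reported_at).total_seconds() / 60


# Made-up sample incident, purely for illustration.
record = IncidentRecord(
    name="Checkout API outage",
    area_of_impact="Payments",
    root_cause="Expired TLS certificate",
    reported_at=datetime(2024, 1, 15, 9, 0),
    resolved_at=datetime(2024, 1, 15, 10, 30),
    action_items=["Add certificate expiry alert"],
)
```

Keeping the reported and resolved timestamps on the record means the resolution time (and therefore MTTR) can be derived from the log itself rather than tracked by hand.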

Remember, the objective is not just to reduce MTTR but also to balance it with the quality of the solution. It is important to aim for a thorough resolution that addresses the root cause of the incident rather than just a quick fix that might lead to the issue recurring. Next, we’ll talk about MTTD (mean time to detection).

How do you reduce MTTD? (mean time to detection)

What is MTTD? How does it relate to MTTR? Mean Time to Detect (MTTD) is a valuable performance metric used in incident management. It measures the average time it takes for an organization to discover an incident from when it first occurs. The goal for any organization is to keep MTTD as low as possible because the sooner an incident is detected, the sooner it can be resolved. Here are some strategies to help improve MTTD:

  1. Proactive Monitoring: This is one of the most effective ways to improve MTTD. Regularly monitor your systems for signs of failure or degradation. This can be done using various tools that automatically check system health and report anomalies. The key is to identify and address potential issues before they become actual incidents.
  2. Alerting Systems: Implement alerting systems that notify the appropriate personnel as soon as an incident occurs or even when something looks suspicious. These systems can be based on predefined rules or thresholds, and can provide real-time updates to keep everyone informed.
  3. Anomaly Detection: Use machine learning algorithms to identify unusual patterns in system behavior that may indicate a potential incident. This can be particularly effective in complex systems where it may be difficult to define all possible failure conditions in advance.
  4. Automated Testing: Implement automated testing to constantly check system functionality. This can help catch incidents immediately after they occur, especially those caused by recent changes or updates.
  5. Log Analysis: Regularly analyze system logs to detect signs of incidents.
  6. Incident Management Tools: Use dedicated incident management tools that can consolidate information from different monitoring and alerting systems. These tools can provide a holistic view of the system and can be instrumental in detecting incidents quickly.
  7. Feedback Loops with Users: Users can often be the first to notice an incident, especially if it’s impacting the functionality they rely on. Encourage users to report any issues they encounter and ensure there are easy ways for them to do so.
  8. Regular System Audits: Perform regular audits of your systems to check for potential issues. This can be particularly useful for detecting security incidents, which may not always result in obvious system degradation.
  9. Take insights out of the post-mortem process: Did a great topic come out of the post-mortem discussion? Make sure it’s applied so you can improve MTTD / MTTR. Insights are super valuable, and it’s a great thing when that knowledge is shared.
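
As a small illustration of the alerting and anomaly detection ideas (items 2 and 3), here is a sketch of a simple statistical check in Python. To be clear, this is a plain z-score test, a much simpler stand-in for the machine learning approaches mentioned above, and the function name, the threshold of 3 standard deviations, and the latency numbers are all assumptions I picked for the example.

```python
from statistics import mean, stdev


def is_anomalous(history, latest, z_threshold=3.0):
    """Flag `latest` if it is far from the recent baseline.

    Returns True when `latest` is more than `z_threshold` sample
    standard deviations away from the mean of `history`.
    """
    if len(history) < 2:
        return False  # not enough data to estimate a baseline
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        # A perfectly flat history: any change at all is unusual.
        return latest != baseline
    return abs(latest - baseline) / spread > z_threshold


# Made-up response times in milliseconds.
recent_latencies_ms = [120, 118, 125, 122, 119, 121, 123]

normal_reading = is_anomalous(recent_latencies_ms, 124)
spike_reading = is_anomalous(recent_latencies_ms, 400)
```

With these sample numbers, a reading of 124 ms stays within the baseline while 400 ms gets flagged. A real setup would usually lean on your monitoring tool’s built-in alert rules rather than hand-rolled code like this.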

Remember, MTTD is just one part of the larger incident management process. While it’s important to detect incidents quickly, it’s also important to resolve them efficiently and to learn from them to prevent similar incidents in the future.
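
For reference, MTTD can be computed the same way as MTTR, just from different timestamps. Here is a minimal Python sketch, assuming you record when each incident started and when it was detected. The function name and sample timestamps are made up for illustration.

```python
from datetime import datetime

# (occurred_at, detected_at) pairs for two made-up incidents.
detections = [
    (datetime(2024, 3, 1, 8, 0), datetime(2024, 3, 1, 8, 20)),
    (datetime(2024, 3, 5, 14, 0), datetime(2024, 3, 5, 14, 40)),
]


def mean_time_to_detect_minutes(pairs):
    """Average minutes between an incident occurring and being detected."""
    if not pairs:
        raise ValueError("need at least one detected incident")
    total_seconds = sum(
        (detected - occurred).total_seconds() for occurred, detected in pairs
    )
    return total_seconds / 60 / len(pairs)


# The two incidents took 20 and 40 minutes to detect, so this is 30.0.
mttd = mean_time_to_detect_minutes(detections)
```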

How are incident levels useful in all this and why should we define them?

Defining incident levels, often referred to as incident prioritization or severity levels, is a key aspect of incident management. It involves classifying incidents based on their impact on business operations and the urgency with which they need to be resolved.

This classification typically involves multiple levels of severity, such as:

  1. Critical: The highest priority incidents, typically involving a total service outage or a security breach, affecting all or a large subset of users.
  2. High: Significant issues that have a broad impact, but are not completely halting operations.
  3. Medium: Issues affecting a smaller number of users or functionalities, or which have a workaround available.
  4. Low: Minor issues with little impact on operations or users, or issues affecting only a single user.
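
If you track incidents in code, one way to encode these four levels is an ordered enum, so that sorting and comparing by severity come for free. This is only a sketch: the `Severity` class, the `label` property, and the sample queue are names and data I made up for illustration.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Lower number means more urgent, like a Sev1 to Sev4 scheme."""

    CRITICAL = 1
    HIGH = 2
    MEDIUM = 3
    LOW = 4

    @property
    def label(self) -> str:
        """Short numeric label, e.g. 'Sev1' for CRITICAL."""
        return f"Sev{self.value}"


# Made-up queue of (severity, title) pairs in the order they arrived.
queue = [
    (Severity.LOW, "Typo on settings page"),
    (Severity.CRITICAL, "Production is down"),
    (Severity.MEDIUM, "Export is slow for some users"),
]

# Work the most severe incidents first.
by_urgency = sorted(queue, key=lambda item: item[0])
```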

You can use a level or number system as well, e.g. Sev1 to Sev4 or P1 to P4. There are lots of flavours of this. There are several reasons why defining incident levels can help reduce both MTTD (Mean Time to Detect) and MTTR (Mean Time to Resolution):

  1. Improved Prioritization: When incidents are properly classified, it becomes easier for the team to prioritize their efforts. High severity incidents can be tackled first, ensuring that the most critical issues are resolved as quickly as possible.
  2. Better Resource Allocation: Incident levels allow for more effective allocation of resources. For instance, more experienced or specialized team members can be assigned to resolve higher severity incidents, while lower severity incidents can be handled by less experienced team members or through automated systems. You wouldn’t call everyone in on a Sev4 (low priority issue), but you may call your most senior specialized engineers in on a critical incident (Sev1).
  3. Streamlined Communication: When an incident’s severity level is clear, it can facilitate better communication both within the team and with stakeholders. Everyone understands the impact of the incident and can align their actions accordingly.
  4. Efficient Escalation Process: With clear incident levels, escalation processes become more efficient. If a lower severity incident evolves into a higher severity one, it’s easier to escalate it to the appropriate individuals or teams. Communication plans are important here as well. For example, during a Sev4 (low priority incident) you may not inform business stakeholders. However, during Sev1s and Sev2s you would probably want to inform your business partners about the impact to the customer.
  5. Improved Incident Analysis: Over time, analyzing incidents based on their levels can provide valuable insights. You can identify patterns, such as certain types of incidents consistently being high severity, and take steps to address the underlying causes. You can also define the financial impact of an incident, e.g. a Sev1 (critical incident) costs us $500,000 per incident and generally takes 500 minutes to resolve. You’ll get a ton of insights over time.
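
The kind of per-severity analysis described in item 5 can be sketched in a few lines of Python. Everything here is illustrative: the dictionary keys, the `summarize_by_severity` name, and the numbers are assumptions, not real data.

```python
from collections import defaultdict
from statistics import mean

# Made-up incident history for illustration only.
incident_history = [
    {"severity": "Sev1", "resolution_minutes": 500, "cost_usd": 500_000},
    {"severity": "Sev1", "resolution_minutes": 300, "cost_usd": 300_000},
    {"severity": "Sev3", "resolution_minutes": 60, "cost_usd": 2_000},
]


def summarize_by_severity(incidents):
    """Group incidents by severity and report count, MTTR, and total cost."""
    grouped = defaultdict(list)
    for incident in incidents:
        grouped[incident["severity"]].append(incident)
    return {
        severity: {
            "count": len(items),
            "mttr_minutes": mean(i["resolution_minutes"] for i in items),
            "total_cost_usd": sum(i["cost_usd"] for i in items),
        }
        for severity, items in grouped.items()
    }


summary = summarize_by_severity(incident_history)
```

Looking at MTTR and cost per severity level, rather than one blended number, makes it easier to see where improvement work will pay off the most.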

Remember, incident levels should be defined according to the specific needs and context of your organization. For example, what constitutes a ‘Critical’ incident for a healthcare organization might be very different from what it means for a software development company. Regularly review and update your incident levels as your organization and its needs evolve. Most of my experience is in software development, but a similar process has been used in banking as well.

What about the post mortem meeting? Why is this useful in terms of incidents?

Post-mortem analyses, also known as incident reviews or root cause analyses, are critical components of incident management. These analyses take place after an incident is resolved and involve a detailed examination of what happened, why it happened, how it was handled, and what can be done to prevent similar incidents in the future.

The goal of the post-mortem process is not merely to understand the incident, but to generate actionable insights that can improve future incident management. These insights are usually recorded as action items.

Action items play an important part for several reasons:

  1. Preventing Recurrence: Action items often involve steps to address the root cause of the incident, which can help prevent similar incidents from happening again in the future. If you go through this entire process and don’t action anything, the cycle of incidents will continue with a higher degree of severity.
  2. Process Improvement: They can highlight ways to improve your incident management process, such as updating your incident response plan, refining your escalation procedures, or investing in better tools. For example, you may want a lighter-weight incident response process for lower-severity incidents. However, you should still fill out the information required for all incidents.
  3. Knowledge Building: Action items contribute to building an organization-wide knowledge base about incident management. This knowledge can be invaluable for training new team members and for handling future incidents more effectively. Having meetings that multiple stakeholders can attend is key for knowledge building (e.g. business and tech, along with junior and senior members).
  4. Responsibility Assignment: When action items are assigned to specific individuals or teams, it ensures that there’s clear responsibility for implementing the insights gained from the post-mortem. It’s important to have accountability here and ensure the items get done. Without this, incidents will re-occur and MTTR will increase over time.

If you don’t action anything from the post-mortem process, you miss out on these benefits. In effect, you’re not learning from your past incidents. This can lead to several negative outcomes:

  1. Repeated Incidents: If the root causes of incidents aren’t addressed, the same incidents are likely to happen again and again. This can lead to increased downtime, decreased user satisfaction, and potentially higher costs for incident resolution.
  2. Inefficient Processes: Without learning from past incidents, you may continue to use inefficient processes for incident management. This can result in longer MTTD and MTTR, which can further impact your service quality and user satisfaction.
  3. Missed Opportunities for Improvement: Every incident, even a painful one, is an opportunity for learning and improvement. If you’re not taking action based on your post-mortem analyses, you’re missing out on these opportunities.

Therefore, it’s important not just to conduct post-mortem analyses, but also to follow through on the action items that come out of them. Regularly review your action items, track their implementation, and evaluate their impact on your incident management process.
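
Tracking can be as simple as a list of action items with an owner and a due date that you review on a regular basis. Here is a minimal Python sketch of that idea. The `ActionItem` class, the `overdue_items` helper, and the sample owners and dates are hypothetical; most teams would do this in their ticketing tool instead.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ActionItem:
    """A post-mortem action item with a clear owner and due date."""

    description: str
    owner: str
    due: date
    done: bool = False


def overdue_items(items, today):
    """Return open action items whose due date has already passed."""
    return [item for item in items if not item.done and item.due < today]


# Made-up action items for illustration.
action_items = [
    ActionItem("Add certificate expiry alert", "alice", date(2024, 2, 1)),
    ActionItem("Update escalation runbook", "bob", date(2024, 3, 1), done=True),
    ActionItem("Add load test to CI", "carol", date(2024, 4, 1)),
]

late = overdue_items(action_items, today=date(2024, 3, 15))
```

In this sample, only the first item comes back as overdue: the second is already done and the third is not due yet.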

Closing

Process improvement around incidents takes time. It’s important to note that this could take several weeks, months or years depending on the maturity of your organization and its attitude towards incidents. Post-mortems should be blameless, and the idea behind them is to improve as an organization. Lowering your MTTR is about visibility, customer impact, communication plans and improvement at all levels of the organization. Make sure you start somewhere and continue to improve. If you have any questions or want to implement this at your organization, feel free to reach out!
