This document in the Google Cloud Architecture Framework provides best practices to manage services and define processes to respond to incidents. Incidents occur in all services, so you need a well-documented process to efficiently respond to these issues and mitigate them.
Incident management overview
It's inevitable that your well-designed system will eventually fail to meet its SLOs. In the absence of an SLO, your customers loosely define the acceptable service level themselves, based on their past experience. They escalate to your technical support team or a similar group, regardless of what's in your SLA.
To properly serve your customers, establish and regularly test an incident management plan. The plan can be as short as a single-page checklist with ten items. This process helps your team to reduce time to detect (TTD) and time to mitigate (TTM).
TTM is used rather than TTR because the R, for repair or recovery, often implies a full fix rather than mitigation. TTM emphasizes fast mitigation to quickly end the customer impact of an outage, followed by the often much longer process of fully fixing the problem.
A well-designed system with excellent operations increases the time between failures (TBF). In other words, operational principles for reliability, including good incident management, aim to make failures less frequent.
To run reliable services, apply the following best practices in your incident management process.
Assign clear service ownership
All services and their critical dependencies must have clear owners responsible for adherence to their SLOs. If there are reorganizations or team attrition, engineering leads must ensure that ownership is explicitly handed off to a new team, along with the documentation and training as required. The owners of a service must be easily discoverable by other teams.
Reduce time to detect (TTD) with well-tuned alerts
Before you can reduce TTD, review and implement the recommendations in the "Build observability into your infrastructure and applications" and "Define your reliability goals" sections of the Architecture Framework. For example, disambiguate between application issues and underlying cloud issues.
A well-tuned set of SLIs alerts your team at the right time without alert overload. For more information, see Build efficient alerts and Tune up your SLI metrics: CRE life lessons.
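One widely used tuning approach is multi-window, multi-burn-rate alerting on an availability SLI (successful requests divided by total valid requests): page only when the error budget is being consumed quickly over both a long and a short window. The following Python sketch shows the underlying calculation only; the SLO target, window sizes, threshold, and the count_requests helper are illustrative assumptions, not part of the framework.

```python
# Minimal sketch of a multi-window burn-rate check for an availability SLO.
# The SLO target, window sizes, and the count_requests helper are hypothetical
# placeholders; the 14.4 threshold corresponds to consuming roughly 2% of a
# 30-day error budget in one hour, a commonly used starting point.

SLO_TARGET = 0.999                # 99.9% availability (placeholder)
ERROR_BUDGET = 1 - SLO_TARGET


def burn_rate(good: int, total: int) -> float:
    """How fast the error budget is being consumed (1.0 = exactly on budget)."""
    if total == 0:
        return 0.0
    error_ratio = 1 - (good / total)   # SLI = successful requests / total valid requests
    return error_ratio / ERROR_BUDGET


def should_page(count_requests) -> bool:
    """Page only when both a long and a short window burn fast, to avoid noise."""
    long_burn = burn_rate(*count_requests(window_minutes=60))
    short_burn = burn_rate(*count_requests(window_minutes=5))
    return long_burn > 14.4 and short_burn > 14.4
```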
Reduce time to mitigate (TTM) with incident management plans and training
To reduce TTM, define a documented and well-exercised incident management plan. Have readily available data on what's changed in the environment. Make sure that teams know generic mitigations they can quickly apply to minimize TTM. These mitigation techniques include draining, rolling back changes, upsizing resources, and degrading quality of service.
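As an illustration of one generic mitigation, degrading quality of service can be as simple as a flag that operators flip to serve cheaper fallback responses while the underlying problem is investigated. The following Python sketch is purely hypothetical; the flag, function names, and fallback content are placeholders for your own serving path and configuration mechanism.

```python
# Hypothetical sketch of "degrade quality of service" as a generic mitigation.
# DEGRADED_MODE would normally come from a runtime flag or configuration
# service, so operators can enable it quickly without a new rollout.
DEGRADED_MODE = False


def get_cached_defaults(user_id: str) -> list[str]:
    """Cheap, static fallback content (placeholder)."""
    return ["popular-item-1", "popular-item-2"]


def get_personalized(user_id: str) -> list[str]:
    """Expensive personalized path (placeholder for the real dependency call)."""
    raise RuntimeError("recommendation backend overloaded")  # simulated failure


def recommendations(user_id: str) -> list[str]:
    if DEGRADED_MODE:
        return get_cached_defaults(user_id)    # operator-triggered degradation
    try:
        return get_personalized(user_id)
    except RuntimeError:
        return get_cached_defaults(user_id)    # automatic fallback on failure
```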
As discussed elsewhere in the Architecture Framework, create reliable operational processes and tools to support the safe and rapid rollback of changes.
Design dashboard layouts and content to minimize TTM
Organize your service dashboard layout and navigation so that an operator can determine in a minute or two if the service and all of its critical dependencies are running. To quickly pinpoint potential causes of problems, operators must be able to scan all charts on the dashboard to rapidly look for graphs that change significantly at the time of the alert.
Your dashboard might include the following example graphs to help troubleshoot issues. Incident responders should be able to see them at a glance in a single view:
- Service level indicators, such as successful requests divided by total valid requests
- Configuration and binary rollouts
- Requests per second to the system
- Error responses per second from the system
- Requests per second from the system to its dependencies
- Error responses per second to the system from its dependencies
Other common graphs that help troubleshoot include latency, saturation, request size, response size, query cost, thread pool utilization, and Java virtual machine (JVM) metrics (where applicable). Saturation refers to how full a resource is relative to some limit, such as quota or system memory size. Monitoring thread pool utilization helps you spot regressions caused by pool exhaustion.
Test the placement of these graphs against a few outage scenarios to ensure that the most important graphs are near the top, and that the order of the graphs matches your standard diagnostic workflow. You can also apply machine learning and statistical anomaly detection to surface the right subset of these graphs.
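For the statistical approach mentioned above, even a simple z-score over each chart's recent history can surface the graphs that changed most around the alert time. The following Python sketch is only an illustration; the metric names and sample values are invented.

```python
import statistics


def z_score(history: list[float], current: float) -> float:
    """How many standard deviations the current value is from recent history."""
    mean = statistics.mean(history)
    stdev = statistics.pstdev(history) or 1e-9  # avoid division by zero
    return abs(current - mean) / stdev


def rank_charts(charts: dict[str, list[float]], top_n: int = 5) -> list[str]:
    """Return the chart names whose latest value deviates most from their history."""
    scores = {
        name: z_score(values[:-1], values[-1])
        for name, values in charts.items()
        if len(values) > 2
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


# Example with invented metrics: the error rate spiked, the request rate is steady.
charts = {
    "errors_per_second": [2, 3, 2, 2, 3, 40],
    "requests_per_second": [1000, 1010, 990, 1005, 995, 1002],
}
print(rank_charts(charts))  # ['errors_per_second', 'requests_per_second']
```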
Document diagnostic procedures and mitigation for known outage scenarios
Write playbooks and link to them from alert notifications. If these documents are accessible from the alert notifications, operators can quickly get the information they need to troubleshoot and mitigate problems.
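One lightweight way to keep playbooks a click away is to map each alert to a playbook URL and include that URL in every notification. The following Python sketch is hypothetical; the alert names, URLs, and send_chat_message function are placeholders rather than any specific notification integration.

```python
# Hypothetical mapping from alert names to playbook URLs (placeholders).
PLAYBOOKS = {
    "checkout-high-error-rate": "https://wiki.example.com/playbooks/checkout-errors",
    "checkout-high-latency": "https://wiki.example.com/playbooks/checkout-latency",
}


def format_notification(alert_name: str, summary: str) -> str:
    """Build the notification text, always linking the matching playbook."""
    playbook = PLAYBOOKS.get(alert_name, "https://wiki.example.com/playbooks/generic")
    return f"[ALERT] {alert_name}: {summary}\nPlaybook: {playbook}"


def send_chat_message(channel: str, text: str) -> None:
    """Placeholder for your chat or paging integration."""
    print(f"to {channel}: {text}")


send_chat_message("#incidents", format_notification(
    "checkout-high-error-rate", "Error ratio above 2% for 10 minutes"))
```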
Use blameless postmortems to learn from outages and prevent recurrences
Establish a blameless postmortem culture and an incident review process. Blameless means that your team evaluates and documents what went wrong in an objective manner, without the need to assign blame.
Mistakes are opportunities to learn, not a cause for criticism. Always aim to make the system more resilient so that it can recover quickly from human error, or even better, detect and prevent human error. Extract as much learning as possible from each postmortem and follow up diligently on each postmortem action item in order to make outages less frequent, thereby increasing TBF.
Incident management plan example
A production issue has been detected, for example through an alert or a page, or it has been escalated to me:
- Should I delegate to someone else?
  - Yes, if you and your team can't resolve the issue.
- Is this issue a privacy or security breach?
  - If yes, delegate to the privacy or security team.
- Is this issue an emergency or are SLOs at risk?
  - If in doubt, treat it as an emergency.
- Should I involve more people?
  - Yes, if it impacts more than X% of customers or if it takes more than Y minutes to resolve. If in doubt, always involve more people, especially during business hours.
- Define a primary communications channel, such as IRC, Hangouts Chat, or Slack.
- Delegate previously defined roles, such as the following:
  - Incident commander, who is responsible for overall coordination.
  - Communications lead, who is responsible for internal and external communications.
  - Operations lead, who is responsible for mitigating the issue.
- Define when the incident is over. This decision might require an acknowledgment from a support representative or a similar team.
- Collaborate on the blameless postmortem.
- Attend a postmortem incident review meeting to discuss and staff action items.
Recommendations
To apply the guidance in the Architecture Framework to your own environment, follow these recommendations:
- Establish an incident management plan, and train your teams to use it.
- To reduce TTD, implement the recommendations to build observability into your infrastructure and applications.
- Build a "What's changed?" dashboard that you can glance at when there's an incident.
- Document query snippets or build a Looker Studio dashboard with frequent log queries, as shown in the sketch after this list.
- Evaluate Firebase Remote Config to mitigate rollout issues for mobile applications.
- Test failure recovery, including restoring data from backups, to decrease TTM for a subset of your incidents.
- Design for and test configuration and binary rollbacks.
- Replicate data across regions for disaster recovery and use disaster recovery tests to decrease TTM after regional outages.
- Design a multi-region architecture for resilience to regional outages if the business need for high availability justifies the cost, to increase TBF.
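For the log-query recommendation above, saved filters can also be run programmatically at the start of an incident. This minimal sketch assumes the google-cloud-logging Python client library and application default credentials; the project ID and filters are placeholders for your own saved queries.

```python
# Minimal sketch: run saved log queries at the start of an incident to answer
# "what recently failed?" and "what's changed?". Assumes the google-cloud-logging
# client library and application default credentials; the project ID and filters
# are placeholders for your own saved queries.
from google.cloud import logging as cloud_logging

client = cloud_logging.Client(project="my-project")  # placeholder project ID

FILTERS = {
    "recent errors": "severity>=ERROR",
    "recent changes (audit logs)": 'log_id("cloudaudit.googleapis.com/activity")',
}

for name, log_filter in FILTERS.items():
    print(f"--- {name} ---")
    for entry in client.list_entries(
        filter_=log_filter,
        order_by=cloud_logging.DESCENDING,  # newest entries first
        max_results=20,
    ):
        print(entry.timestamp, entry.severity, entry.payload)
```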
What's next
Learn more about how to build a collaborative incident management process.
Explore recommendations in other pillars of the Architecture Framework.