The major phases of Incident Management

An overview of the major phases of incident management.

Didier Caroff

Jul 4, 2017 — 4 min read

When an issue occurs, several bad behaviours appear: panic, a scattered team, fear, stress, and so on.

All of this is unproductive and creates the worst possible environment for tackling the issue. To mitigate risk, a well-structured incident management plan should be prepared in advance. Some articles I have read describe 5, 6 or more stages for handling an incident effectively. In my experience, 5 stages are enough:

1/ Detection: monitoring, metrics and thresholds

How do you know something has gone wrong?

It’s one of the hard parts, because of the multiplicity of systems and technologies and their lifetimes (which get shorter and shorter).

Being warned in time is easy to say and hard to do.

Alerts must have reasonable thresholds (with too many alerts, nobody pays attention), the right recipient, and the right tool to page that recipient, such as PagerDuty or OpsGenie.

You never know what the next incident will be, but with the right detection tools you will have a deeper understanding of the system and be far better prepared for the unknown.
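As an illustration of a "reasonable threshold", here is a minimal sketch of an alert rule that only fires when a metric stays above its threshold for several consecutive samples, which is one common way to cut down on noisy alerts. The rule name, threshold, sample values and recipient are hypothetical, and real tools express this in their own configuration formats.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ThresholdRule:
    """Hypothetical alert rule, not tied to any monitoring product."""
    name: str
    threshold: float
    min_consecutive: int  # samples above the threshold before alerting
    recipient: str        # who should be paged when the rule fires

    def __post_init__(self) -> None:
        if self.min_consecutive < 1:
            raise ValueError("min_consecutive must be at least 1")


def should_alert(rule: ThresholdRule, samples: List[float]) -> bool:
    """Return True if the last `min_consecutive` samples all breach the threshold."""
    if len(samples) < rule.min_consecutive:
        return False
    recent = samples[-rule.min_consecutive:]
    return all(value > rule.threshold for value in recent)


# Assumed example: page level 1 when latency exceeds 500 ms for 3 samples in a row.
rule = ThresholdRule("api_latency_ms", 500.0, 3, "level-1-on-call")
spike = [120.0, 650.0, 130.0, 140.0]      # a single spike: no page
sustained = [120.0, 510.0, 620.0, 700.0]  # a sustained breach: page
```

With this rule, `should_alert(rule, spike)` stays false while `should_alert(rule, sustained)` becomes true, so a short blip does not wake anyone up.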

2/ Identify metrics to drive improvement

How are alerts delivered, triaged, and escalated?

To reach this goal, you need process management, communication in the language of the business, cross-silo collaboration and, at the very least, incident triage.

First, a clear process must be in place: alerts and issues are delivered reliably, incidents are triaged daily, and an Escalation team acts as guardian of the technical platform.

On-call people can be split into 3 different groups:

The first level receives the first alert from all systems, whether or not there is a business impact. Alerts must be well documented and explained. A typical alert failure is a procedure that only says to call another team. No way. A good alert typically comes with a first action to take. All in all, your process must be aligned. You can use the OODA loop (Observe, Orient/Analyse, Decide, Act) to help you reach this goal.
The next level is level 2, where you can have the SRE/Infra or Escalation team at large (people who manage systems, infrastructure and applications). These 2 teams have an on-call schedule and a team rotation.
The last level, level 3, is for the developer teams/application owners. These people are called “on demand” and don’t have an on-duty schedule.
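The three levels above can be sketched as a small routing function. This is only an illustration: the 15-minute acknowledgement timeout and the time-based hand-over from level 1 to level 2 are assumptions that are not stated in the article, and level 3 is only returned when someone explicitly asks for an application owner, to reflect the "on demand" rule.

```python
LEVEL_1 = "level 1: first responders"
LEVEL_2 = "level 2: SRE/Infra or Escalation team"
LEVEL_3 = "level 3: developers / application owners"

# Assumed value: how long level 1 has to acknowledge before level 2 is paged.
L1_ACK_TIMEOUT_MIN = 15


def who_to_page(minutes_unacknowledged: int, needs_app_owner: bool = False) -> str:
    """Pick the on-call level for an alert.

    Level 3 has no rota, so it is only reached on explicit request.
    """
    if needs_app_owner:
        return LEVEL_3
    if minutes_unacknowledged < L1_ACK_TIMEOUT_MIN:
        return LEVEL_1
    return LEVEL_2
```

For example, `who_to_page(5)` returns level 1, `who_to_page(20)` returns level 2, and `who_to_page(0, needs_app_owner=True)` returns level 3.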

3/ Remediation: fixes tickets, Mean time to detect, Mean time to resolve

What are the actions taken once an incident is identified?

Quick resolution is the key. You can scale up, roll back, redirect traffic or spot the differences using historical graphs. You need to get immediate help from all staff and proceed by elimination, dividing the target into components.
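The heading of this stage mentions mean time to detect (MTTD) and mean time to resolve (MTTR). Here is a minimal sketch of how they could be computed from incident timestamps. Definitions vary between teams; this sketch assumes both are measured from the start of the incident, and the two incidents are made-up sample data.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class Incident:
    started_at: datetime   # when the problem actually began
    detected_at: datetime  # when an alert or a person noticed it
    resolved_at: datetime  # when service was restored


def mean_delta(deltas: List[timedelta]) -> timedelta:
    """Average a non-empty list of durations."""
    return sum(deltas, timedelta()) / len(deltas)


def mttd(incidents: List[Incident]) -> timedelta:
    return mean_delta([i.detected_at - i.started_at for i in incidents])


def mttr(incidents: List[Incident]) -> timedelta:
    return mean_delta([i.resolved_at - i.started_at for i in incidents])


# Made-up sample data for illustration only.
incidents = [
    Incident(datetime(2017, 7, 3, 10, 0), datetime(2017, 7, 3, 10, 10), datetime(2017, 7, 3, 11, 0)),
    Incident(datetime(2017, 7, 4, 14, 0), datetime(2017, 7, 4, 14, 20), datetime(2017, 7, 4, 14, 40)),
]

print(mttd(incidents))  # 0:15:00
print(mttr(incidents))  # 0:50:00
```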

Here are a few questions that can help:

What makes you think there is a performance problem?
Has this system ever performed well?
What has changed recently? (Software? Hardware? Load?)
Can the performance degradation be expressed in terms of latency or run time?
Does the problem affect other people or applications (or is it just you)?
What is the environment? What software or hardware is used? Versions? Configuration?

Some useful tools during firefighting:

Internal Communications (Slack or HipChat)
Application Performance and Infrastructure Monitoring (Nagios, Prometheus, …)
Status Updates (statuspage.io for external customers and/or template email for internal)
Ticketing Solution (Jira, Confluence)
Procedure Tracking (Postmortem)

Incident workflow example:

4/ Analysis: postmortem, 5 Whys, understand, non-recurrence, process ownership, root cause categorization

How are incidents analyzed following remediation?

Globally, up to 30% of root causes remain unknown, which causes incidents to reoccur. To avoid reoccurrence, you need:

the root cause,
the postmortem,
the postactions.

Here is a list of 25 possible root causes and their descriptions:

Code issue: regression due to a line of code
Configuration issue: regression due to a configuration change in the application or system
Integration issue: regression due to an integration/implementation/migration change in the application or the system
Release validation incorrect/incomplete: the post-check deployment was incorrect/incomplete
Service outage/degradation: any period of complete disruption or interruption of an IT service
Hardware outage/degradation: a hardware component has an issue
Data incorrect/incomplete: data contain misleading, inaccurate or incomplete information
Network outage/degradation: network equipment X is down or degraded
Database outage/degradation: database Y is down or degraded
Access rights outage/degradation: regression on access rights for someone
Monitoring missing/degradation: issue due to an alert missing or not complete
ABTest issue: regression due to a bad ABTest configuration created, updated or stopped
Decommission issue: regression due to a decommission of an application or a server
Capacity issue: quota for device reached due to volume increase
By design: a failure, but an expected one
Document incorrect/incomplete: instructions were incomplete for the on-duty person
Client issue: incident due to a client
Partner issue: incident due to a partner, external vendor or supplier
Unpredictable event: theft, sabotage, earthquake, flood
Issue incorrectly escalated: priority was incorrectly set
Instructions incorrectly followed: instructions not performed as expected
Communication issue: issue due to a lack of communication between 2 teams
Duplicated: issue duplicated with another tracked issue
Not reproduced: happened once and can’t be reproduced
Root cause unknown: the bad, the ugly, the evil root cause
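Once every incident is tagged with one of the categories above, tracking them becomes simple counting. The sketch below tallies a list of root cause labels and computes the share left as "Root cause unknown". The sample list is invented for illustration and does not come from real incident data.

```python
from collections import Counter
from typing import List

UNKNOWN = "Root cause unknown"


def unknown_share(root_causes: List[str]) -> float:
    """Fraction of incidents whose root cause was never identified."""
    if not root_causes:
        return 0.0
    return Counter(root_causes)[UNKNOWN] / len(root_causes)


# Invented sample: one label per closed incident.
closed_incidents = [
    "Code issue",
    "Configuration issue",
    "Root cause unknown",
    "Code issue",
    "Capacity issue",
]

by_category = Counter(closed_incidents)
top_category, top_count = by_category.most_common(1)[0]
share = unknown_share(closed_incidents)
```

Here `top_category` is "Code issue" with 2 incidents, and `share` is 0.2, meaning one incident in five could reoccur without anyone knowing why.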

For the postmortem and post-actions, we use the 5 Whys performance method to dig deeper into the problem:

Given the delivered performance, ask “why?”, then answer this question
Given the previous answer, ask “why?” again, then answer this question, until 5 whys have been asked
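The 5 Whys chain can be recorded as a list of question and answer pairs, where each answer becomes the subject of the next question. This is a minimal sketch; the function name and the example problem and answers are hypothetical.

```python
from typing import List, Tuple


def five_whys(problem: str, answers: List[str], max_depth: int = 5) -> List[Tuple[str, str]]:
    """Pair each answer with the "why?" it responds to, stopping after max_depth."""
    chain: List[Tuple[str, str]] = []
    current = problem
    for answer in answers[:max_depth]:
        chain.append((f"Why: {current}?", answer))
        current = answer  # the next "why?" is asked about this answer
    return chain


# Hypothetical postmortem notes.
chain = five_whys(
    "checkout latency doubled",
    [
        "the database was saturated",
        "a new query scanned a full table",
        "the index was missing in production",
        "the migration was skipped during the release",
        "the release checklist does not cover migrations",
    ],
)
```

The last answer in `chain` is the candidate root cause, and each earlier pair documents how the team got there.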

More details on this part can be found here https://medium.com/@kwa29/it-post-mortem-guidelines-77214c6e7e34

5/ Readiness: improvement, game days, learning, freeze period, chaos monkey

What are the processes in place for effective incident response?

Several processes help people respond to incidents effectively:

a knowledge base,
capacity planning to assess production capacity needs,
a continuous improvement culture,
frequent deployments in small iterations,
frequent training, both external and internal,
freeze periods during key business dates (EOY, Black Friday, Sales, …) to protect the business,
Chaos Monkey, a service that randomly terminates VM instances and containers.
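A freeze period is easy to enforce with a check in the deployment pipeline. The sketch below assumes freeze windows are kept as pairs of inclusive dates; the windows themselves are hypothetical examples, not dates taken from the article.

```python
from datetime import date
from typing import List, Tuple

# Hypothetical freeze windows (inclusive start and end dates).
FREEZE_PERIODS: List[Tuple[date, date]] = [
    (date(2017, 11, 20), date(2017, 11, 27)),  # around Black Friday
    (date(2017, 12, 18), date(2018, 1, 2)),    # end of year
]


def is_frozen(day: date, periods: List[Tuple[date, date]] = FREEZE_PERIODS) -> bool:
    """Return True if deployments should be blocked on the given day."""
    return any(start <= day <= end for start, end in periods)


def can_deploy(day: date) -> bool:
    return not is_frozen(day)
```

A pipeline step could call `can_deploy(date.today())` and stop the release when it returns false.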

Finally, from my point of view, the future of Incident Management will be a mix of AI and processes.
