This article gathers practical information on the major phases of incident management.
When an issue occurs, several bad behaviours appear: panic, a scattered team, fear, stress, etc.
All of this is unproductive and creates the worst environment in which to tackle the issue. To reduce mitigation time and risk, a well-structured incident management plan should be prepared. Some articles I have read describe 5, 6 or more stages for handling an incident effectively. From my experience, 5 stages are enough:
1/ Detection: monitoring, metrics and thresholds
How do you know something has gone wrong?
It’s one of the hardest parts, due to the multiplicity of systems and technologies and their lifetimes (which get shorter and shorter).
Being warned in time is easy to say and hard to do.
Alerts must have reasonable thresholds (too many alerts means nobody pays attention), the right recipient and the right tool to reach them, like PagerDuty or OpsGenie.
You never know what the next incident will be, but if you have the right detection tooling then you will have a deeper understanding of the system and will be far more prepared for the unknown.
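To make the point about reasonable thresholds concrete, here is a minimal sketch in Python. It assumes one way of cutting noise: only fire when a metric stays above its threshold for several consecutive samples. The class name, the 5% threshold and the sample values are all hypothetical, not taken from any specific monitoring tool.

```python
from collections import deque


class ThresholdAlert:
    """Fire only when a metric stays above a threshold for N consecutive samples."""

    def __init__(self, threshold: float, consecutive: int) -> None:
        self.threshold = threshold
        self.consecutive = consecutive
        # Keep only the most recent `consecutive` samples.
        self.window = deque(maxlen=consecutive)

    def observe(self, value: float) -> bool:
        """Record a sample and return True if the alert should fire."""
        self.window.append(value)
        return (
            len(self.window) == self.consecutive
            and all(v > self.threshold for v in self.window)
        )


# Hypothetical error-rate samples (percent), one per minute.
samples = [1.0, 6.0, 2.0, 7.0, 8.0, 9.0]
alert = ThresholdAlert(threshold=5.0, consecutive=3)
fired = [alert.observe(s) for s in samples]
```

With these samples the isolated spike at 6.0 does not page anyone; only the sustained run 7.0, 8.0, 9.0 fires the alert. The trade-off is a slower detection in exchange for fewer false alarms.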
2/ Identify metrics to drive improvement
How are alerts delivered, triaged, and escalated?
To reach this goal you need process management, communication in the language of the business, cross-silo collaboration and, at the very least, incident triage.
First, a clear process must be in place: alerts and issues are delivered reliably, incidents are triaged daily, and an Escalation team acts as guardian of the technical platform.
On-call people can be split into 3 different groups:
- The first level receives the first alert from all systems, whether or not there is a business impact. Alerts must be well documented and explained. A typical alert failure is a procedure that only says to call another team. No way. A good alert comes with a first action to take. All in all, your process must be aligned. You can use the OODA loop (Observe, Orient/Analyse, Decide, Act) to help you reach this goal.
- The next level, level 2, is where you have SRE/Infra or the Escalation team at large (people who manage systems, infrastructure and applications). Those 2 teams have an on-call schedule and a team rotation.
- The last level, level 3, is for developer teams/application owners. Those people are called “on demand” and don’t have an on-duty schedule.
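The three levels above can be sketched as a simple escalation chain. This is only an illustration in Python: the acknowledgement timeouts (15 and 30 minutes) and the names `Level`, `ESCALATION_CHAIN` and `level_to_page` are my own assumptions, and a real paging tool would hold this configuration for you.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Level:
    name: str
    on_demand: bool       # True when the level is only called when needed
    ack_timeout_min: int  # hypothetical minutes to wait before escalating


# The three levels described above; timeouts are illustrative assumptions.
ESCALATION_CHAIN = [
    Level("L1 first responders", on_demand=False, ack_timeout_min=15),
    Level("L2 SRE/Infra or Escalation team", on_demand=False, ack_timeout_min=30),
    Level("L3 developers/application owners", on_demand=True, ack_timeout_min=0),
]


def level_to_page(minutes_unacknowledged: int) -> Level:
    """Return the level that should hold the alert after the given delay."""
    elapsed = 0
    for level in ESCALATION_CHAIN[:-1]:
        elapsed += level.ack_timeout_min
        if minutes_unacknowledged < elapsed:
            return level
    # Every timeout has expired: call the last level on demand.
    return ESCALATION_CHAIN[-1]
```

For example, `level_to_page(5)` stays at level 1, while `level_to_page(50)` reaches the application owners because both earlier timeouts have expired.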
3/ Remediation: fix tickets, mean time to detect, mean time to resolve
What are the actions taken once an incident is identified?
Quick resolution is the key. You can scale up, roll back, redirect traffic or spot the differences with historical graphs. You need to get immediate help from all staff and proceed by elimination, dividing the target into components.
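Since quick resolution is the key, it helps to measure it. Below is a small Python sketch of mean time to detect and mean time to resolve. Definitions vary between teams: here I assume detection time is measured from the start of the incident and resolution time from the moment of detection. The incident timestamps are made up for the example.

```python
from datetime import datetime, timedelta
from statistics import mean


def mean_delta(pairs: list[tuple[datetime, datetime]]) -> timedelta:
    """Average the elapsed time between each (start, end) pair."""
    return timedelta(seconds=mean((end - start).total_seconds() for start, end in pairs))


# Hypothetical incidents: when each started, was detected, and was resolved.
incidents = [
    {
        "started": datetime(2024, 1, 1, 10, 0),
        "detected": datetime(2024, 1, 1, 10, 10),
        "resolved": datetime(2024, 1, 1, 11, 10),
    },
    {
        "started": datetime(2024, 1, 2, 9, 0),
        "detected": datetime(2024, 1, 2, 9, 20),
        "resolved": datetime(2024, 1, 2, 9, 50),
    },
]

# Assumption: MTTD runs from start to detection, MTTR from detection to resolution.
mttd = mean_delta([(i["started"], i["detected"]) for i in incidents])
mttr = mean_delta([(i["detected"], i["resolved"]) for i in incidents])
```

Tracking both numbers over time shows whether the effort should go into detection (stage 1) or into remediation itself.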
Here are a few questions that can help:
- What makes you think there is a performance problem?
- Has this system ever performed well?
- What has changed recently? (Software? Hardware? Load?)
- Can the performance degradation be expressed in terms of latency or run time?
- Does the problem affect other people or applications (or is it just you)?
- What is the environment? What software or hardware is used? Versions? Configuration?
Some useful tools during firefighting:
- Internal Communications (Slack or HipChat)
- Application Performance and Infrastructure Monitoring (Nagios, Prometheus, …)
- Status Updates (statuspage.io for external customers and/or template email for internal)
- Ticketing Solution (Jira, Confluence)
- Procedure Tracking (Postmortem)
Incident workflow example:
4/ Analysis: postmortem, 5 Whys, understand, non-reoccurrence, process ownership, root cause categorization
How are incidents analyzed following remediation?
Globally, up to 30% of root causes remain unknown, which causes incidents to reoccur. To avoid reoccurrence, you need:
- the root cause,
- the postmortem,
- the postactions.
Here is a list of 25 possible root causes and their descriptions:
- Code issue: regression due to a line of code
- Configuration issue: regression due to a configuration change in the application or system
- Integration issue: regression due to an integration/implementation/migration change in the application or the system
- Release validation incorrect/incomplete: the post-check deployment was incorrect/incomplete
- Service outage/degradation: any period of complete disruption or interruption of an IT service
- Hardware outage/degradation: a hardware component has an issue
- Data incorrect/incomplete: data contain misleading, inaccurate or incomplete information
- Network outage/degradation: network equipment X is down or degraded
- Database outage/degradation: database Y is down or degraded
- Access rights outage/degradation: regression on access rights for someone
- Monitoring missing/degradation: issue due to an alert missing or not complete
- ABTest issue: regression due to a bad ABTest configuration created, updated or stopped
- Decommission issue: regression due to a decommission of an application or a server
- Capacity issue: quota for device reached due to volume increase
- By design: a failure, but an expected one
- Document incorrect/incomplete: instructions were incorrect or incomplete for the on-duty engineer
- Client issue: incident due to a client
- Partner issue: incident due to a partner, external vendor or supplier
- Unpredictable event: theft, sabotage, earthquake, flood
- Issue incorrectly escalated: priority was incorrectly set
- Instructions incorrectly followed: instructions not performed as expected
- Communication issue: issue due to a lack of communication between 2 teams
- Duplicated: issue duplicated with another tracked issue
- Not reproduced: happened once and can’t be reproduced
- Root cause unknown: the bad, the ugly, the evil root cause
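Once every postmortem is tagged with one of these categories, following the share of "Root cause unknown" becomes a simple count. Here is a minimal Python sketch; the list of tagged postmortems is invented for the example and the variable names are mine.

```python
from collections import Counter

# Hypothetical root cause tags, one per closed postmortem.
postmortems = [
    "Code issue",
    "Configuration issue",
    "Root cause unknown",
    "Code issue",
    "Capacity issue",
    "Root cause unknown",
    "Network outage/degradation",
    "Data incorrect/incomplete",
    "Partner issue",
    "Root cause unknown",
]

counts = Counter(postmortems)

# Share of incidents that may reoccur because nobody knows why they happened.
unknown_share = counts["Root cause unknown"] / len(postmortems)

# Most frequent categories first, useful for a periodic report.
ranking = counts.most_common()
```

A growing `unknown_share` is a signal that postmortems are being closed too early.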
For postmortems and post-actions, we use the 5 Whys performance method to dig deeper into the problem:
- Given the delivered performance, ask “why?”, then answer this question
- Given the previous answer, ask “why?” again, then answer this question, and repeat until “why?” has been asked 5 times
More details on this part can be found here https://medium.com/@kwa29/it-post-mortem-guidelines-77214c6e7e34
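As an illustration of the 5 Whys, here is a small Python sketch that records the chain of answers during a postmortem. The `FiveWhys` class and the example problem and answers are hypothetical; the only thing it enforces is the limit of 5 questions.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class FiveWhys:
    problem: str
    answers: list[str] = field(default_factory=list)
    max_depth: int = 5

    def ask_why(self, answer: str) -> None:
        """Record the answer to the next "why?" in the chain."""
        if len(self.answers) >= self.max_depth:
            raise ValueError(f"already asked 'why?' {self.max_depth} times")
        self.answers.append(answer)

    @property
    def root_cause(self) -> Optional[str]:
        """The deepest answer so far, or None if nothing has been asked."""
        return self.answers[-1] if self.answers else None

    def transcript(self) -> list[str]:
        """Return the chain as printable lines for the postmortem document."""
        lines = [f"Problem: {self.problem}"]
        for depth, answer in enumerate(self.answers, start=1):
            lines.append(f"Why #{depth}? {answer}")
        return lines


# Hypothetical postmortem session.
session = FiveWhys("Checkout page latency tripled")
for answer in [
    "The database was slow",
    "One query scanned a full table",
    "An index was dropped",
    "A migration script removed it",
    "The migration was not reviewed",
]:
    session.ask_why(answer)
```

The last answer in the chain is the candidate root cause, and each intermediate answer often suggests its own post-action.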
5/ Readiness: improvement, game days, learning, freeze period, chaos monkey
What are the processes in place for effective incident response?
Several processes help people respond to incidents effectively:
- a knowledge base,
- capacity planning to assess production capacity needs,
- a continuous improvement culture,
- frequent deployments with small iterations,
- frequent training, both external and internal,
- freeze periods during key business dates (EOY, Black Friday, Sales, …) to protect the business,
- chaos engineering, for example Chaos Monkey, a service that randomly terminates VM instances and containers.
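A freeze period is easy to enforce automatically, for example as a check at the start of a deployment pipeline. The Python sketch below assumes two freeze windows with made-up dates; `FREEZE_PERIODS` and `is_frozen` are hypothetical names.

```python
from datetime import date

# Hypothetical freeze windows (inclusive); replace with your own business dates.
FREEZE_PERIODS = [
    (date(2024, 11, 25), date(2024, 12, 2)),  # around Black Friday
    (date(2024, 12, 20), date(2025, 1, 2)),   # end of year
]


def is_frozen(day: date) -> bool:
    """Return True if deployments should be blocked on the given day."""
    return any(start <= day <= end for start, end in FREEZE_PERIODS)
```

A pipeline could call `is_frozen(date.today())` and refuse to deploy when it returns True, leaving an explicit override for emergency fixes.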
Finally, from my point of view, the future of Incident Management will be a mix of AI and processes.