Story 4: I’m an Incident Commander

I have worked as a team leader for a while now, in several companies. That led me to look at different processes and organisations for incident management. Often, and especially in startups, incident management either didn’t exist or was minimal. The result was almost always the same: people run in every direction and fail to mitigate the risk as quickly as possible. To sketch a quick and, I’m pretty sure, common experience: we face a big issue with a reputation or revenue risk, and people go in every direction without guidance or pragmatism. The outcome is much the same each time: we lose time in a stressful environment.

Actions taken

To reduce the risk, I often put the same combo in place in my teams: MRO and Incident Commander.
MRO stands for Maintenance, Repair and Operations; the people in this role are also called Interrupt people. The MRO is the person dedicated to managing incidents for the day. The role rotates every day. The rotation can be shown in a calendar or announced in Slack by a bot every day at 9am. Each day, a new MRO is designated. He/She must check whether any incident happened during the night or the day (tickets, new emails, Slack channels like mro, interrupt, incident_room or sre). He/She doesn’t work on sprint tickets or projects that day. He/She handles alerts and incidents first, then user questions from Slack or email. Automated alerts should always be the highest priority. At the end of the day, he/she hands over to remote colleagues. It is also important to note that the MRO can be on duty for the night. In that case, you should avoid assigning that person as MRO the next day.
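The daily rotation described above can be sketched in a few lines. This is only a minimal illustration, not the tooling I actually used: the function name `pick_mro`, the team names and the `on_night_duty` parameter are all hypothetical, and posting the result to Slack at 9am is left out.

```python
from datetime import date, timedelta
from typing import Optional, Sequence


def pick_mro(team: Sequence[str], day: date,
             on_night_duty: Optional[str] = None) -> str:
    """Pick the MRO for `day` with a simple round-robin over `team`.

    `on_night_duty` (hypothetical parameter) is the person who covered
    the previous night; that person is skipped so they can rest.
    """
    if not team:
        raise ValueError("team must not be empty")
    # Use the day number as the rotation index so the role moves on daily.
    start = day.toordinal() % len(team)
    for offset in range(len(team)):
        candidate = team[(start + offset) % len(team)]
        if candidate != on_night_duty:
            return candidate
    # Only reached when the whole team is the single night-duty person.
    raise ValueError("nobody is available to be MRO")


if __name__ == "__main__":
    team = ["ana", "ben", "chloe"]  # hypothetical team members
    today = date.today()
    print("MRO today:", pick_mro(team, today))
    print("MRO tomorrow:", pick_mro(team, today + timedelta(days=1)))
```

Skipping the night-duty person shifts the turn to the next colleague for that day only, so the rotation is not perfectly fair over time. A real setup would more likely rely on the calendar or on-call tool the team already uses.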
Then, the Incident Commander is the person responsible for managing everyone involved during an emergency incident. The role has a management side and also a technical one: following the investigation and proposing ideas to mitigate. He/She usually handles incident communication and the synchronisation effort. He/She should not perform any actions or remediations, check graphs or investigate logs. Those tasks must be delegated to experts in their field.

Lessons learned

If you put even a basic incident management process in place, you will probably see people less stressed and sometimes even happy to manage an issue. I always keep in mind: less stress, more strength to face an issue, faster mitigation.