Problem

Increasingly, incident management at startup companies doesn’t exist or is very minimal. The result is almost always the same: people respond in inconsistent ways and fail to mitigate the risk as quickly as possible.

To give a quick and, I’m pretty sure, common example: my team faced a big issue that put reputation or revenue at risk. People worked without any guidance or pragmatism, and we ended up losing time and creating a stressful environment. As team lead, I looked at different processes and ways of organizing incident management. The following is my current process.

Actions taken

To reduce the risk, I enlist someone from my team to serve as Maintenance, Repair and Operations (MRO) and Incident Commander. This person is dedicated to managing incidents for one day, and the role rotates from person to person daily. Rotations can be organized with a shared calendar or a Slack bot.
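As a minimal sketch of what such a daily rotation could look like, assuming a fixed team list and a known rotation start date: the names in `TEAM` and the `mro_on` helper are hypothetical, and the sketch does not call any real calendar or Slack API.

```python
from datetime import date

# Hypothetical team members, listed in rotation order.
TEAM = ["alice", "bob", "carol"]


def mro_on(day: date, team: list[str], start: date) -> str:
    """Return who holds the MRO role on `day`, rotating daily from `start`."""
    if not team:
        raise ValueError("team must not be empty")
    if day < start:
        raise ValueError("day is before the rotation start")
    # One full day per person, wrapping around at the end of the list.
    offset = (day - start).days
    return team[offset % len(team)]
```

A bot or a scheduled job could call `mro_on(date.today(), TEAM, start)` each morning and announce the result in the team channel.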

Each day at 9:00 am, a new MRO person is designated. They must check whether any incidents occurred during the night, and keep checking throughout the day (tickets, new emails, and Slack channels such as MRO, Interrupt, Incident Room or SRE). They don’t work on sprint tickets or projects during this day. They handle alerts and incidents first, and then user questions on Slack or by email. Of course, automated alerts should be the highest priority. At the end of the day, they hand over to a remote colleague. It is also important to note that MRO people can be on duty through the night. In that case, you should avoid assigning that person as MRO again until the next day.
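The priority order described above can be sketched as a simple sort. This is only an illustration: the `Kind` and `Item` types are hypothetical, and ordering items of the same kind by arrival time is my assumption, not a rule of the process.

```python
from dataclasses import dataclass
from enum import IntEnum


class Kind(IntEnum):
    """Lower value means higher priority for the MRO person."""
    AUTOMATED_ALERT = 0
    INCIDENT = 1
    USER_QUESTION = 2  # questions from Slack or email


@dataclass(frozen=True)
class Item:
    kind: Kind
    received_minute: int  # minutes since the start of the shift (assumed field)
    summary: str


def triage(items: list[Item]) -> list[Item]:
    """Order the queue: alerts, then incidents, then user questions.

    Within one kind, older items come first (an assumption).
    """
    return sorted(items, key=lambda item: (item.kind, item.received_minute))
```

For example, a user question received at the start of the shift would still be handled after an automated alert that arrives an hour later.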

Additionally, the Incident Commander is the person responsible for managing the team during an emergency incident. This role requires management skills as well as enough technical skill to follow the discussion and propose mitigation ideas. They often handle incident communication and coordination efforts. They should not perform any actions or remediations, check graphs or investigate logs; those tasks must be delegated to experts in their field.

Lessons learned

If you put a form of incident management in place proactively, you will probably see that people are less stressed and sometimes even happy to manage an issue. I always keep in mind: with less stress and more strength to face an issue, you mitigate faster.