When everything blows up

Production issues will happen. How can you find some order in the chaos?

Oh dear… (Source: ROMEO nuclear test, Atomic Bomb Test Photos)
It isn’t really, little coffee dude.

Chaos is normal

If you are working at a SaaS company then I will make you a bet: something is going to go horribly wrong in production this year.

Defining those moments of panic

To the disorganized or inexperienced, these chaotic moments of panic are truly awful. You don’t know what to do first, you’re overwhelmed, and your unease transmits to those around you, causing a swell, then a breaking wave, of paranoia. However, these moments can feel saner with a simple process wrapped around them to guide you through the mess.

  • A problem is the unknown cause of one or more incidents, often identified as the result of multiple similar incidents.

You need to do your best to track and log incidents so that they don’t turn into problems or, if they do, so that those problems are kept as small as possible.
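One way to picture the tracking described above is a small incident log that flags repeated, similar incidents as candidate problems. This is only a sketch under stated assumptions: the `Incident` fields, the `signature` label, and the threshold of two are all hypothetical and not something the article prescribes.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Incident:
    """One logged production incident (all field names are illustrative)."""
    opened_at: datetime
    summary: str
    signature: str  # hypothetical label for "what kind of failure was this?"


def candidate_problems(incidents, threshold=2):
    """Return the signatures seen in at least `threshold` incidents.

    Repeated similar incidents hint at an underlying problem
    whose cause is still unknown.
    """
    grouped = defaultdict(list)
    for incident in incidents:
        grouped[incident.signature].append(incident)
    return {sig: group for sig, group in grouped.items() if len(group) >= threshold}


# Made-up example data, purely for illustration.
log = [
    Incident(datetime(2024, 1, 3, 9, 15), "API returning 500s", "db-connection-pool"),
    Incident(datetime(2024, 1, 9, 14, 2), "Checkout timing out", "db-connection-pool"),
    Incident(datetime(2024, 1, 11, 22, 40), "Emails delayed", "queue-backlog"),
]

for signature, group in candidate_problems(log).items():
    print(f"{signature}: {len(group)} incidents")
```

In practice the log would live in whatever ticketing or tracking tool you already use; the point is only that incidents need to be recorded consistently enough for similar ones to be spotted.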

Incident management

When things blow up, there are three hats that need to be worn by the staff working on it:

  1. The communicator, who is responsible for broadcasting to the business what is going on at regular intervals.
  2. The technical expert, who is identifying and fixing the issue.

A playbook for incidents

Define a playbook to follow when an incident occurs. A simple one could look like the following:

  1. A decision on the means of communication between those working on the incident. Typically we use Slack for this, but any way is fine as long as it works for you.
  2. Communication to the rest of the business that an incident is occurring: what it is, what’s being done about it, and how often updates will be given. Typically we give updates every thirty minutes via Slack. Additionally, you should update your customer-facing status page if required, and send out notifications to customers if it is deemed necessary.
  3. Regular internal communication documenting what is being done to recover from the incident. This includes any major decisions that have been made. This can be used later to review the incident and learn from it.
  4. The continuation of steps 2–3 until the incident is fixed. When it is fixed, the business should be notified and the work that resolved the issue should be documented.
  5. Scheduling a 5 Whys postmortem to get to the root of the incident and decide on actions to prevent it in the future.
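The communication parts of the playbook above can be sketched as a simple timeline that records broadcasts and decisions and reports when the next business update is due. This is a minimal illustration, not a prescribed tool: the `IncidentTimeline` class, its method names, and the entry kinds are hypothetical, and the thirty-minute interval is simply the cadence mentioned in the list.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Optional

UPDATE_INTERVAL = timedelta(minutes=30)  # the cadence mentioned in the playbook


@dataclass
class IncidentTimeline:
    """Running record of one incident (names are illustrative only)."""
    title: str
    started_at: datetime
    entries: list = field(default_factory=list)  # (timestamp, kind, text)
    resolved_at: Optional[datetime] = None

    def record(self, when, kind, text):
        """Log a 'broadcast' to the business or an internal 'decision'."""
        self.entries.append((when, kind, text))

    def last_broadcast(self):
        times = [when for when, kind, _ in self.entries if kind == "broadcast"]
        return max(times) if times else None

    def update_due(self, now):
        """True if the business is owed another update."""
        if self.resolved_at is not None:
            return False
        last = self.last_broadcast()
        return last is None or now - last >= UPDATE_INTERVAL

    def resolve(self, when, text):
        """Tell the business the incident is fixed and close the timeline."""
        self.record(when, "broadcast", text)
        self.resolved_at = when


start = datetime(2024, 1, 3, 9, 0)
timeline = IncidentTimeline("API returning 500s", start)
timeline.record(start, "broadcast", "Investigating elevated error rates.")
timeline.record(start + timedelta(minutes=10), "decision", "Rolling back the latest deploy.")

print(timeline.update_due(start + timedelta(minutes=20)))  # False
print(timeline.update_due(start + timedelta(minutes=30)))  # True

timeline.resolve(start + timedelta(minutes=45), "Rollback complete; service restored.")
print(timeline.update_due(start + timedelta(hours=2)))  # False
```

The recorded entries double as the raw material for the later review, since every broadcast and major decision is timestamped as it happens.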

After an incident

As specified in step 5 above, once an incident is over and the service has been restored, it is useful to run a 5 Whys session. These are well-documented elsewhere on the Web, so I won’t go into detail about how to run one. However, it is important to note that you do not want incidents to turn into problems that go unfixed long term.

In summary

Don’t just treat incidents as annoying things getting in the way of doing your real work. Take them seriously and do the work to make them happen less often in the future; you don’t want them to turn into problems that drive your customers away.

VP Engineering @brandwatch. Writing things that interest me. Hopefully they'll interest you as well.
