Bad DevOps: Why PagerDuty posting to Slack was a terrible idea

Posting PagerDuty incidents to a Slack channel is a really bad idea and goes against DevOps best practices. Here I go over why it’s a bad idea, why I did it anyway, and what I do now instead.

Why is it a bad idea?

Initially, I’d set up PagerDuty to post to the main dev channel. Within a day or two, everyone on the channel would get tired of the notifications interrupting conversations and I’d move the alerts to their own dedicated channel. From there, it was only a week or so before I’d stop proactively checking that channel and boom, I was back where I started.

The problem was always noise. Too many alerts? I’m going to start ignoring the channel. Too many un-actionable alerts? Yup, going to start ignoring it too.

If an event comes up that needs action, I need to know that someone has seen it and taken ownership. And it’s far too easy for a Slack message to be pushed out of view by newer messages, disappearing forever without anyone having looked at it, let alone taken ownership of it.

You don’t need a chat client to manage alerts; you need an issue tracker.

Why did I do it then?

When configuring my alerts and thresholds, I was always deciding how important an event was. Wake me up in the middle of the night? Make sure someone fixes it within a few days? Ignore it? Deciding on a threshold is hard; most of the time you’re taking an educated guess. Slack seemed like an easy shortcut: throw everything in there and I’ll go through it later to find the important ones. It took me a while to figure out that this always created more noise than I could handle.

What do I do now instead?

I have three policies for incidents: “wake me up”, “create a ticket”, and “ignore”. Every alert has to be assigned to one of these levels. There’s no in-between.
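
To make that concrete, here’s a minimal Python sketch of what “every alert gets exactly one policy” could look like. The three policy names come from above; the alert names and routing table are invented for illustration, not my actual configuration.

```python
from enum import Enum

# The three incident policies. Every alert must map to exactly one of them.
class Policy(Enum):
    WAKE_ME_UP = "wake me up"            # page the on-call person immediately
    CREATE_A_TICKET = "create a ticket"  # needs fixing within a few days
    IGNORE = "ignore"                    # deliberately not worth anyone's time

# Hypothetical routing table: alert name -> policy. Anything not listed here
# is rejected, so no alert can exist without an explicit decision.
ALERT_POLICIES = {
    "app_error_budget_exhausted": Policy.WAKE_ME_UP,
    "background_queue_backlog": Policy.CREATE_A_TICKET,
    "db_cpu_high": Policy.IGNORE,  # a cause, not a symptom
}

def route(alert_name: str) -> Policy:
    try:
        return ALERT_POLICIES[alert_name]
    except KeyError:
        # There is no "dump it in Slack and sort it out later" fallback:
        # an unclassified alert is a configuration error.
        raise ValueError(f"alert {alert_name!r} has no policy assigned")

if __name__ == "__main__":
    print(route("app_error_budget_exhausted"))  # Policy.WAKE_ME_UP
```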

I have alerts set for symptoms, not causes. I don’t care if my database is using a lot of CPU or if AWS is having a disruption of service affecting some servers in my region. I care about my app’s response times and error rates.
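
As a rough sketch of what a symptom-only check could look like, here’s some illustrative Python. The metric names and thresholds are placeholders picked for the example, not values from my setup; note that cause-level metrics like database CPU simply aren’t an input.

```python
from dataclasses import dataclass

@dataclass
class WindowStats:
    """Symptoms aggregated over one measurement window (say, the last 5 minutes)."""
    p95_response_ms: float  # what users actually experience
    error_rate: float       # fraction of requests that failed, 0.0 to 1.0

# Deliberately absent: database CPU, AWS status, disk usage. Those are causes;
# if they matter, they'll show up in the two symptoms above.

def symptom_breach(stats: WindowStats,
                   max_p95_ms: float = 2000.0,
                   max_error_rate: float = 0.01) -> bool:
    """Return True only when users are actually being affected."""
    return stats.p95_response_ms > max_p95_ms or stats.error_rate > max_error_rate

if __name__ == "__main__":
    print(symptom_breach(WindowStats(p95_response_ms=350.0, error_rate=0.002)))   # False
    print(symptom_breach(WindowStats(p95_response_ms=4800.0, error_rate=0.002)))  # True
```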

I have a budget for error rates. For some (ultra-rare) things I need 99.999% reliability, but for most parts of an app that percentage is a lot, lot lower. How low depends on your business, how much a failure impacts end users, and so on. So for each component of an app, I have an error budget, and I only get alerted if the error rate blows that budget.
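
Here’s the kind of back-of-the-envelope arithmetic I’m talking about. The SLO target, traffic figure, and burn-rate cutoff below are made-up numbers for illustration only:

```python
# Error-budget arithmetic with illustrative numbers.
slo = 0.999                      # target success rate for this component
requests_per_month = 50_000_000  # hypothetical traffic

error_budget = (1 - slo) * requests_per_month
print(f"Allowed failed requests this month: {error_budget:,.0f}")  # 50,000

# Alert on the budget being spent too fast, not on individual errors.
errors_so_far = 12_000
fraction_of_month_elapsed = 0.10  # roughly 3 days into a 30-day window

budget_consumed = errors_so_far / error_budget            # 0.24
burn_rate = budget_consumed / fraction_of_month_elapsed   # 2.4x the sustainable pace
print(f"Budget consumed: {budget_consumed:.0%}, burn rate: {burn_rate:.1f}x")

if burn_rate > 2.0:
    print("Page someone: at this pace the budget is gone well before month end.")
```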

Similarly, I have my metric thresholds (page response times, queue capacity) set as high as possible. The cost of waking someone up in the middle of the night is extremely high, so I want to be 100% sure the alert is worth it. I can always lower the thresholds later as I get a better idea of what reasonable values look like.
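
As a sketch of what “start high and only page on a sustained breach” could look like in Python (the threshold and the number of consecutive checks are placeholders, not recommendations):

```python
from collections import deque

P95_THRESHOLD_MS = 5000            # deliberately generous starting point
CONSECUTIVE_BREACHES_TO_PAGE = 3   # require a sustained problem, not a blip

_recent = deque(maxlen=CONSECUTIVE_BREACHES_TO_PAGE)

def record_check(p95_ms: float) -> bool:
    """Record one measurement; return True only if it's time to page."""
    _recent.append(p95_ms > P95_THRESHOLD_MS)
    return len(_recent) == _recent.maxlen and all(_recent)

if __name__ == "__main__":
    for sample in [800, 6200, 7100, 6900]:
        if record_check(sample):
            print(f"Page: p95 has been over {P95_THRESHOLD_MS}ms for "
                  f"{CONSECUTIVE_BREACHES_TO_PAGE} consecutive checks")
```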

To help fix the problem of noisy alerts at MeetEdgar, we’re building an alternative to PagerDuty. Ropig is being built from the ground up to follow Google’s site reliability engineering best practices, and it supports any service that can send an alert via webhook (which is 99.9% of them). Sign up below for early access.