When (not if) your service dies, do you have an EKG to tell you so?
It's important to be alerted when the processes you are responsible for are no longer producing their expected outcome. Most of the time you want these alerts to be driven by well-defined SLOs, but one obvious failure mode that will break an SLO is a service that flatlines: it is no longer running, or it is no longer making any progress.
Related reading: "SLIs, SLOs, and SLAs: What Are They?" You might have heard these terms in reference to commitments with vendors or customers, but what are they, and why should you care?
Example scenarios
Some examples of flatlining include:
Failure to start
- Your service runs on Kubernetes and its pods can't be scheduled or are in a CrashLoopBackOff
- Your service isn't being provided environment variables or other configuration it depends on to start
Failure to be reached
- Clients of your service can't reach it because of a problem with service discovery or DNS resolution
Failure to make progress
- Your service isn't releasing network connections or threads, causing other threads to wait forever
- Your service is deadlocked because threads acquire mutexes in conflicting order
- Your service is repeatedly consuming a "poison pill" message from a queue, preventing progress
Failure to stay alive
- Your service is overwhelmed with data, causing it to run out of memory
- Your service is running on a server with a full disk, or it fills a disk every time it starts
Some of these situations can be tricky to detect, but it is imperative that you try to.
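A generic way to catch several of these "no progress" failure modes is to have the service publish a heartbeat each time it completes a unit of work, and to alert when that heartbeat goes stale. Below is a minimal sketch using the Python prometheus_client library; the metric name, port, and work loop are illustrative, not a prescription.

```python
# Minimal sketch: expose a "last successful work" timestamp so an external
# alert can fire when it goes stale. Metric and function names are illustrative.
import time

from prometheus_client import Gauge, start_http_server

LAST_PROGRESS = Gauge(
    "worker_last_progress_timestamp_seconds",
    "Unix time of the last successfully processed unit of work",
)

def process_next_item():
    # Placeholder for your real work (consume a message, handle a request, ...)
    time.sleep(1)

def main():
    start_http_server(8000)  # scrape target for your monitoring system
    while True:
        process_next_item()
        LAST_PROGRESS.set_to_current_time()  # only advances when work completes

if __name__ == "__main__":
    main()
```

An alert on the heartbeat being too old, or on the metric being absent entirely, catches deadlocks, poison pills, and dead processes alike, because the timestamp only advances when real work finishes.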
Example metrics
Here are some common failure metrics to monitor. When one of them crosses a sensible threshold, an SLO is likely to be broken:
Service orchestration errors: Kubernetes or other compute orchestrators should emit metrics when they fail to schedule services or when services fail to start.
Out-of-memory exceptions: if a service runs out of memory multiple times in a row then it is likely to keep doing so.
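If those orchestrator metrics aren't already flowing into your alerting system (for example via kube-state-metrics), you can derive similar signals yourself. The sketch below uses the official Kubernetes Python client to count pods that can't be scheduled, are crash-looping, or were OOM-killed; the namespace and what you do with the counts are assumptions.

```python
# Minimal sketch using the official Kubernetes Python client to surface
# pods that can't start or keep dying. Namespace is an assumption.
from kubernetes import client, config

def count_unhealthy_pods(namespace="default"):
    config.load_kube_config()  # or config.load_incluster_config() inside a cluster
    v1 = client.CoreV1Api()

    unschedulable = crash_looping = oom_killed = 0
    for pod in v1.list_namespaced_pod(namespace).items:
        for cond in pod.status.conditions or []:
            if (cond.type == "PodScheduled" and cond.status == "False"
                    and cond.reason == "Unschedulable"):
                unschedulable += 1
        for cs in pod.status.container_statuses or []:
            if cs.state.waiting and cs.state.waiting.reason == "CrashLoopBackOff":
                crash_looping += 1
            if (cs.last_state.terminated
                    and cs.last_state.terminated.reason == "OOMKilled"):
                oom_killed += 1
    return unschedulable, crash_looping, oom_killed

if __name__ == "__main__":
    print(count_unhealthy_pods())
```

Exporting these counts as gauges and alerting when any of them stays above zero for more than a few minutes covers both the failure-to-start and failure-to-stay-alive scenarios above.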
Large message queue depth (backlog): if a message queue's depth rises above a certain threshold then it may indicate a service cannot make progress.
Most durable message queues track queue depth as a message count, and some also track the age of the oldest message. Only rely on a message age metric emitted by the broker, not by your service: if your service isn't consuming anything, it can't emit the metric at all, and the missing data can make an unhealthy service look healthy.
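As one concrete illustration, if you happen to use Amazon SQS, the broker already provides both signals: queue depth is a queue attribute, and AWS publishes the age of the oldest message as the CloudWatch metric ApproximateAgeOfOldestMessage. A rough boto3 sketch, with the queue name, URL, and thresholds as placeholders:

```python
# Rough sketch for Amazon SQS: read queue depth from the broker and the
# age of the oldest message from CloudWatch. Queue name/URL and the
# thresholds below are placeholders.
from datetime import datetime, timedelta, timezone

import boto3

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue"  # placeholder
QUEUE_NAME = "my-queue"                                                  # placeholder
MAX_DEPTH = 10_000
MAX_AGE_SECONDS = 15 * 60

def backlog_is_unhealthy():
    sqs = boto3.client("sqs")
    depth = int(
        sqs.get_queue_attributes(
            QueueUrl=QUEUE_URL, AttributeNames=["ApproximateNumberOfMessages"]
        )["Attributes"]["ApproximateNumberOfMessages"]
    )

    cloudwatch = boto3.client("cloudwatch")
    now = datetime.now(timezone.utc)
    datapoints = cloudwatch.get_metric_statistics(
        Namespace="AWS/SQS",
        MetricName="ApproximateAgeOfOldestMessage",  # emitted by AWS, not your consumer
        Dimensions=[{"Name": "QueueName", "Value": QUEUE_NAME}],
        StartTime=now - timedelta(minutes=10),
        EndTime=now,
        Period=300,
        Statistics=["Maximum"],
    )["Datapoints"]
    oldest_age = max((dp["Maximum"] for dp in datapoints), default=0)

    return depth > MAX_DEPTH or oldest_age > MAX_AGE_SECONDS  # page if True
```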
Messages produced to a dead-letter queue (DLQ): if a message fails to be processed multiple times, or within some reasonable timespan, it should be routed to a DLQ so it doesn't block other messages in the queue.
If you aren't using a DLQ, you probably should be; if you can't, monitoring a high redelivery rate on the queue is a reasonable substitute.
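The routing logic itself is simple: track how many times a message has been attempted and move it aside once a limit is reached. Here is an in-memory Python sketch of the pattern using the standard library's queue module; in production the two queues would live in your broker, and most brokers can do this routing for you.

```python
# In-memory sketch of the DLQ pattern: a message that keeps failing is moved
# to a dead-letter queue after MAX_ATTEMPTS instead of blocking the main queue.
# In production the two queues would live in your broker, not in this process.
import queue

MAX_ATTEMPTS = 5

def handle(body):
    # Stand-in for real processing; the "poison" body always fails.
    if body == "poison":
        raise ValueError("cannot process this message")

def run(main_queue, dead_letter_queue):
    while not main_queue.empty():
        body, attempts = main_queue.get()
        try:
            handle(body)
        except Exception:
            if attempts + 1 >= MAX_ATTEMPTS:
                dead_letter_queue.put(body)           # park it; alert on DLQ size
            else:
                main_queue.put((body, attempts + 1))  # redeliver for another try

if __name__ == "__main__":
    main_q, dlq = queue.Queue(), queue.Queue()
    for body in ["ok", "poison", "ok"]:
        main_q.put((body, 0))
    run(main_q, dlq)
    print("dead-lettered:", dlq.qsize())  # 1
```

Alerting on the DLQ's size, or on any message being produced to it at all, tells you a poison pill showed up without letting it stall everything queued behind it.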
Errors observed by clients: if clients of your service observe an abnormal rate of timeouts or errors, then it may indicate a systemic problem with your service or with the network in between.
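On the client side this can be as simple as wrapping outbound calls and counting outcomes. Here is a minimal sketch using the Python requests and prometheus_client libraries; the URL and metric names are illustrative.

```python
# Minimal sketch: a client counts timeouts and error responses from its
# dependency so an alert can fire on an abnormal error rate.
import requests
from prometheus_client import Counter

REQUESTS_TOTAL = Counter(
    "dependency_requests_total", "Calls made to the dependency", ["outcome"]
)

def call_dependency(payload):
    try:
        response = requests.post(
            "https://example.internal/api", json=payload, timeout=2.0
        )
        response.raise_for_status()
    except requests.Timeout:
        REQUESTS_TOTAL.labels(outcome="timeout").inc()
        raise
    except requests.RequestException:
        REQUESTS_TOTAL.labels(outcome="error").inc()
        raise
    REQUESTS_TOTAL.labels(outcome="success").inc()
    return response.json()
```

Alerting on the ratio of timeouts and errors to total calls catches the case where a service looks healthy from the inside but nobody can actually reach it.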
Conclusion
Think about the services you operate now. If any of them died, would it make a loud noise like an EKG flatlining, or would you be none the wiser?