What I actually monitor in production (and why)

May 2026 · Sam Reid

I've been running production web services long enough to have been burned by most of the major failure modes at least once. Over time I've settled on a monitoring setup I feel confident in: most of the time I find out about problems before users do, and shortly afterward when I don't.

Here's what I actually monitor and why each piece is there.

HTTP uptime checks

The baseline. Every externally-facing URL gets an HTTP check every minute or faster. This catches the hard failures: server crashed, out of disk, misconfigured proxy, accidental firewall rule. When these fire, something is seriously wrong and I need to know immediately.
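
For anyone who wants to see the shape of such a check, here is a minimal sketch in Python using only the standard library. The function name, timeout, and User-Agent string are my own assumptions for illustration, not part of any particular monitoring product, and a real checker would run this on a schedule from more than one location.

```python
import urllib.error
import urllib.request
from typing import Tuple


def check_url(url: str, timeout: float = 10.0) -> Tuple[bool, str]:
    """Do one HTTP GET and return (ok, detail). Hypothetical helper."""
    request = urllib.request.Request(
        url, headers={"User-Agent": "uptime-check-sketch"}
    )
    try:
        # urlopen follows redirects and raises HTTPError on 4xx/5xx responses.
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return True, f"HTTP {response.status}"
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status.
        return False, f"HTTP {exc.code}"
    except OSError as exc:
        # Covers DNS failures, refused connections, TLS errors, and timeouts.
        return False, f"unreachable: {exc}"
```

Calling check_url("https://example.com/") once a minute and alerting after two or three consecutive failures is usually enough to avoid paging on a single dropped packet.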

I don't rely on these as my only check. See below.

Visual screenshot monitoring

For every URL that actual users interact with — homepage, login, signup, dashboard, main product pages — I also run visual checks using GrabDiff. It loads the URL in a headless browser, takes a screenshot, and compares it to a baseline.

This catches what HTTP can't: JavaScript crashes, blank screens, broken layouts after a CSS deploy, API-dependent content that silently stops loading. I've been caught by this class of failure more than once. HTTP said up. Users saw nothing.

The check interval for visual monitoring is slower than HTTP checks — 30 minutes is fine for most things because these failures tend to persist rather than flicker. But the coverage is different and both are needed.
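
To make "compares it to a baseline" concrete, here is a rough sketch of one way a pixel comparison can work. This is not GrabDiff's algorithm, which I haven't seen; it assumes both screenshots have already been decoded into flat sequences of RGB tuples by an image library, and the tolerance and 5% threshold are arbitrary values chosen for illustration.

```python
from typing import Sequence, Tuple

Pixel = Tuple[int, int, int]


def changed_fraction(
    baseline: Sequence[Pixel], current: Sequence[Pixel], tolerance: int = 16
) -> float:
    """Fraction of pixels whose largest channel difference exceeds tolerance."""
    if len(baseline) != len(current):
        # Different dimensions: treat the page as fully changed.
        return 1.0
    if not baseline:
        return 0.0
    changed = sum(
        1
        for old, new in zip(baseline, current)
        if max(abs(a - b) for a, b in zip(old, new)) > tolerance
    )
    return changed / len(baseline)


def looks_broken(
    baseline: Sequence[Pixel], current: Sequence[Pixel], max_changed: float = 0.05
) -> bool:
    """Flag the screenshot when more than max_changed of the pixels differ."""
    return changed_fraction(baseline, current) > max_changed
```

The tolerance absorbs small rendering noise such as anti-aliasing, while a blank screen or collapsed layout changes far more than 5% of the pixels.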

SSL certificate expiry

Alert at 30 days before expiry. Alert again at 14. Alert again at 7. I haven't let a certificate expire since I started doing this. The alert at 30 days gives enough time to investigate, without any urgency, whether auto-renewal is broken. By 7 days, if it's still not renewed, it's a real fire.

GrabDiff tracks this automatically for every URL I've added, so I don't need a separate service.
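
If you don't have a tool doing this for you, the 30/14/7 schedule is easy to sketch with Python's standard library. The function names here are hypothetical, and this is only one way to do it: the first function reads the certificate's notAfter date over a live TLS connection, the second maps days remaining to the tightest threshold reached.

```python
import socket
import ssl
import time
from typing import Optional, Sequence

ALERT_THRESHOLDS_DAYS = (30, 14, 7)


def cert_days_remaining(host: str, port: int = 443, timeout: float = 10.0) -> float:
    """Connect over TLS and return the days until the certificate expires."""
    context = ssl.create_default_context()
    # An already-expired or invalid certificate fails the handshake here
    # with ssl.SSLCertVerificationError, which is itself worth alerting on.
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    expires_at = ssl.cert_time_to_seconds(cert["notAfter"])
    return (expires_at - time.time()) / 86400


def crossed_threshold(
    days_remaining: float, thresholds: Sequence[int] = ALERT_THRESHOLDS_DAYS
) -> Optional[int]:
    """Return the tightest threshold reached, or None if none has been."""
    crossed = [t for t in thresholds if days_remaining <= t]
    return min(crossed) if crossed else None
```

Storing the last threshold alerted on and only notifying when crossed_threshold returns a smaller one gives exactly three alerts per certificate instead of one per check.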

Domain expiry

Same principle as SSL but for domain registration. Alert 60 days before expiry: a longer runway than for certificates, because registrar processes can be slow and you sometimes need to update payment methods. A lapsed domain takes everything down: no DNS, no email, no nothing. Worth the extra vigilance.

Cron job heartbeats

Every scheduled job — database backups, invoice generation, report emails, cleanup scripts — has a heartbeat ping at the end of its success path. If the ping doesn't arrive within the expected window, I get alerted.
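
Here is a minimal sketch of both halves of that pattern, under some assumptions: the ping URL is a placeholder for whatever heartbeat service you use, the helper names are hypothetical, and the output check is just "the file exists and is not empty". The key design point is that the ping is the last step, so any failure before it results in silence, and silence is what triggers the alert.

```python
import os
import urllib.request
from datetime import datetime, timedelta
from typing import Callable


def backup_looks_valid(path: str, min_bytes: int = 1) -> bool:
    """Cheap sanity check: the output file exists and is not empty."""
    return os.path.isfile(path) and os.path.getsize(path) >= min_bytes


def send_heartbeat(ping_url: str, timeout: float = 10.0) -> None:
    """Tell the monitoring side that the job finished successfully."""
    with urllib.request.urlopen(ping_url, timeout=timeout):
        pass


def run_with_heartbeat(job: Callable[[], None], ping_url: str) -> None:
    """Run job, then ping. An exception in job propagates and skips the ping."""
    job()
    send_heartbeat(ping_url)


def is_overdue(
    last_ping: datetime, expected_every: timedelta, grace: timedelta, now: datetime
) -> bool:
    """Monitor side: has the expected window (plus grace) passed with no ping?"""
    return now - last_ping > expected_every + grace
```

A backup job would raise if backup_looks_valid returns False, so that a run producing an empty file counts as a failure and never reaches the ping.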

Before I set this up, I had jobs fail silently for days at a time. Backups were "running" but producing empty files. Reports stopped going out. Nobody noticed until someone asked where their email was.

This is the monitoring category most people skip. Don't skip it.

Error rates from application logs

I run a simple alert on the 5xx error rate in my application logs. If the error rate spikes above a threshold, I want to know. This catches partial failures (specific endpoints throwing, not the whole server being down) that uptime checks might miss if the failure is isolated to one route.
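
As a sketch of what such an alert can look like, the snippet below counts 5xx responses in a batch of log lines and compares the rate to a threshold. It assumes access-log lines in the common/combined format, where the status code follows the quoted request; the 5% threshold and the minimum request count are illustrative values, not recommendations.

```python
import re
from typing import Iterable, Tuple

# Assumption: common/combined log format, e.g.
# 203.0.113.7 - - [12/May/2026:10:00:00 +0000] "GET / HTTP/1.1" 200 512
STATUS_RE = re.compile(r'" (\d{3}) ')


def count_5xx(lines: Iterable[str]) -> Tuple[int, int]:
    """Return (errors, total) over the lines that contain a status code."""
    errors = total = 0
    for line in lines:
        match = STATUS_RE.search(line)
        if match is None:
            continue  # skip lines that are not request entries
        total += 1
        if match.group(1).startswith("5"):
            errors += 1
    return errors, total


def should_alert(
    lines: Iterable[str], threshold: float = 0.05, min_requests: int = 20
) -> bool:
    """Alert when the 5xx rate exceeds threshold on a meaningful sample."""
    errors, total = count_5xx(lines)
    if total < min_requests:
        return False  # too little traffic to call it a spike
    return errors / total > threshold
```

Running this over the last few minutes of log lines on a schedule is enough to catch a single broken route while the homepage still returns 200.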

What I don't monitor (and why)

I don't monitor every individual API endpoint with its own uptime check. The overhead of maintaining dozens of checks for internal endpoints isn't worth it for my use case — if the app is up and the key pages look right, the internals are probably fine.

I don't monitor third-party services directly. If Stripe's API is slow, that's Stripe's status page problem. I can detect the downstream effect (payment page broken) through visual monitoring without trying to monitor things I don't control.

The overall philosophy: cover the things users actually interact with, use different check types for different failure modes, and make sure scheduled jobs aren't silent black boxes.