How to build a visual uptime monitor with Go and headless Chrome

May 2026 · Sam Reid

After a production incident where our checkout page was a white screen for 40 minutes while our ping monitor happily reported HTTP 200, I decided to build a visual uptime monitor. The idea is simple: take screenshots of your URLs on a schedule, diff them against a baseline, alert when something looks meaningfully different.

The idea is simple. The implementation has some interesting corners.

This is a writeup of how I built it in Go, what I learned, and the specific gotchas that cost me time. If you want to skip to the end, this became GrabDiff. But the technical path is worth documenting.

The core loop

The system needs to do four things:

On a schedule, load a URL in a real browser and capture a screenshot
Compare that screenshot against a stored baseline image
If the difference exceeds a threshold, send an alert with the diff image attached
Store screenshots and diffs in object storage

Let me walk through each piece.

Taking screenshots with chromedp

Go has a library called chromedp that wraps the Chrome DevTools Protocol (CDP). CDP is the low-level JSON-over-WebSocket protocol that powers Chrome DevTools - the same thing that Puppeteer uses under the hood, just from Go.

The basic screenshot flow looks like this:

func takeScreenshot(ctx context.Context, url string) ([]byte, error) {
    var buf []byte

    err := chromedp.Run(ctx,
        chromedp.Navigate(url),
        chromedp.WaitVisible("body", chromedp.ByQuery),
        chromedp.FullScreenshot(&buf, 100),
    )
    if err != nil {
        return nil, fmt.Errorf("screenshot failed for %s: %w", url, err)
    }

    return buf, nil
}

A few things to unpack here.

WaitVisible("body") waits until the body element is present in the DOM and visible. This sounds reasonable but it's actually a pretty weak signal - body is there as soon as the initial HTML is parsed, well before client-side rendering finishes or any async content loads. For a React or Next.js app, the content you actually care about isn't there yet.

Better options depending on your use case:

- chromedp.WaitVisible(".some-content-element") - wait for a specific element you know should be present
- A time.Sleep after navigation for a fixed delay (hacky but sometimes pragmatic)
- chromedp.ActionFunc with a custom JS evaluation that checks document.readyState === 'complete' and any app-specific signals you can add
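The "custom check" option boils down to a polling loop with a deadline. Here is a minimal sketch of that loop using only the standard library. The names readyCheck, waitUntilReady, and pollFakePage are made up for illustration; in a real monitor the check function would wrap a browser-side evaluation (for example of document.readyState), which is left out here so the timing logic can be read on its own.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// readyCheck reports whether the page is ready to capture. In a real
// monitor it would wrap a browser-side evaluation; it is a plain Go
// function here so the polling logic stands on its own.
type readyCheck func(ctx context.Context) (bool, error)

// waitUntilReady calls check immediately and then once per interval until
// it reports true, returns an error, or ctx is cancelled or times out.
func waitUntilReady(ctx context.Context, interval time.Duration, check readyCheck) error {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for {
		ready, err := check(ctx)
		if err != nil {
			return fmt.Errorf("ready check failed: %w", err)
		}
		if ready {
			return nil
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("page never became ready: %w", ctx.Err())
		case <-ticker.C:
		}
	}
}

// pollFakePage exercises waitUntilReady without a browser: the fake check
// turns true on the readyOnPoll-th call, and timeout bounds the whole wait.
func pollFakePage(readyOnPoll int, timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	calls := 0
	return waitUntilReady(ctx, time.Millisecond, func(context.Context) (bool, error) {
		calls++
		return calls >= readyOnPoll, nil
	})
}
```

Bounding the wait with a context deadline matters more than the poll interval: a page that never signals readiness should fail the job so it can be retried, not hold a browser open indefinitely.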

I ended up using a configurable post-navigation delay plus a networkIdle equivalent implemented via the CDP's Network domain events. More on that when we get to gotchas.

FullScreenshot(&buf, 100) - that second parameter is quality. Here's where I learned something the chromedp docs don't make obvious.

JPEG vs PNG: this matters more than you'd think

The quality parameter in FullScreenshot controls JPEG compression quality when the output is JPEG. At quality 100, chromedp actually outputs PNG. At any value below 100, it outputs JPEG.

This seems like a minor detail. It is not a minor detail for visual diffing.

JPEG compression is lossy. It introduces artifacts - subtle color changes, blocking around high-contrast edges, noise in flat color areas. When you're computing pixel differences between two screenshots taken seconds apart of the same page, JPEG artifacts will show up as differences even when nothing actually changed on the page.

I discovered this when my diff scores were consistently noisy for certain pages - pages with a lot of text on white backgrounds, where JPEG blocking was most visible. The fix was obvious once I understood it: always use PNG for screenshots that will be diffed.

// Always use quality=100 to get PNG output from chromedp
chromedp.FullScreenshot(&buf, 100)

If storage cost is a concern, you can compress the stored baseline and historical screenshots with lossless compression (PNG compression levels, or convert to WebP lossless), but the images that go through the diff pipeline need to be lossless.
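The effect is easy to reproduce without a browser. The sketch below builds a synthetic image, round-trips it through each encoder, and counts how many pixels come back different. The image pattern, the helper names samplePage and changedAfterRoundTrip, and the JPEG quality of 90 are all assumptions chosen for illustration, not values from the monitor itself.

```go
package main

import (
	"bytes"
	"errors"
	"image"
	"image/color"
	"image/jpeg"
	"image/png"
)

// samplePage draws a white canvas with scattered dark-red pixels, a rough
// stand-in for thin coloured text on a white background.
func samplePage(w, h int) *image.RGBA {
	img := image.NewRGBA(image.Rect(0, 0, w, h))
	for y := 0; y < h; y++ {
		for x := 0; x < w; x++ {
			c := color.RGBA{R: 255, G: 255, B: 255, A: 255}
			if (x*7+y*13)%5 == 0 {
				c = color.RGBA{R: 200, G: 0, B: 0, A: 255}
			}
			img.SetRGBA(x, y, c)
		}
	}
	return img
}

// changedAfterRoundTrip encodes img as "png" or "jpeg", decodes the result,
// and counts the pixels whose RGB values differ from the original.
func changedAfterRoundTrip(img image.Image, format string) (int, error) {
	var buf bytes.Buffer
	var err error
	switch format {
	case "png":
		err = png.Encode(&buf, img)
	case "jpeg":
		err = jpeg.Encode(&buf, img, &jpeg.Options{Quality: 90})
	default:
		err = errors.New("unsupported format: " + format)
	}
	if err != nil {
		return 0, err
	}

	decoded, _, err := image.Decode(&buf)
	if err != nil {
		return 0, err
	}

	changed := 0
	bounds := img.Bounds()
	for y := bounds.Min.Y; y < bounds.Max.Y; y++ {
		for x := bounds.Min.X; x < bounds.Max.X; x++ {
			r1, g1, b1, _ := img.At(x, y).RGBA()
			r2, g2, b2, _ := decoded.At(x, y).RGBA()
			if r1 != r2 || g1 != g2 || b1 != b2 {
				changed++
			}
		}
	}
	return changed, nil
}
```

Calling changedAfterRoundTrip(samplePage(64, 64), "png") reports zero changed pixels, while the "jpeg" variant reports a non-zero count even though the source image is identical. That non-zero count is the noise floor a JPEG-based diff would have to sit on top of.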

The pixel diff algorithm

Computing the difference between two images is conceptually simple. For every pixel, compare the RGB values between image A and image B. Sum up the differences. Normalize by image dimensions.

func pixelDiff(img1, img2 image.Image) (float64, image.Image) {
    bounds := img1.Bounds()
    diffImg := image.NewRGBA(bounds)

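    var totalDiff int64
    // Use Dx/Dy rather than Max.X*Max.Y so the pixel count stays correct
    // even when the bounds do not start at (0, 0).
    totalPixels := int64(bounds.Dx()) * int64(bounds.Dy())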

    for y := bounds.Min.Y; y < bounds.Max.Y; y++ {
        for x := bounds.Min.X; x < bounds.Max.X; x++ {
            r1, g1, b1, _ := img1.At(x, y).RGBA()
            r2, g2, b2, _ := img2.At(x, y).RGBA()

            // RGBA() returns values in [0, 65535], shift to [0, 255]
            dr := abs(int(r1>>8) - int(r2>>8))
            dg := abs(int(g1>>8) - int(g2>>8))
            db := abs(int(b1>>8) - int(b2>>8))

            diff := (dr + dg + db)
            totalDiff += int64(diff)

            // Highlight changed pixels in the diff image
            if diff > 10 { // threshold for "meaningfully different pixel"
                diffImg.Set(x, y, color.RGBA{R: 255, G: 0, B: 0, A: 255})
            } else {
                // Dim the unchanged areas to make changed areas stand out
                orig := img1.At(x, y)
                r, g, b, a := orig.RGBA()
                diffImg.Set(x, y, color.RGBA{
                    R: uint8(r >> 9), // ~50% brightness
                    G: uint8(g >> 9),
                    B: uint8(b >> 9),
                    A: uint8(a >> 8),
                })
            }
        }
    }

    // Normalize: max possible diff per pixel is 255*3=765
    // Total max diff is 765 * totalPixels
    score := float64(totalDiff) / float64(765*totalPixels) * 100

    return score, diffImg
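}

// abs returns the absolute value of an int. The standard library's
// math.Abs only handles float64, so the diff loop needs its own helper.
func abs(n int) int {
    if n < 0 {
        return -n
    }
    return n
}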

The diff score is a percentage: 0 means identical, 100 means every pixel is maximally different. In practice, scores above 1-2% are usually meaningful for static content pages. The threshold you want to alert on depends on your content - pages with animations, carousels, or live data need higher thresholds or region masking.
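Region masking can be done before the diff ever runs: paint the same rectangles a flat colour in both images so they contribute nothing to the score. The following is a sketch of one way to do that with image/draw. The function name maskRegions, the grey fill, and the demo rectangle are illustrative assumptions.

```go
package main

import (
	"image"
	"image/color"
	"image/draw"
)

// maskRegions returns a copy of src with every rectangle in ignore painted
// a flat grey. Applying the same mask to the baseline and to the new
// screenshot makes those regions identical, so they add zero to the score.
func maskRegions(src image.Image, ignore []image.Rectangle) *image.RGBA {
	bounds := src.Bounds()
	out := image.NewRGBA(bounds)
	draw.Draw(out, bounds, src, bounds.Min, draw.Src)

	fill := image.NewUniform(color.RGBA{R: 128, G: 128, B: 128, A: 255})
	for _, r := range ignore {
		// Clip to the image so out-of-range rectangles are harmless.
		draw.Draw(out, r.Intersect(bounds), fill, image.Point{}, draw.Src)
	}
	return out
}

// demoMask masks a 3x3 square on a white 10x10 image and returns one pixel
// from inside the mask and one from outside it.
func demoMask() (inside, outside color.RGBA) {
	page := image.NewRGBA(image.Rect(0, 0, 10, 10))
	draw.Draw(page, page.Bounds(), image.NewUniform(color.White), image.Point{}, draw.Src)

	masked := maskRegions(page, []image.Rectangle{image.Rect(2, 2, 5, 5)})
	return masked.RGBAAt(3, 3), masked.RGBAAt(8, 8)
}
```

One consequence worth keeping in mind: masked pixels still count toward the total pixel count in the normalization, so a heavily masked page produces slightly lower scores for the same visible change.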

The diff image itself is more useful than the score. Rendering changed pixels as red overlaid on a dimmed version of the original makes it immediately obvious what moved, what disappeared, and what appeared.

One important detail: both images must be the same dimensions. If your page reflows between screenshots - content loaded in that changes the page height, viewport size differences, anything that changes dimensions - your diff will be garbage or error out. I handle this by comparing dimensions before diffing and cropping to the smaller of the two if they differ, though cropping has its own issues. Mostly I prevent this by using a fixed viewport size in chromedp:

chromedp.EmulateViewport(1280, 900, chromedp.EmulateScale(1))
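For the fallback case where dimensions still differ, cropping both images to their shared area might look like the sketch below. It assumes both images start at the same origin, which holds for freshly decoded PNGs, and the names cropToCommon, cropRect, minInt, and demoCrop are hypothetical.

```go
package main

import "image"

// subImager is implemented by the concrete types the standard decoders
// return, such as *image.RGBA and *image.NRGBA.
type subImager interface {
	SubImage(r image.Rectangle) image.Image
}

// cropToCommon trims a and b to the width and height they share, keeping
// each image's top-left corner. ok is false when the shared area is empty
// or an image type cannot be cropped in place.
func cropToCommon(a, b image.Image) (ca, cb image.Image, ok bool) {
	w := minInt(a.Bounds().Dx(), b.Bounds().Dx())
	h := minInt(a.Bounds().Dy(), b.Bounds().Dy())
	sa, okA := a.(subImager)
	sb, okB := b.(subImager)
	if w == 0 || h == 0 || !okA || !okB {
		return nil, nil, false
	}
	return sa.SubImage(cropRect(a.Bounds(), w, h)), sb.SubImage(cropRect(b.Bounds(), w, h)), true
}

// cropRect builds a w-by-h rectangle anchored at the top-left of bounds.
func cropRect(bounds image.Rectangle, w, h int) image.Rectangle {
	return image.Rect(bounds.Min.X, bounds.Min.Y, bounds.Min.X+w, bounds.Min.Y+h)
}

func minInt(a, b int) int {
	if a < b {
		return a
	}
	return b
}

// demoCrop crops a 1280x2000 capture against a 1280x1800 baseline and
// returns the size both images end up with.
func demoCrop() (w, h int, ok bool) {
	tall := image.NewRGBA(image.Rect(0, 0, 1280, 2000))
	short := image.NewRGBA(image.Rect(0, 0, 1280, 1800))
	ca, cb, ok := cropToCommon(tall, short)
	if !ok || ca.Bounds() != cb.Bounds() {
		return 0, 0, false
	}
	return ca.Bounds().Dx(), ca.Bounds().Dy(), true
}
```

SubImage shares pixel memory with the original rather than copying, so the crop is cheap. A height mismatch is also a signal in its own right, and it may be worth recording alongside the score.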

Object storage with B2

Screenshots need to go somewhere. I use Backblaze B2, which is S3-compatible and substantially cheaper than S3 for storage and egress. The Go AWS SDK works with B2 out of the box with a custom endpoint:

cfg, err := config.LoadDefaultConfig(ctx,
    config.WithCredentialsProvider(credentials.NewStaticCredentialsProvider(
        os.Getenv("B2_KEY_ID"),
        os.Getenv("B2_APPLICATION_KEY"),
        "",
    )),
    config.WithRegion("us-west-004"), // match the region label in the endpoint host below
    config.WithEndpointResolverWithOptions(
        aws.EndpointResolverWithOptionsFunc(func(service, region string, options ...interface{}) (aws.Endpoint, error) {
            return aws.Endpoint{URL: "https://s3.us-west-004.backblazeb2.com"}, nil
        }),
    ),
)

I store three categories of images:

- Baselines: the reference screenshot for each monitor, updated when the user explicitly accepts a new baseline
- Latest: the most recent screenshot, always overwritten
- Diffs: generated only when a threshold breach occurs, kept for the alert email
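One way to map those three categories onto object keys is sketched below. The key layout and function names are purely hypothetical; the post does not say how the bucket is organized. The point is only that baseline and latest get one fixed key per monitor, so each write overwrites the previous object, while diffs get a timestamped key so each one is kept.

```go
package main

import (
	"fmt"
	"time"
)

// baselineKey is the single reference screenshot for a monitor.
func baselineKey(monitorID int64) string {
	return fmt.Sprintf("monitors/%d/baseline.png", monitorID)
}

// latestKey is overwritten on every check, so only one object exists.
func latestKey(monitorID int64) string {
	return fmt.Sprintf("monitors/%d/latest.png", monitorID)
}

// diffKey includes a UTC timestamp so every breach keeps its own image.
func diffKey(monitorID int64, at time.Time) string {
	return fmt.Sprintf("monitors/%d/diffs/%s.png", monitorID, at.UTC().Format("20060102T150405Z"))
}
```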

For the diff images I generate a pre-signed URL with a 7-day expiry for the alert email rather than embedding the full image inline, though for email clients that block remote images I also attach the diff as a base64-encoded attachment.

Job scheduling with River

Scheduling screenshot jobs at per-monitor intervals (30 minutes for free plan, 5 minutes for Solo, 1 minute for Pro) needs a proper job queue - not cron, not a ticker in a goroutine.
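Whatever enqueues the jobs needs a plan-to-interval mapping. A small sketch using the intervals above follows; the plan identifiers "free", "solo", and "pro" and the function names are assumptions for illustration, and unknown plans fall back to the slowest interval.

```go
package main

import "time"

// checkInterval maps a plan name to how often its monitors are checked.
// Unknown plans get the slowest interval.
func checkInterval(plan string) time.Duration {
	switch plan {
	case "pro":
		return time.Minute
	case "solo":
		return 5 * time.Minute
	default: // "free" and anything unrecognised
		return 30 * time.Minute
	}
}

// isDue reports whether a monitor whose last check finished sinceLast ago
// should be enqueued again.
func isDue(plan string, sinceLast time.Duration) bool {
	return sinceLast >= checkInterval(plan)
}
```

A caller would typically pass time.Since(lastCheckedAt) as sinceLast when deciding which monitors to enqueue.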

I use River, which is a Postgres-backed job queue for Go. It handles scheduling, retries, concurrency limits, and observability. For a solo-built SaaS, having the queue state in the same Postgres database as everything else is a meaningful operational simplification over running a separate Redis instance.

A monitor job looks roughly like this:

type ScreenshotJobArgs struct {
    MonitorID int64  `json:"monitor_id"`
    URL       string `json:"url"`
}

func (ScreenshotJobArgs) Kind() string { return "screenshot" }

type ScreenshotWorker struct {
    river.WorkerDefaults[ScreenshotJobArgs]
    db      *pgxpool.Pool
    storage *StorageClient
    mailer  *MailClient
}

func (w *ScreenshotWorker) Work(ctx context.Context, job *river.Job[ScreenshotJobArgs]) error {
    screenshot, err := takeScreenshot(ctx, job.Args.URL)
    if err != nil {
        return err // River will retry with backoff
    }

    baseline, err := w.storage.GetBaseline(ctx, job.Args.MonitorID)
    if err != nil && !errors.Is(err, ErrNoBaseline) {
        return err
    }

    if baseline == nil {
        // First run, set as baseline
        return w.storage.SetBaseline(ctx, job.Args.MonitorID, screenshot)
    }

    score, diffImg := pixelDiff(decodeImage(baseline), decodeImage(screenshot))

    if score > getThresholdForMonitor(job.Args.MonitorID) {
        diffBytes := encodePNG(diffImg)
        return w.mailer.SendAlert(ctx, job.Args.MonitorID, score, diffBytes)
    }

    return nil
}

River's periodic job scheduling means I insert a job for each active monitor when it's due, rather than trying to manage timing myself. It also handles the case where a worker crashes mid-job - the job stays in the queue and gets picked up again.

The gotchas

I want to document the things that actually cost me time, because the basic approach sounds simple and the failure modes are not obvious.

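Chrome in Docker

Running chromedp inside a Docker container has a surprise that looks like a networking problem but isn't: Chrome's sandbox depends on kernel features (unprivileged user namespaces) that Docker's default seccomp profile blocks, and Chrome refuses to start sandboxed when running as root. The symptom is that chromedp.Run() hangs indefinitely or returns mysterious errors.

The usual fix is --no-sandbox when running in Docker, with the trade-off that Chrome's own sandbox no longer isolates the pages you load from the rest of the container: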

opts := append(chromedp.DefaultExecAllocatorOptions[:],
    chromedp.Flag("no-sandbox", true),
    chromedp.Flag("disable-dev-shm-usage", true), // /dev/shm is often small in Docker
    chromedp.Flag("disable-gpu", true),
)
allocCtx, cancel := chromedp.NewExecAllocator(context.Background(), opts...)
defer cancel()

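--disable-dev-shm-usage is also important. Chrome uses /dev/shm (shared memory) for rendering, and Docker gives containers a 64MB /dev/shm by default, which causes Chrome to crash or produce corrupted screenshots for complex pages. The flag makes Chrome write those files to /tmp instead; raising the limit with docker run --shm-size is the alternative.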

Navigation timing

chromedp.Navigate() returns once the page's load event has fired (the equivalent of window.onload). For SPAs and SSR frameworks with client-side hydration, that is still too early, because the content you care about often renders after load. I settled on a two-phase wait:

Let Navigate block until the load event fires
Add a fixed post-load delay (configurable per monitor, default 2 seconds)

For pages where I know the content structure, waiting for a specific selector is more reliable than a fixed delay, but it requires per-monitor configuration that adds UX complexity.

Memory management

Chrome instances accumulate memory. If you're running many concurrent screenshot jobs, you'll see memory climb and stay there. I use a pool of Chrome contexts with a maximum concurrency limit and reset contexts periodically:

var chromeSemaphore = make(chan struct{}, 5) // max 5 concurrent Chrome instances

Each job acquires the semaphore before allocating a Chrome context and releases it after. This bounds total Chrome memory usage at the cost of queuing.
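A bare channel send blocks forever if the job's context is cancelled while it waits. Below is a sketch of the same semaphore with context-aware acquisition and a guaranteed release. The type and function names (browserSlots, withSlot, demoSlots) are illustrative, not taken from the monitor's code.

```go
package main

import (
	"context"
	"errors"
)

// browserSlots bounds how many browser instances may run at once.
type browserSlots chan struct{}

func newBrowserSlots(n int) browserSlots { return make(browserSlots, n) }

// acquire blocks until a slot is free or ctx is done.
func (s browserSlots) acquire(ctx context.Context) error {
	select {
	case s <- struct{}{}:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// release frees a slot previously taken with acquire.
func (s browserSlots) release() { <-s }

// withSlot runs fn while holding a slot and always releases it afterwards,
// including when fn returns an error.
func (s browserSlots) withSlot(ctx context.Context, fn func() error) error {
	if err := s.acquire(ctx); err != nil {
		return err
	}
	defer s.release()
	return fn()
}

// demoSlots fills both slots, shows that a caller with a cancelled context
// gives up instead of blocking, then frees a slot and runs a job.
func demoSlots() (rejected bool, ranAfterRelease bool) {
	slots := newBrowserSlots(2)
	ctx := context.Background()
	_ = slots.acquire(ctx)
	_ = slots.acquire(ctx)

	cancelled, cancel := context.WithCancel(ctx)
	cancel()
	err := slots.withSlot(cancelled, func() error { return nil })
	rejected = errors.Is(err, context.Canceled)

	slots.release()
	err = slots.withSlot(ctx, func() error {
		ranAfterRelease = true
		return nil
	})
	return rejected, ranAfterRelease && err == nil
}
```

Returning the context error from a queued job lets the job queue treat it like any other failure and retry later, instead of leaving a worker stuck waiting for a slot.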

Alert spam

When a page breaks, every check for the next 30 minutes (or 5 minutes, or 1 minute) will trigger a diff above threshold. Without deduplication, your inbox fills up with the same alert repeatedly.

I track alert state in the database: when an alert fires, record the timestamp and suppress subsequent alerts for the same monitor until either (a) the diff returns below threshold, meaning it healed, or (b) some cooldown period passes. I also send a "resolved" notification when the diff goes back below threshold, which closes the loop.
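The rules in that paragraph fit in a small pure function, which makes them easy to test separately from the database and mailer. The sketch below is one possible reading of them, with hypothetical type and function names; it treats an elapsed cooldown as permission to send a reminder while the page is still broken.

```go
package main

import "time"

type alertAction int

const (
	noAction alertAction = iota
	sendAlert
	sendResolved
)

// alertState is the per-monitor record that would be persisted.
type alertState struct {
	Firing      bool
	LastAlertAt time.Time
}

// decide returns what to send for one check result and the state to store.
func decide(st alertState, breached bool, now time.Time, cooldown time.Duration) (alertAction, alertState) {
	switch {
	case breached && !st.Firing:
		// First breach: alert and start suppressing.
		return sendAlert, alertState{Firing: true, LastAlertAt: now}
	case breached && now.Sub(st.LastAlertAt) >= cooldown:
		// Still broken and the cooldown has passed: remind.
		return sendAlert, alertState{Firing: true, LastAlertAt: now}
	case breached:
		// Still broken, inside the cooldown: stay quiet.
		return noAction, st
	case st.Firing:
		// Back under threshold: close the loop and reset.
		return sendResolved, alertState{}
	default:
		return noAction, st
	}
}

// replay feeds a sequence of check results through decide, one check every
// step, and returns the action taken for each check.
func replay(breaches []bool, step, cooldown time.Duration) []alertAction {
	var st alertState
	now := time.Unix(0, 0)
	out := make([]alertAction, 0, len(breaches))
	for _, b := range breaches {
		var a alertAction
		a, st = decide(st, b, now, cooldown)
		out = append(out, a)
		now = now.Add(step)
	}
	return out
}
```

With one-minute checks and a two-minute cooldown, three failing checks followed by two passing ones produce: alert, nothing, alert, resolved, nothing.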

SSL, domain expiry, and heartbeats

While I had the scheduling infrastructure in place, I added three more monitor types for failure modes that keep causing production incidents.

SSL/TLS expiry: Parse the certificate chain from a TLS handshake, check NotAfter, alert at 30 days and 7 days. The Go crypto/tls package makes this straightforward. The number of times I've seen a team's cert expire because someone forgot to renew it (or Let's Encrypt auto-renewal silently failed) makes this worth having.

Domain expiry via WHOIS: WHOIS parsing is a pain because there's no standard format across registrars. I use a library and handle the common cases, falling back to alerting if the query fails (better to false-positive on a WHOIS timeout than to miss an actual expiry).

Cron heartbeats: You register a URL endpoint, and your cron job hits it on each successful run. If GrabDiff doesn't see a heartbeat within the expected interval plus a grace period, it alerts. This catches the case where your cron job silently stops running - a failure mode that's invisible to every other monitoring type.
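The certificate and heartbeat checks both reduce to comparing a duration against a threshold. The sketch below shows both using only the standard library. The function names and the string alert levels are assumptions made for illustration; the 30-day and 7-day thresholds come from the description above.

```go
package main

import (
	"crypto/tls"
	"errors"
	"net"
	"time"
)

const day = 24 * time.Hour

// certNotAfter dials host on port 443, completes a TLS handshake, and
// returns the expiry time of the leaf certificate. With default
// verification an already-expired certificate fails the handshake, so an
// error here is itself worth alerting on.
func certNotAfter(host string, timeout time.Duration) (time.Time, error) {
	dialer := &net.Dialer{Timeout: timeout}
	conn, err := tls.DialWithDialer(dialer, "tcp", net.JoinHostPort(host, "443"), &tls.Config{ServerName: host})
	if err != nil {
		return time.Time{}, err
	}
	defer conn.Close()

	certs := conn.ConnectionState().PeerCertificates
	if len(certs) == 0 {
		return time.Time{}, errors.New("server presented no certificates")
	}
	return certs[0].NotAfter, nil
}

// expiryWarning maps the time left on a certificate to an alert level:
// "expired", "7d", "30d", or "" when no alert is due.
func expiryWarning(remaining time.Duration) string {
	switch {
	case remaining <= 0:
		return "expired"
	case remaining <= 7*day:
		return "7d"
	case remaining <= 30*day:
		return "30d"
	default:
		return ""
	}
}

// heartbeatOverdue reports whether a cron job has been silent for longer
// than its expected interval plus the grace period.
func heartbeatOverdue(sinceLast, interval, grace time.Duration) bool {
	return sinceLast > interval+grace
}
```

A caller would pass time.Until(notAfter) to expiryWarning and time.Since(lastHeartbeat) to heartbeatOverdue. Keeping the threshold logic free of clock reads makes it trivial to test.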

What I'd do differently

A few things I'd approach differently if starting over:

Separate Chrome pool process: Rather than spawning Chrome from the worker process, I'd run a dedicated Chrome pool as a separate service that workers call via an internal HTTP API. This makes Chrome lifecycle management cleaner and lets you scale Chrome capacity independently of worker capacity.

Region distribution: Taking screenshots from a single region means you're missing CDN edge cache issues that only affect certain geographies. I have this on the roadmap but haven't built it yet.

Smarter baseline management: The current "user explicitly accepts a new baseline" model means legitimate page changes cause alerts until you update the baseline. Automatic baseline updating with a confirmation step would reduce friction.

This became GrabDiff

After building all of this, I packaged it into GrabDiff. Visual diff monitoring, SSL monitoring, domain expiry, cron heartbeats. Free plan covers three monitors. If you want more monitors or faster intervals, there are paid plans.

If you want to run your own version of this, everything I've described above is implementable in a few weekends. The chromedp library is solid, River is excellent for job scheduling, and B2 is cheap enough that storage costs are negligible for a personal project.

But if you'd rather not run the infra, GrabDiff is there.
