To monitor health endpoint checks well, poll an endpoint that proves the app can serve real traffic, not just return a 200. A good setup verifies dependency health, uses a fast timeout, separates liveness from readiness, and alerts only after consecutive failures. That catches real outages earlier and avoids noisy alerts during deploys, restarts, and brief cold starts.
Teams often get this wrong by watching /health and assuming they are covered. In practice, many health routes keep returning success while the database is down, the queue is stuck, or the app cannot complete requests. If you need a broader incident setup, start with this API health monitoring guide.
What the endpoint should prove
A useful health check should answer one question: can this service handle user traffic right now?
That usually means your endpoint should validate only the dependencies required for the request path you care about. If login depends on the session store and database, test those. If a public landing page only needs the web process, keep that check lighter.
Common things a health route should verify:
- web process responds within a small timeout
- database connection works with a cheap query
- cache or session store is reachable, if required
- queue backlog is acceptable, if background jobs are critical
- response body is structured, not just a blank 200
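The checklist above can be sketched as a small aggregator that runs each dependency probe under its own timeout and reports per-component status. This is a minimal sketch, not a framework requirement: `checkDb` and `checkCache` stand in for whatever probes your stack actually needs.

```javascript
// Race a probe against a timeout so one slow dependency
// cannot hang the whole health response.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error('timeout')), ms);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Run every named check, collect "ok"/"fail" per component,
// and report overall status only if all checks pass.
async function runChecks(checks, timeoutMs = 1000) {
  const results = {};
  for (const [name, fn] of Object.entries(checks)) {
    try {
      await withTimeout(fn(), timeoutMs);
      results[name] = 'ok';
    } catch {
      results[name] = 'fail';
    }
  }
  const healthy = Object.values(results).every((s) => s === 'ok');
  return { status: healthy ? 'ok' : 'fail', checks: results };
}
```

Wire this into your health route with the real probes for your app, such as `() => db.query('select 1')`.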
The biggest failure pattern is a shallow endpoint that always says "ok". We see this often in generated apps and rushed launches. The route exists, monitoring is green, but the app is effectively down because a required secret, database connection, or upstream service failed after deploy.
A better response includes a clear status and minimal component detail:
```json
{
  "status": "ok",
  "checks": {
    "db": "ok",
    "cache": "ok"
  },
  "version": "2026.04.20"
}
```

Keep the body small. Do not leak stack traces, environment names, or secrets. If you also want to catch exposed config and unsafe routes, run a security scan before launch.
When to monitor the health endpoint
You should poll the route from outside your stack, not only from inside the cluster or host. External monitoring finds DNS issues, TLS failures, bad deploys, broken routing, and WAF mistakes that internal checks miss.
Use these practical defaults for most SaaS apps:
- Check every 30 to 60 seconds for critical services.
- Set timeout to 2 to 5 seconds.
- Alert after three misses, not one.
- Recover only after two consecutive passes.
- Track both status code and response content.
Those thresholds cut down false positives during short restarts. They also avoid missing real incidents that last only a few minutes.
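Those defaults translate into a small alert state machine: three consecutive misses to trip, two consecutive passes to recover. A hedged sketch of just that logic, with the probe transport (HTTP client, timeout handling) left to your monitoring tool:

```javascript
// Alert state machine: trips after `failThreshold` consecutive
// failed probes, clears after `passThreshold` consecutive passes.
function createAlertState({ failThreshold = 3, passThreshold = 2 } = {}) {
  let fails = 0;
  let passes = 0;
  let alerting = false;

  // Feed each probe result (true = pass); returns current alert state.
  return function record(ok) {
    if (ok) {
      passes += 1;
      fails = 0;
      if (alerting && passes >= passThreshold) alerting = false;
    } else {
      fails += 1;
      passes = 0;
      if (!alerting && fails >= failThreshold) alerting = true;
    }
    return alerting;
  };
}
```

With these defaults, a single failed probe during a deploy never pages anyone, while a sustained outage alerts within three polling intervals.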
Where teams go wrong:
- polling too slowly, so outages are discovered late
- polling too often, creating avoidable load
- alerting on one failed probe during deploys
- checking only status code, not body content
- placing the route behind login or bot protection
If your service has customer-facing flows beyond a simple status route, pair this with external monitoring and synthetic monitoring basics. A green health check does not prove signup, login, or checkout still works.
Separate liveness from readiness
A strong setup uses two endpoints with different jobs: liveness and readiness.
A liveness route answers, "should this process be restarted?" It should stay simple. If the app event loop is alive and can answer quickly, return success. Do not put fragile dependency checks here, or orchestration will keep killing healthy processes during temporary upstream issues.
A readiness route answers, "should traffic be sent here?" This is where you test the dependencies required to serve requests. If the database is unavailable, readiness should fail even if the process is still running.
That split prevents a common outage loop:
- app loses database access
- single health route fails
- orchestrator restarts pods repeatedly
- recovery gets slower because instances never stabilize
Use shallow checks for liveness and deep checks for readiness. For public uptime alerts, watch readiness from outside the network. For platform self-healing, use liveness internally.
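The shallow-versus-deep split can be sketched as two separate handlers. This is an illustrative shape, not a prescribed API: `checkDb` is a hypothetical probe for your required dependency, and the handlers return plain objects so you can mount them on whatever framework you use.

```javascript
// Liveness: shallow on purpose. If this handler runs at all,
// the event loop is alive, so restarting the process won't help.
function livenessHandler() {
  return { status: 200, body: { alive: true } };
}

// Readiness: deep check of the dependencies needed to serve
// traffic. Failing here pulls the instance out of rotation
// without triggering a restart loop.
async function readinessHandler(checkDb) {
  try {
    await checkDb();
    return { status: 200, body: { status: 'ok', db: 'ok' } };
  } catch {
    return { status: 503, body: { status: 'fail', db: 'down' } };
  }
}
```

Note the asymmetry: a database outage makes readiness fail while liveness keeps passing, so the orchestrator stops routing traffic but does not kill healthy processes.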
If you are auditing an AI-built app, this split is worth checking manually. Generated code often ships with one generic /health route that is too shallow for operations and too deep for safe restarts. A more complete deep scan helps catch those mismatches before production traffic hits them.
Build a simple check
Your first version does not need to be complex. It needs to be predictable, cheap, and tied to a real dependency.
Here is a small Node example for a readiness route:
```javascript
app.get('/ready', async (req, res) => {
  try {
    await db.query('select 1');
    res.status(200).json({ status: 'ok', db: 'ok' });
  } catch {
    res.status(503).json({ status: 'fail', db: 'down' });
  }
});
```

A few implementation rules matter more than framework choice:
- keep the route auth-free for your monitoring system
- return 503 on failure, not 200 with an error string
- make dependency checks cheap and deterministic
- include an expected body so monitors can validate content
- avoid checking every third-party API on each probe
Third-party checks deserve extra care. If your app calls an external email or payment provider, do not fail readiness every time that provider slows down unless requests truly cannot be served. Otherwise you create alert storms for a problem users may never notice.
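One hedged way to express that distinction is to split checks into required and optional sets, where an optional failure is reported as degraded but does not flip the status code. `checkEmailProvider` here is a hypothetical probe for an external service:

```javascript
// Required checks gate readiness; optional checks are reported
// as "degraded" without failing the endpoint.
async function readinessWithOptional(required, optional) {
  const body = { status: 'ok', checks: {} };
  for (const [name, fn] of Object.entries(required)) {
    try {
      await fn();
      body.checks[name] = 'ok';
    } catch {
      body.checks[name] = 'fail';
      body.status = 'fail';
    }
  }
  for (const [name, fn] of Object.entries(optional)) {
    try {
      await fn();
      body.checks[name] = 'ok';
    } catch {
      body.checks[name] = 'degraded'; // visible, but not fatal
    }
  }
  return { code: body.status === 'ok' ? 200 : 503, body };
}
```

A slow email provider then shows up in the response body for operators to see, while the load balancer keeps sending traffic.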
For richer reliability coverage, add flow checks for login, signup, or payments. These complement status polling well. Relevant setups include critical flow monitoring when you need to verify that the app still works end to end.
Alert on failure patterns
The most effective alerting is based on patterns, not single events. A single timeout can be noise. Repeated misses across regions or intervals usually mean a real incident.
Good alert logic usually combines:
- probe failures over a short window
- latency spikes above a threshold
- content mismatch in the response body
- regional spread, if you monitor from multiple locations
A concrete example: alert when the readiness route fails 3 out of 5 checks, or when latency stays above 4 seconds for 10 minutes. That is specific enough to wake someone for a real problem, not a transient hiccup.
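The "3 out of 5" rule from that example reduces to a sliding-window count. A minimal sketch, assuming your monitor records each probe as a boolean pass/fail:

```javascript
// Returns true when at least `failLimit` of the last
// `windowSize` probe results are failures.
function failsInWindow(results, windowSize = 5, failLimit = 3) {
  const recent = results.slice(-windowSize);
  const fails = recent.filter((ok) => !ok).length;
  return fails >= failLimit;
}
```

A window rule tolerates an isolated timeout in an otherwise healthy run, while consecutive-only rules can miss a flapping service that fails every other probe.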
Also review these operational details:
- Do deploys briefly return 503, and is that expected?
- Does the monitor follow redirects by mistake?
- Is a CDN caching the health response?
- Are firewalls or bot rules blocking probes?
These small configuration issues cause a surprising number of blind spots. The best monitors are boring, stable, and easy to reason about.
A health route is only useful if it reflects service reality. Keep it small, test the dependencies that matter, and alert on patterns that signal actual customer impact.
FAQ
What should a health endpoint return?
A health route should return a clear status code and a small JSON body. Use 200 when the service is ready and 503 when it is not. Include only the checks needed to prove the app can handle traffic, such as database or cache status, without exposing sensitive internals.
How often should I check a health endpoint?
For most production apps, poll every 30 to 60 seconds with a 2 to 5 second timeout. Alert after three consecutive failures instead of one. That is usually fast enough to catch outages early while filtering noise from deploys, container restarts, and short network hiccups.
Is a health endpoint enough for uptime monitoring?
No. A status route tells you whether the service appears ready, but it does not prove real user flows still work. Pair it with external checks for login, signup, APIs, or payments. That combination finds more customer-impacting incidents than endpoint polling alone.
If you want a quick review before release, AISHIPSAFE can check for exposed risks and weak health check setups with a lightweight scan.