Why Default Logging Isn't Enough
Console logs and basic uptime checks tell you a service is alive. They don't tell you it's slow, leaking memory, or that one specific endpoint has a 20% error rate. That gap is where Prometheus and Grafana come in.
- Logs are text: hard to query trends, percentiles, or rates across services
- Uptime checks only fire when a service is completely dead, not when it's degraded
- Without histograms you can't see p95/p99 latency, and averages hide tail latency problems
- Cross-service correlation is impossible without a shared metrics store
Prometheus gives you a time-series metrics store. Grafana gives you dashboards and alerts on top of it. Together they answer: what is my service doing right now, and when did it start going wrong?
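To make the point about averages concrete, here is a minimal sketch with made-up latency samples. The numbers and the helper names mean and percentile are illustrative only and do not come from any library. Ninety fast requests and ten slow ones produce a mean that looks tolerable, while the p95 shows that one request in ten takes two seconds.

```typescript
// Illustrative latency samples in milliseconds: 90 fast requests, 10 slow ones.
const samples: number[] = [
  ...new Array<number>(90).fill(50),
  ...new Array<number>(10).fill(2000),
];

// Arithmetic mean of the samples.
function mean(values: number[]): number {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Nearest-rank percentile: the smallest value with at least p% of samples at or below it.
function percentile(values: number[], p: number): number {
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

console.log(mean(samples));           // 245
console.log(percentile(samples, 50)); // 50
console.log(percentile(samples, 95)); // 2000
```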
Architecture Overview
The full stack is three components alongside your services:
┌──────────────┐   ┌──────────────┐   ┌──────────────┐
│  Service A   │   │  Service B   │   │  Service C   │
│ :3000/metrics│   │ :3000/metrics│   │ :3000/metrics│
└──────┬───────┘   └──────┬───────┘   └──────┬───────┘
       │                  │                  │
       └──────────────────┼──────────────────┘
                          ▼
                 ┌──────────────────┐
                 │    Prometheus    │  scrapes /metrics every 15s
                 │      :9090       │  stores time-series data
                 └────────┬─────────┘
                          │
              ┌───────────┴───────────┐
              ▼                       ▼
        ┌────────────┐       ┌──────────────────┐
        │  Grafana   │       │   Alertmanager   │
        │   :3000    │       │      :9093       │
        └────────────┘       └────────┬─────────┘
                                      │
                              Slack / PagerDuty
Each Node.js service exposes a /metrics endpoint in Prometheus text format. Prometheus scrapes all services on a regular interval and stores the data. Grafana queries Prometheus for dashboards. Alertmanager handles routing: warnings go to Slack, critical alerts page on-call.
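For orientation, each sample line in the text format is a metric name, an optional label set, and a value. The sketch below renders one such line by hand. prom-client produces this output for you in a real service, and formatSample is a hypothetical helper used only to show the shape of the format.

```typescript
// Hypothetical helper: renders one sample line in the Prometheus text exposition format.
// In a real service prom-client generates this; the function exists only for illustration.
function formatSample(
  name: string,
  labels: Record<string, string>,
  value: number,
): string {
  const pairs = Object.entries(labels).map(([key, raw]) => {
    // Label values escape backslash, double quote, and newline.
    const escaped = raw
      .replace(/\\/g, "\\\\")
      .replace(/"/g, '\\"')
      .replace(/\n/g, "\\n");
    return `${key}="${escaped}"`;
  });
  const labelSet = pairs.length > 0 ? `{${pairs.join(",")}}` : "";
  return `${name}${labelSet} ${value}`;
}

console.log(
  formatSample(
    "http_requests_total",
    { method: "GET", route: "/users/:id", status_code: "200" },
    42,
  ),
);
// http_requests_total{method="GET",route="/users/:id",status_code="200"} 42
```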
Step 1 — Instrument Your Node.js Service
Install the official Prometheus client:
npm install prom-client
Create a metrics module. The default metrics give you Node.js process stats for free: CPU usage, heap, event loop lag, GC pauses. Add custom counters and histograms for your HTTP layer on top:
// metrics.ts
import client from "prom-client";

const register = new client.Registry();

// built-in: CPU, memory, event loop lag, GC
client.collectDefaultMetrics({ register, prefix: "node_" });

// custom: HTTP request counter
export const httpRequestsTotal = new client.Counter({
  name: "http_requests_total",
  help: "Total number of HTTP requests",
  labelNames: ["method", "route", "status_code"],
  registers: [register],
});

// custom: request latency histogram
export const httpRequestDuration = new client.Histogram({
  name: "http_request_duration_seconds",
  help: "HTTP request latency in seconds",
  labelNames: ["method", "route", "status_code"],
  buckets: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5],
  registers: [register],
});

export { register };
Wire it into your Express app with a middleware and expose the scrape endpoint:
// app.ts
import express from "express";
import { register, httpRequestsTotal, httpRequestDuration } from "./metrics";

const app = express();

app.use((req, res, next) => {
  // start the latency timer as soon as the request enters the app
  const end = httpRequestDuration.startTimer();
  res.on("finish", () => {
    const labels = {
      method: req.method,
      route: req.route?.path ?? req.path,
      status_code: String(res.statusCode),
    };
    httpRequestsTotal.inc(labels);
    end(labels);
  });
  next();
});

// Prometheus scrape endpoint
app.get("/metrics", async (_req, res) => {
  res.set("Content-Type", register.contentType);
  res.send(await register.metrics());
});

app.listen(3000);
The /metrics endpoint should not be publicly reachable. Keep it on an internal network and ensure only the Prometheus server can reach it, not your public load balancer.
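One caveat with the middleware above: falling back to req.path for requests that match no route means every distinct URL, including scanner noise, becomes its own label value, and each label combination is stored as a separate time series. A minimal sketch of one way to bound that, assuming a hypothetical routeLabel helper and a fixed "unmatched" placeholder. The RequestLike type is also an assumption and only mirrors the Express request fields the helper reads.

```typescript
// Structural subset of an Express request; only the fields this helper reads.
interface RequestLike {
  baseUrl?: string;
  route?: { path?: string };
}

// Hypothetical helper: returns the matched route pattern, or a fixed placeholder
// so that unmatched URLs do not each create a new time series.
function routeLabel(req: RequestLike): string {
  const pattern = req.route?.path;
  if (typeof pattern !== "string") {
    return "unmatched";
  }
  // baseUrl is the mount point when the handler lives on a nested router.
  return `${req.baseUrl ?? ""}${pattern}`;
}

console.log(routeLabel({ baseUrl: "/api", route: { path: "/users/:id" } })); // /api/users/:id
console.log(routeLabel({ route: { path: "/health" } }));                     // /health
console.log(routeLabel({}));                                                 // unmatched
```

In the middleware, route: routeLabel(req) would take the place of the req.route?.path ?? req.path expression.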
Step 2 — Configure Prometheus
Prometheus is configured with a single YAML file that tells it what to scrape and how often. The simplest setup uses static targets — you list each service by hostname or IP:
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]

rule_files:
  - "alerts.yml"

scrape_configs:
  - job_name: "service-a"
    static_configs:
      - targets: ["service-a:3000"]
    metrics_path: /metrics
  - job_name: "service-b"
    static_configs:
      - targets: ["service-b:3000"]
    metrics_path: /metrics
Run Prometheus with Docker alongside your services:
# docker-compose.yml (relevant excerpt)
services:
  prometheus:
    image: prom/prometheus:v2.51.0
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./alerts.yml:/etc/prometheus/alerts.yml
      - prometheus-data:/prometheus
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--storage.tsdb.path=/prometheus"
      - "--storage.tsdb.retention.time=15d"

volumes:
  prometheus-data:
For dynamic environments where services come and go (Kubernetes, Nomad, Consul), replace static_configs with service discovery. Prometheus has native integrations for all of them.
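Kubernetes, Consul, and Nomad discovery each have their own configuration blocks. As a simpler middle ground, and a different mechanism from the ones named above, Prometheus also supports file-based discovery through file_sd_configs, which reads target groups from JSON or YAML files and picks up changes without a restart. A minimal sketch of generating such a file follows. The service inventory, the service label, and the buildTargetGroups name are all assumptions for illustration.

```typescript
// Shape of one target group in a file_sd JSON file: a list of host:port targets
// plus optional labels attached to every series scraped from them.
interface TargetGroup {
  targets: string[];
  labels?: Record<string, string>;
}

// Hypothetical inventory; in practice this might come from a deploy tool or registry.
const services = [
  { name: "service-a", hosts: ["service-a:3000"] },
  { name: "service-b", hosts: ["service-b-1:3000", "service-b-2:3000"] },
];

// Builds one target group per service and tags it with an illustrative "service" label.
function buildTargetGroups(
  inventory: { name: string; hosts: string[] }[],
): TargetGroup[] {
  return inventory.map((svc) => ({
    targets: svc.hosts,
    labels: { service: svc.name },
  }));
}

// Write this string to a file that prometheus.yml references under file_sd_configs.
const targetsJson = JSON.stringify(buildTargetGroups(services), null, 2);
console.log(targetsJson);
```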
Step 3 — Define Alert Rules
Alert rules live in alerts.yml. These are the four I add to every Node.js service from day one:
# alerts.yml
groups:
  - name: nodejs-service
    rules:
      # Service is completely unreachable
      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.job }} is down"
          description: "Prometheus cannot scrape {{ $labels.job }}. Check if the process is running."

      # p95 latency above 500ms sustained for 5 minutes, aggregated per job
      - alert: HighLatency
        expr: |
          histogram_quantile(0.95,
            sum by (le, job) (rate(http_request_duration_seconds_bucket[5m]))
          ) > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High p95 latency on {{ $labels.job }}"
          description: "p95 latency is {{ $value | humanizeDuration }} (threshold is 500ms)"

      # More than 5% of requests returning 5xx.
      # Both sides are summed per job: dividing the raw series would only pair each
      # 5xx series with itself and always yield 1.
      - alert: HighErrorRate
        expr: |
          sum by (job) (rate(http_requests_total{status_code=~"5.."}[5m]))
            /
          sum by (job) (rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate on {{ $labels.job }}"
          description: "{{ $value | humanizePercentage }} of requests are 5xx errors"

      # Heap used above 80% of V8's currently allocated heap for 10 minutes.
      # heap_size_total grows on demand and is not the configured limit, so treat
      # this as a leak hint and tune the threshold for your service.
      - alert: HighHeapUsage
        expr: |
          node_nodejs_heap_size_used_bytes
            /
          node_nodejs_heap_size_total_bytes > 0.8
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High heap on {{ $labels.job }}"
          description: "Heap is at {{ $value | humanizePercentage }}. Investigate a memory leak or scale out."
Step 4 — Configure Alertmanager
Alertmanager receives firing alerts from Prometheus and routes them. Warnings go to Slack, critical alerts go to PagerDuty:
# alertmanager.yml
global:
  resolve_timeout: 5m
  slack_api_url: "https://hooks.slack.com/services/YOUR/WEBHOOK/URL"

route:
  group_by: ["alertname", "job"]
  group_wait: 30s # wait 30s to batch related alerts
  group_interval: 5m
  repeat_interval: 4h
  receiver: slack-warnings
  routes:
    # matchers replaces the deprecated match field
    - matchers:
        - severity="critical"
      receiver: pagerduty-critical

receivers:
  - name: slack-warnings
    slack_configs:
      - channel: "#alerts-infra"
        title: "⚠️ {{ .GroupLabels.alertname }}"
        text: "{{ range .Alerts }}{{ .Annotations.description }}\n{{ end }}"
        send_resolved: true
  - name: pagerduty-critical
    pagerduty_configs:
      - routing_key: "YOUR_PAGERDUTY_INTEGRATION_KEY"
        description: "{{ .GroupLabels.alertname }}: {{ .CommonAnnotations.summary }}"
        send_resolved: true
Run Alertmanager alongside Prometheus:
# docker-compose.yml (excerpt)
  alertmanager:
    image: prom/alertmanager:v0.27.0
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
Step 5 — Set Up Grafana
Run Grafana and provision Prometheus as a data source via config file so it persists across container restarts:
# grafana/provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    url: http://prometheus:9090
    isDefault: true
    editable: false
# docker-compose.yml (excerpt)
  grafana:
    image: grafana/grafana:10.4.0
    ports:
      - "3001:3000"
    volumes:
      - ./grafana/provisioning:/etc/grafana/provisioning
      - grafana-data:/var/lib/grafana
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=changeme
The three dashboards I provision for every Node.js deployment:
- Service Health Overview: up/down per service, request rate, error rate, p50/p95/p99 latency
- Node.js Runtime: heap used vs total, event loop lag, GC pause duration, active handles
- Infrastructure: CPU and memory per host or container
Import dashboard ID 11159 (Node.js Application Dashboard) from grafana.com as a starting point. It is built around the default prom-client metric names, so with the node_ prefix set in metrics.ts you may need to adjust the panel queries or drop the prefix.
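The latency panels and the HighLatency alert both depend on histogram_quantile, which estimates a percentile from cumulative bucket counts rather than from raw samples. The sketch below mirrors that idea with linear interpolation inside the matching bucket. It is a simplified illustration and not Prometheus's actual implementation; the bucket counts are made up and estimateQuantile is a hypothetical name.

```typescript
// One cumulative histogram bucket: count of observations <= le (upper bound, seconds).
interface Bucket {
  le: number; // use Infinity for the +Inf bucket
  count: number;
}

// Simplified quantile estimate over cumulative buckets sorted by ascending `le`.
// Assumes at least one finite bucket ahead of the +Inf bucket.
function estimateQuantile(q: number, buckets: Bucket[]): number {
  const total = buckets[buckets.length - 1].count;
  const rank = q * total;
  // First bucket whose cumulative count reaches the target rank.
  const i = buckets.findIndex((b) => b.count >= rank);
  if (i === -1) {
    return NaN;
  }
  if (buckets[i].le === Infinity) {
    // Past the last finite boundary: the best available answer is that boundary.
    return buckets[i - 1].le;
  }
  const lowerBound = i === 0 ? 0 : buckets[i - 1].le;
  const lowerCount = i === 0 ? 0 : buckets[i - 1].count;
  const fraction = (rank - lowerCount) / (buckets[i].count - lowerCount);
  return lowerBound + (buckets[i].le - lowerBound) * fraction;
}

// Made-up counts using a subset of the bucket boundaries from metrics.ts.
const buckets: Bucket[] = [
  { le: 0.1, count: 800 },
  { le: 0.25, count: 900 },
  { le: 0.5, count: 960 },
  { le: 1, count: 990 },
  { le: Infinity, count: 1000 },
];

// Rank 950 falls in the (0.25, 0.5] bucket, so the estimate lands between those bounds.
console.log(estimateQuantile(0.95, buckets).toFixed(3));
```

The estimate can only be as precise as the bucket boundaries allow, which is one reason it helps that 0.5 is a boundary when the alert threshold is 500ms.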
Key PromQL Queries to Bookmark
These are the queries I reach for first when investigating an incident:
# Requests per second
rate(http_requests_total[1m])

# Error rate as a percentage (both sides summed per job so the division matches)
sum by (job) (rate(http_requests_total{status_code=~"5.."}[5m]))
  / sum by (job) (rate(http_requests_total[5m])) * 100

# p95 latency per job
histogram_quantile(0.95, sum by (le, job) (rate(http_request_duration_seconds_bucket[5m])))

# Event loop lag: a high value means the thread is blocked
node_nodejs_eventloop_lag_seconds

# Heap usage percentage
node_nodejs_heap_size_used_bytes / node_nodejs_heap_size_total_bytes * 100

# Active libuv handles (sockets, timers), a rough proxy for open connections;
# prom-client's default metrics do not include an HTTP connection gauge
node_nodejs_active_handles_total
Production Tips
- Use a named Docker volume for Prometheus data; anonymous volumes are easy to lose when the container is removed
- Set scrape_interval to 15s for most services; drop to 5s only for critical paths (payments, auth) where faster alerting is worth the extra storage
- Add a silence in Alertmanager before planned deployments to prevent alert fatigue from expected restarts
- Store Slack webhook URLs and PagerDuty keys in environment variables or a secrets manager, not hardcoded in alertmanager.yml
- Enable Prometheus remote_write to a long-term store (Thanos, VictoriaMetrics) if you need more than 15 days of history for capacity planning
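The storage cost of a shorter scrape interval is easy to estimate. A rough sketch, assuming a hypothetical service that exposes 500 time series and about 2 bytes per stored sample (a commonly quoted ballpark for Prometheus's compressed storage, not a measured figure for this setup):

```typescript
const SECONDS_PER_DAY = 86_400;

// Samples stored per day for a given number of series and scrape interval.
function samplesPerDay(seriesCount: number, scrapeIntervalSeconds: number): number {
  return seriesCount * (SECONDS_PER_DAY / scrapeIntervalSeconds);
}

// Approximate bytes on disk over a retention window.
function approxBytes(
  seriesCount: number,
  scrapeIntervalSeconds: number,
  retentionDays: number,
  bytesPerSample: number,
): number {
  return samplesPerDay(seriesCount, scrapeIntervalSeconds) * retentionDays * bytesPerSample;
}

// Assumed inputs: 500 series, 15 days of retention, 2 bytes per sample.
const at15s = approxBytes(500, 15, 15, 2);
const at5s = approxBytes(500, 5, 15, 2);

console.log(samplesPerDay(500, 15)); // 2880000
console.log(samplesPerDay(500, 5));  // 8640000
console.log(at5s / at15s);           // 3
```

Under these assumptions the 5s interval stores three times as many samples, so the trade-off is linear in the interval.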
Wrapping Up
The full stack — prom-client in each service, Prometheus scraping them, Alertmanager routing to Slack and PagerDuty, Grafana for dashboards — runs entirely in Docker and takes half a day to set up the first time. After that, onboarding a new service means adding one job to prometheus.yml and dropping the metrics middleware into the new app.
The payoff: instead of finding out a service is broken when a user reports it, you get paged once the error rate has stayed above 5% for five minutes, or a Slack warning when heap usage keeps climbing, and the Grafana dashboards show exactly when it started.