Monitoring Node.js Microservices with Prometheus and Grafana

Why Default Logging Isn't Enough

Console logs and basic uptime checks tell you a service is alive. They don't tell you it's slow, leaking memory, or that one specific endpoint has a 20% error rate. That gap is where Prometheus and Grafana come in.

▸Logs are text — hard to query trends, percentiles, or rates across services
▸Uptime checks only fire when a service is completely dead, not when it's degraded
▸Without histograms you can't see p95/p99 latency — averages hide tail latency problems
▸Cross-service correlation is impossible without a shared metrics store
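
To make the point about averages concrete, here is a minimal sketch with made-up latencies (90 fast requests, 10 slow ones). The sample data and the nearest-rank percentile helper are illustrative assumptions, not part of any library:

```typescript
// Nearest-rank percentile over a sorted copy of the samples.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.max(Math.ceil((p / 100) * sorted.length), 1);
  return sorted[rank - 1];
}

function mean(samples: number[]): number {
  return samples.reduce((sum, v) => sum + v, 0) / samples.length;
}

// Hypothetical latencies in milliseconds: 90 requests at 20 ms, 10 at 2000 ms.
const latenciesMs: number[] = [
  ...Array.from({ length: 90 }, () => 20),
  ...Array.from({ length: 10 }, () => 2000),
];

const avgMs = mean(latenciesMs);
const p50Ms = percentile(latenciesMs, 50);
const p95Ms = percentile(latenciesMs, 95);

// The average (218 ms) matches no real request: half finish in 20 ms,
// while the slowest 10% take 2 seconds, which only the p95 reveals.
console.log({ avgMs, p50Ms, p95Ms });
```

An average of 218 ms looks acceptable on a dashboard, yet one request in ten takes two full seconds. This is the kind of problem a histogram surfaces and an average hides.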

Prometheus gives you a time-series metrics store. Grafana gives you dashboards and alerts on top of it. Together they answer: what is my service doing right now, and when did it start going wrong?

Architecture Overview

The full stack is three components alongside your services:

  ┌──────────────┐   ┌──────────────┐   ┌──────────────┐
  │  Service A   │   │  Service B   │   │  Service C   │
  │ :3000/metrics│   │ :3000/metrics│   │ :3000/metrics│
  └──────┬───────┘   └──────┬───────┘   └──────┬───────┘
         │                  │                   │
         └──────────────────┼───────────────────┘
                            ▼
                  ┌──────────────────┐
                  │    Prometheus    │  scrapes /metrics every 15s
                  │    :9090         │  stores time-series data
                  └────────┬─────────┘
                           │
               ┌───────────┴───────────┐
               ▼                       ▼
        ┌────────────┐       ┌──────────────────┐
        │   Grafana  │       │  Alertmanager    │
        │   :3000    │       │  :9093           │
        └────────────┘       └────────┬─────────┘
                                      │
                               Slack / PagerDuty

Each Node.js service exposes a /metrics endpoint in Prometheus text format. Prometheus scrapes all services on a regular interval and stores the data. Grafana queries Prometheus for dashboards. Alertmanager handles routing — warnings go to Slack, critical alerts page on-call.
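
For a feel of what "Prometheus text format" means, the sketch below renders a counter by hand. In practice prom-client produces this output for you; the renderCounter and escapeLabelValue helpers and the sample values are hypothetical and only illustrate the shape of the payload:

```typescript
type Labels = Record<string, string>;

interface CounterSample {
  labels: Labels;
  value: number;
}

// Label values escape backslash, double quote and newline in the text format.
function escapeLabelValue(v: string): string {
  return v.replace(/\\/g, "\\\\").replace(/"/g, '\\"').replace(/\n/g, "\\n");
}

// Renders one counter family: HELP line, TYPE line, then one line per label set.
function renderCounter(metricName: string, help: string, samples: CounterSample[]): string {
  const lines = [`# HELP ${metricName} ${help}`, `# TYPE ${metricName} counter`];
  for (const s of samples) {
    const pairs = Object.entries(s.labels)
      .map(([k, v]) => `${k}="${escapeLabelValue(v)}"`)
      .join(",");
    lines.push(pairs ? `${metricName}{${pairs}} ${s.value}` : `${metricName} ${s.value}`);
  }
  return lines.join("\n") + "\n";
}

// Made-up sample values for illustration.
const exposition = renderCounter("http_requests_total", "Total number of HTTP requests", [
  { labels: { method: "GET", route: "/users/:id", status_code: "200" }, value: 1027 },
  { labels: { method: "POST", route: "/orders", status_code: "500" }, value: 3 },
]);
console.log(exposition);
```

Each unique combination of label values becomes its own line, and therefore its own time series once Prometheus scrapes it.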

Step 1 — Instrument Your Node.js Service

Install the official Prometheus client:

npm install prom-client

Create a metrics module. The default metrics give you Node.js process stats for free — CPU usage, heap, event loop lag, GC pauses. Add custom counters and histograms for your HTTP layer on top:

// metrics.ts
import client from "prom-client";

const register = new client.Registry();

// built-in: CPU, memory, event loop lag, GC
client.collectDefaultMetrics({ register, prefix: "node_" });

// custom: HTTP request counter
export const httpRequestsTotal = new client.Counter({
  name: "http_requests_total",
  help: "Total number of HTTP requests",
  labelNames: ["method", "route", "status_code"],
  registers: [register],
});

// custom: request latency histogram
export const httpRequestDuration = new client.Histogram({
  name: "http_request_duration_seconds",
  help: "HTTP request latency in seconds",
  labelNames: ["method", "route", "status_code"],
  buckets: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5],
  registers: [register],
});

export { register };
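
It helps to know how those buckets behave before querying them. The following is a simplified model of a cumulative histogram, not prom-client's actual implementation; HistogramState, newHistogram and observe are names invented for this sketch:

```typescript
// Same bucket bounds (in seconds) as the histogram defined in metrics.ts.
const bucketBounds = [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5];

interface HistogramState {
  counts: number[]; // cumulative count per bound, plus a final +Inf bucket
  sum: number;      // sum of all observed values
  count: number;    // number of observations
}

function newHistogram(bounds: number[]): HistogramState {
  return { counts: new Array<number>(bounds.length + 1).fill(0), sum: 0, count: 0 };
}

function observe(h: HistogramState, bounds: number[], seconds: number): void {
  // Buckets are cumulative: an observation increments every bucket
  // whose upper bound (le) is greater than or equal to the value.
  bounds.forEach((le, i) => {
    if (seconds <= le) h.counts[i] += 1;
  });
  h.counts[bounds.length] += 1; // the +Inf bucket counts everything
  h.sum += seconds;
  h.count += 1;
}

const demoHistogram = newHistogram(bucketBounds);
for (const seconds of [0.003, 0.03, 0.2, 4]) {
  observe(demoHistogram, bucketBounds, seconds);
}

// The 4 s request only lands in +Inf because the largest finite bound is 2.5 s.
console.log(demoHistogram.counts);
```

Anything slower than the largest finite bucket is indistinguishable from anything else above it, so choose bounds that bracket the latencies you actually care about.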

Wire it into your Express app with a middleware and expose the scrape endpoint:

// app.ts
import express from "express";
import { register, httpRequestsTotal, httpRequestDuration } from "./metrics";

const app = express();

app.use((req, res, next) => {
  const end = httpRequestDuration.startTimer();
  res.on("finish", () => {
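    const labels = {
      method: req.method,
      // Use the matched route pattern, never the raw path: labelling by
      // req.path creates one time series per distinct URL (every /users/123).
      route: req.route?.path ?? "unmatched",
      status_code: String(res.statusCode),
    };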
    httpRequestsTotal.inc(labels);
    end(labels);
  });
  next();
});

// Prometheus scrape endpoint
app.get("/metrics", async (_req, res) => {
  res.set("Content-Type", register.contentType);
  res.send(await register.metrics());
});

app.listen(3000);
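
The route label deserves some care, because every distinct label value creates a new time series. One possible helper is sketched below. RequestLike is a deliberately minimal stand-in for the Express request type, and the "unmatched" placeholder is an assumption of this sketch rather than an Express or prom-client convention:

```typescript
// Minimal structural type: only the fields this helper reads from a request.
interface RequestLike {
  baseUrl?: string;          // mount path of the router, e.g. "/api"
  route?: { path: string };  // set by Express once a route has matched
}

// Returns the matched route pattern (e.g. "/api/users/:id"), never the raw URL,
// so the number of distinct label values stays bounded.
function routeLabel(req: RequestLike): string {
  if (!req.route) return "unmatched"; // 404s and bot scans collapse into one series
  return `${req.baseUrl ?? ""}${req.route.path}`;
}

console.log(routeLabel({ baseUrl: "/api", route: { path: "/users/:id" } }));
console.log(routeLabel({}));
```

In real Express code, route paths can also be regular expressions or arrays, so a production version would need to handle those cases as well.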

SECURITY

The /metrics endpoint should not be publicly reachable. Keep it on an internal network and ensure only the Prometheus server can reach it — not your public load balancer.
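
Network isolation is the primary control. As an optional second layer, the handler itself could refuse callers outside private address space. The check below is a rough sketch under stated assumptions: it covers IPv4 only (loopback and the RFC 1918 ranges), and isInternalIPv4 and parseIPv4 are hypothetical helpers, not Express APIs:

```typescript
// Parses a dotted-quad IPv4 address into four octets, or returns null if malformed.
function parseIPv4(ip: string): number[] | null {
  const parts = ip.split(".");
  if (parts.length !== 4) return null;
  const octets = parts.map((p) => (/^\d{1,3}$/.test(p) ? Number(p) : NaN));
  return octets.every((o) => o >= 0 && o <= 255) ? octets : null;
}

// True for loopback (127/8) and the RFC 1918 private ranges.
function isInternalIPv4(address: string): boolean {
  // Node often reports IPv4 peers as IPv4-mapped IPv6, e.g. "::ffff:10.0.0.5".
  const octets = parseIPv4(address.replace(/^::ffff:/, ""));
  if (!octets) return false;
  const [a, b] = octets;
  return (
    a === 127 ||
    a === 10 ||
    (a === 172 && b >= 16 && b <= 31) ||
    (a === 192 && b === 168)
  );
}

console.log(isInternalIPv4("10.1.2.3"), isInternalIPv4("8.8.8.8"));
```

In the /metrics handler you could call this with the peer address (for example req.socket.remoteAddress) and respond with 403 when it returns false. Behind a reverse proxy the peer address is the proxy's own, so this check does not replace keeping the endpoint off the public load balancer.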

Step 2 — Configure Prometheus

Prometheus is configured with a single YAML file that tells it what to scrape and how often. The simplest setup uses static targets — you list each service by hostname or IP:

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]

rule_files:
  - "alerts.yml"

scrape_configs:
  - job_name: "service-a"
    static_configs:
      - targets: ["service-a:3000"]
    metrics_path: /metrics

  - job_name: "service-b"
    static_configs:
      - targets: ["service-b:3000"]
    metrics_path: /metrics
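
Once the list of services grows, hand-editing one job per service gets repetitive. One option is to generate the scrape_configs section from a list. This is only a sketch: ServiceTarget and renderScrapeConfigs are hypothetical names, service-c is an invented example, and the output should be merged into prometheus.yml by whatever templating step your deployment already uses:

```typescript
interface ServiceTarget {
  job: string;  // becomes job_name
  host: string; // hostname reachable from the Prometheus container
  port: number;
}

// Renders the scrape_configs section in the same shape as the static example above.
function renderScrapeConfigs(services: ServiceTarget[]): string {
  const jobs = services.map((s) =>
    [
      `  - job_name: "${s.job}"`,
      `    static_configs:`,
      `      - targets: ["${s.host}:${s.port}"]`,
      `    metrics_path: /metrics`,
    ].join("\n"),
  );
  return "scrape_configs:\n" + jobs.join("\n\n") + "\n";
}

const scrapeYaml = renderScrapeConfigs([
  { job: "service-a", host: "service-a", port: 3000 },
  { job: "service-b", host: "service-b", port: 3000 },
  { job: "service-c", host: "service-c", port: 3000 }, // hypothetical new service
]);
console.log(scrapeYaml);
```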

Run Prometheus with Docker alongside your services:

# docker-compose.yml (relevant excerpt)
services:
  prometheus:
    image: prom/prometheus:v2.51.0
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./alerts.yml:/etc/prometheus/alerts.yml
      - prometheus-data:/prometheus
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--storage.tsdb.path=/prometheus"
      - "--storage.tsdb.retention.time=15d"

volumes:
  prometheus-data:

TIP

For dynamic environments where services come and go (Kubernetes, Nomad, Consul), replace static_configs with service discovery — Prometheus has native integrations for all of them.

Step 3 — Define Alert Rules

Alert rules live in alerts.yml. These are the four I add to every Node.js service from day one:

# alerts.yml
groups:
  - name: nodejs-service
    rules:

      # Service is completely unreachable
      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.job }} is down"
          description: "Prometheus cannot scrape {{ $labels.job }}. Check if the process is running."

      # p95 latency above 500ms sustained for 5 minutes
      - alert: HighLatency
        expr: |
          histogram_quantile(0.95,
            sum by (le, job) (rate(http_request_duration_seconds_bucket[5m]))
          ) > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High p95 latency on {{ $labels.job }}"
          description: "p95 latency is {{ $value | humanizeDuration }} — threshold is 500ms"

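      # More than 5% of requests returning 5xx.
      # Both sides are summed per job: without aggregation the division only
      # matches series with identical labels, so 5xx series divide by themselves.
      - alert: HighErrorRate
        expr: |
          sum by (job) (rate(http_requests_total{status_code=~"5.."}[5m]))
          /
          sum by (job) (rate(http_requests_total[5m])) > 0.05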
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate on {{ $labels.job }}"
          description: "{{ $value | humanizePercentage }} of requests are 5xx errors"

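      # Heap used above 80% of the allocated heap for 10 minutes.
      # heap_size_total is what V8 has currently reserved, not its hard limit,
      # so treat this as an early signal to investigate, not proof of imminent OOM.
      - alert: HighHeapUsage
        expr: |
          node_nodejs_heap_size_used_bytes
          /
          node_nodejs_heap_size_total_bytes > 0.8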
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High heap on {{ $labels.job }}"
          description: "Heap is at {{ $value | humanizePercentage }} — investigate memory leak or scale out"
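
The for: clause is what keeps these rules from firing on a single bad scrape. The sketch below models how an alert moves between inactive, pending and firing across rule evaluations. It is a simplified model written for this article, not Prometheus source code, and alertStates is a hypothetical helper:

```typescript
type AlertState = "inactive" | "pending" | "firing";

// conditionPerTick holds one entry per rule evaluation: was the expression true?
function alertStates(
  conditionPerTick: boolean[],
  evalIntervalSec: number,
  forSec: number,
): AlertState[] {
  let activeSince: number | null = null;
  return conditionPerTick.map((isTrue, tick): AlertState => {
    const now = tick * evalIntervalSec;
    if (!isTrue) {
      activeSince = null; // any false evaluation resets the timer
      return "inactive";
    }
    if (activeSince === null) activeSince = now;
    return now - activeSince >= forSec ? "firing" : "pending";
  });
}

// evaluation_interval 15s, for: 1m. One false evaluation restarts the wait.
const timeline = alertStates(
  [true, true, false, true, true, true, true, true],
  15,
  60,
);
console.log(timeline);
```

In this run the condition has to hold for four further evaluations after the reset before the last tick reaches "firing". A longer for: trades detection speed for fewer false alarms.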

Step 4 — Configure Alertmanager

Alertmanager receives firing alerts from Prometheus and routes them. Warnings go to Slack, critical alerts go to PagerDuty:

# alertmanager.yml
global:
  resolve_timeout: 5m
  slack_api_url: "https://hooks.slack.com/services/YOUR/WEBHOOK/URL"

route:
  group_by: ["alertname", "job"]
  group_wait: 30s       # wait 30s to batch related alerts
  group_interval: 5m
  repeat_interval: 4h
  receiver: slack-warnings
  routes:
    # matchers replaces the deprecated match/match_re syntax (Alertmanager 0.22+)
    - matchers:
        - severity="critical"
      receiver: pagerduty-critical

receivers:
  - name: slack-warnings
    slack_configs:
      - channel: "#alerts-infra"
        title: "⚠️ {{ .GroupLabels.alertname }}"
        text: "{{ range .Alerts }}{{ .Annotations.description }}\n{{ end }}"
        send_resolved: true

  - name: pagerduty-critical
    pagerduty_configs:
      - routing_key: "YOUR_PAGERDUTY_INTEGRATION_KEY"
        description: "{{ .GroupLabels.alertname }} — {{ .CommonAnnotations.summary }}"
        send_resolved: true
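
The routing logic is easier to reason about as a tree walk. The sketch below is a simplified model of that decision: it supports exact label equality only and ignores Alertmanager's regex matchers and the continue flag. Route, pickReceiver and the equals field are names made up for this illustration:

```typescript
type LabelSet = Record<string, string>;

interface Route {
  receiver: string;
  equals?: LabelSet; // every listed label must equal the alert's label
  routes?: Route[];  // child routes, tried in order
}

function routeMatches(route: Route, labels: LabelSet): boolean {
  return Object.entries(route.equals ?? {}).every(([k, v]) => labels[k] === v);
}

// The first matching child wins; if no child matches, the current node's receiver is used.
function pickReceiver(node: Route, labels: LabelSet): string {
  for (const child of node.routes ?? []) {
    if (routeMatches(child, labels)) return pickReceiver(child, labels);
  }
  return node.receiver;
}

// Mirrors the route block in alertmanager.yml above.
const routingTree: Route = {
  receiver: "slack-warnings",
  routes: [{ equals: { severity: "critical" }, receiver: "pagerduty-critical" }],
};

console.log(pickReceiver(routingTree, { alertname: "ServiceDown", severity: "critical" }));
console.log(pickReceiver(routingTree, { alertname: "HighLatency", severity: "warning" }));
```

Note that the root receiver doubles as the catch-all: an alert with no severity label at all still ends up in Slack rather than being dropped.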

Run Alertmanager alongside Prometheus:

# docker-compose.yml (excerpt)
  alertmanager:
    image: prom/alertmanager:v0.27.0
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml

Step 5 — Set Up Grafana

Run Grafana and provision Prometheus as a data source via config file so it persists across container restarts:

# grafana/provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    url: http://prometheus:9090
    isDefault: true
    editable: false

# docker-compose.yml (excerpt)
  grafana:
    image: grafana/grafana:10.4.0
    ports:
      - "3001:3000"
    volumes:
      - ./grafana/provisioning:/etc/grafana/provisioning
      - grafana-data:/var/lib/grafana
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=changeme

The three dashboards I provision for every Node.js deployment:

▸Service Health Overview — up/down per service, request rate, error rate, p50/p95/p99 latency
▸Node.js Runtime — heap used vs total, event loop lag, GC pause duration, active handles
▸Infrastructure — CPU and memory per host or container

NOTE

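Import dashboard ID 11159 (Node.js Application Dashboard) from grafana.com as a starting point. It is built around the default prom-client metric names, so if you keep the node_ prefix from Step 1 you will likely need to adjust the panel queries (or drop the prefix).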

Key PromQL Queries to Bookmark

These are the queries I reach for first when investigating an incident:

# Requests per second
rate(http_requests_total[1m])

# Error rate as a percentage
sum by (job) (rate(http_requests_total{status_code=~"5.."}[5m]))
/ sum by (job) (rate(http_requests_total[5m])) * 100

# p95 latency
histogram_quantile(0.95, sum by (le, job) (rate(http_request_duration_seconds_bucket[5m])))

# Event loop lag — high value means the thread is blocked
node_nodejs_eventloop_lag_seconds

# Heap usage percentage
node_nodejs_heap_size_used_bytes / node_nodejs_heap_size_total_bytes * 100

# Active libuv handles (sockets, servers, timers): a rough proxy for open connections
node_nodejs_active_handles_total
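
Since histogram_quantile shows up in both the alert and the dashboards, it is worth seeing roughly how it derives a number from bucket counts. The function below is a simplified reimplementation for illustration: it assumes linear interpolation inside the matching bucket and a lower bound of 0 for the first bucket, and it returns NaN for inputs it does not model. The bucket values are invented:

```typescript
interface Bucket {
  le: number;    // upper bound; Infinity for the +Inf bucket
  count: number; // cumulative count (or rate) of observations <= le
}

function histogramQuantile(q: number, buckets: Bucket[]): number {
  if (q < 0 || q > 1 || buckets.length < 2) return NaN; // cases this sketch does not model
  const sorted = [...buckets].sort((a, b) => a.le - b.le);
  const total = sorted[sorted.length - 1].count;
  if (total === 0) return NaN;

  const rank = q * total;
  const i = sorted.findIndex((b) => b.count >= rank);

  // A quantile that falls in +Inf is reported as the largest finite bound.
  if (sorted[i].le === Infinity) return sorted[sorted.length - 2].le;

  const lower = i === 0 ? 0 : sorted[i - 1].le;
  const below = i === 0 ? 0 : sorted[i - 1].count;
  // Assume observations are spread evenly within the bucket.
  return lower + (sorted[i].le - lower) * ((rank - below) / (sorted[i].count - below));
}

// Made-up cumulative counts for five buckets.
const sampleBuckets: Bucket[] = [
  { le: 0.1, count: 50 },
  { le: 0.25, count: 80 },
  { le: 0.5, count: 95 },
  { le: 1, count: 99 },
  { le: Infinity, count: 100 },
];

console.log(histogramQuantile(0.9, sampleBuckets));   // interpolated inside the 0.25 to 0.5 bucket
console.log(histogramQuantile(0.995, sampleBuckets)); // capped at the largest finite bound, 1
```

Two practical consequences follow: the result is an estimate whose precision depends on bucket width, and it can never exceed the largest finite bucket. With the buckets from Step 1, a reported p99 of 2.5 means "at least 2.5 seconds".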

Production Tips

▸Use a named Docker volume for Prometheus data — anonymous volumes are wiped when the container is removed
▸Set scrape_interval to 15s for most services; drop to 5s only for critical paths (payments, auth) where faster alerting is worth the extra storage
▸Add a silence in Alertmanager before planned deployments — prevents alert fatigue from expected restarts
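▸Keep Slack webhook URLs and PagerDuty keys out of the alertmanager.yml you commit. Alertmanager does not expand environment variables in its config, so either render the file from a template at deploy time or point the file-based options (slack_api_url_file, routing_key_file) at secrets mounted from your secrets manager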
▸Enable Prometheus remote_write to a long-term store (Thanos, VictoriaMetrics) if you need more than 15 days of history for capacity planning
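
To put numbers on the scrape-interval tradeoff, the Prometheus documentation gives a rough sizing rule: disk needed is retention time in seconds, times ingested samples per second, times bytes per sample, with samples averaging around 1 to 2 bytes. The sketch below applies it; the 9,000 active series figure is an invented example and estimateDiskBytes is a hypothetical helper:

```typescript
// needed_disk = retention_seconds * ingested_samples_per_second * bytes_per_sample
function estimateDiskBytes(
  activeSeries: number,
  scrapeIntervalSec: number,
  retentionDays: number,
  bytesPerSample = 2, // upper end of the documented 1 to 2 byte average
): number {
  const samplesPerSecond = activeSeries / scrapeIntervalSec;
  const retentionSeconds = retentionDays * 24 * 60 * 60;
  return retentionSeconds * samplesPerSecond * bytesPerSample;
}

// Hypothetical fleet exposing 9,000 active series in total, 15 days of retention.
const at15s = estimateDiskBytes(9000, 15, 15);
const at5s = estimateDiskBytes(9000, 5, 15);

console.log(`15s interval: ${(at15s / 1e9).toFixed(2)} GB`);
console.log(`5s interval:  ${(at5s / 1e9).toFixed(2)} GB`);
```

Dropping from 15s to 5s triples the sample volume, which is why it is worth reserving the shorter interval for the few jobs that need it.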

Wrapping Up

The full stack — prom-client in each service, Prometheus scraping them, Alertmanager routing to Slack and PagerDuty, Grafana for dashboards — runs entirely in Docker and takes half a day to set up the first time. After that, onboarding a new service means adding one job to prometheus.yml and dropping the metrics middleware into the new app.

The payoff: instead of finding out a service is broken when a user reports it, you get paged when the error rate stays above 5% for five minutes, or a Slack warning when heap usage stays high. Add a dashboard URL to the alert annotations and that message can link straight to the Grafana panel showing when the problem started.