Skip to content

Observability (Prometheus metrics)

mcp-data-platform exposes operational metrics in Prometheus format on a dedicated HTTP listener. Phase 1 instruments two chokepoints that, between them, cover every tool call through the platform:

  1. MCPToolCallMiddleware records request rate, latency, and outcome for every tool the platform serves (Trino, DataHub, S3, MCP gateway, REST shim, admin tools/call). One series per (tool, toolkit_kind, persona, status_category).
  2. apigateway transport records outbound HTTP rate and latency for every call made by the api toolkit — api_invoke_endpoint, api_export, and the REST gateway shim. One series per (connection, http_status_class, status_category).

Tracing (Phase 2) and per-toolkit instrumentation (Phase 3) are tracked on issue #428 and will land in follow-on PRs.

Enabling metrics

Configuration is environment-only for Phase 1.

Variable Default Purpose
OTEL_METRICS_ENABLED false Master switch. When false no MeterProvider is built and no listener is started.
OTEL_METRICS_ADDR :9090 Bind address for the /metrics HTTP listener.

The listener is intentionally separate from the platform's main MCP/HTTP listener so:

  • scrape traffic does not share the MCP/admin/portal auth path,
  • the metrics port can sit behind a Kubernetes NetworkPolicy (or be unreachable from outside the cluster) without affecting client-facing routes,
  • a slow or stuck scraper cannot starve the main accept loop.

Example:

export OTEL_METRICS_ENABLED=true
export OTEL_METRICS_ADDR=:9090
mcp-data-platform --config /etc/mcp-data-platform/platform.yaml

Kubernetes scrape config

Add a ServiceMonitor (Prometheus Operator) or a static scrape job:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: mcp-data-platform-metrics
spec:
  selector:
    matchLabels:
      app: mcp-data-platform
  endpoints:
    - port: metrics
      path: /metrics
      interval: 30s

Expose port 9090 from the pod and surface it as a metrics service port.

Exposed metrics

Name Type Labels
mcp_tool_calls_total counter tool, toolkit_kind, persona, status_category
mcp_tool_call_duration_seconds histogram tool, toolkit_kind, persona, status_category
mcp_inflight_tool_calls gauge (none)
apigateway_outbound_total counter connection, http_status_class, status_category
apigateway_outbound_duration_seconds histogram connection, http_status_class, status_category

Plus the free Go runtime + process metrics (go_*, process_*).

Label semantics

The label set is deliberately small and closed. High-cardinality fields (user id, email, request id, session id, raw upstream URLs, raw error messages, free-text tool arguments) are not recorded as Prometheus labels — they belong on trace spans (Phase 2) and on audit log rows.

status_category values:

Value Meaning
ok Tool returned successfully.
auth_err Authentication failed (no/invalid credential).
authz_err User authenticated but persona denied the tool.
validation_err Bad arguments or the user declined an elicitation.
upstream_err Tool reached the upstream and the upstream returned an error (Trino query failure, S3 4xx/5xx, API 4xx/5xx, etc.).
internal_err Anything else — a platform bug. Watch this in dashboards; a healthy deployment is near zero.

http_status_class for outbound calls buckets the response into 2xx, 3xx, 4xx, 5xx, or other. Transport-level failures (DNS, dial, TLS, timeout) carry status 0 and surface as http_status_class="other" with status_category="upstream_err".

Sample PromQL queries

P95 latency per tool (last 5 minutes):

histogram_quantile(
  0.95,
  sum by (tool, le) (rate(mcp_tool_call_duration_seconds_bucket[5m]))
)

Tool error rate per minute, excluding auth/authz (those signal client misuse, not platform health):

sum by (tool) (
  rate(mcp_tool_calls_total{status_category=~"upstream_err|internal_err"}[1m])
)

Outbound 5xx rate per upstream connection:

sum by (connection) (
  rate(apigateway_outbound_total{http_status_class="5xx"}[1m])
)

In-flight tool calls right now:

mcp_inflight_tool_calls

Cardinality budget

Counter cardinality is the product of label cardinalities. With tool ≈ 40 tools, toolkit_kind ≈ 8, persona ≈ 5, and status_category = 6, the upper bound for mcp_tool_calls_total is 40 × 8 × 5 × 6 = 9,600 series. In practice only a fraction of combinations occur (most tools belong to one toolkit_kind, and status_category is heavily skewed toward ok).

For outbound: connection ≈ 10, http_status_class = 5, status_category = 6 → 300 series upper bound for apigateway_outbound_total.

Both are well under typical Prometheus limits and well within any managed observability backend's per-metric series budget. If you add labels, weigh the cardinality impact carefully — a user_id label would multiply series by the number of users.

What metrics do NOT replace

  • Audit logs answer "who called what entity, with what result" and remain the source of truth for compliance and user-level analytics. See docs/server/audit.md.
  • Application logs (stderr / structured slog) remain the source of truth for free-text diagnostic detail and stack traces.

Metrics answer "how is the system performing" — they complement, they do not replace.

Disabling metrics

Setting OTEL_METRICS_ENABLED=false (or leaving it unset) skips MeterProvider construction entirely, leaves the listener stopped, and reduces the request-time cost of the metrics middleware to a single nil-pointer compare per request. There is no "lightweight" in-memory metrics mode — either the full Prometheus exporter is running or nothing is.