Observability (Prometheus metrics)¶
mcp-data-platform exposes operational metrics in Prometheus format on a dedicated HTTP listener. Phase 1 instruments two chokepoints that, between them, cover every tool call through the platform:
MCPToolCallMiddlewarerecords request rate, latency, and outcome for every tool the platform serves (Trino, DataHub, S3, MCP gateway, REST shim, admin tools/call). One series per (tool, toolkit_kind, persona, status_category).- apigateway transport records outbound HTTP rate and latency for
every call made by the
apitoolkit —api_invoke_endpoint,api_export, and the REST gateway shim. One series per (connection, http_status_class, status_category).
Tracing (Phase 2) and per-toolkit instrumentation (Phase 3) are tracked on issue #428 and will land in follow-on PRs.
Enabling metrics¶
Configuration is environment-only for Phase 1.
| Variable | Default | Purpose |
|---|---|---|
OTEL_METRICS_ENABLED |
false |
Master switch. When false no MeterProvider is built and no listener is started. |
OTEL_METRICS_ADDR |
:9090 |
Bind address for the /metrics HTTP listener. |
The listener is intentionally separate from the platform's main MCP/HTTP listener so:
- scrape traffic does not share the MCP/admin/portal auth path,
- the metrics port can sit behind a Kubernetes
NetworkPolicy(or be unreachable from outside the cluster) without affecting client-facing routes, - a slow or stuck scraper cannot starve the main accept loop.
Example:
export OTEL_METRICS_ENABLED=true
export OTEL_METRICS_ADDR=:9090
mcp-data-platform --config /etc/mcp-data-platform/platform.yaml
Kubernetes scrape config¶
Add a ServiceMonitor (Prometheus Operator) or a static scrape job:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: mcp-data-platform-metrics
spec:
selector:
matchLabels:
app: mcp-data-platform
endpoints:
- port: metrics
path: /metrics
interval: 30s
Expose port 9090 from the pod and surface it as a metrics service port.
Exposed metrics¶
| Name | Type | Labels |
|---|---|---|
mcp_tool_calls_total |
counter | tool, toolkit_kind, persona, status_category |
mcp_tool_call_duration_seconds |
histogram | tool, toolkit_kind, persona, status_category |
mcp_inflight_tool_calls |
gauge | (none) |
apigateway_outbound_total |
counter | connection, http_status_class, status_category |
apigateway_outbound_duration_seconds |
histogram | connection, http_status_class, status_category |
Plus the free Go runtime + process metrics (go_*, process_*).
Label semantics¶
The label set is deliberately small and closed. High-cardinality fields (user id, email, request id, session id, raw upstream URLs, raw error messages, free-text tool arguments) are not recorded as Prometheus labels — they belong on trace spans (Phase 2) and on audit log rows.
status_category values:
| Value | Meaning |
|---|---|
ok |
Tool returned successfully. |
auth_err |
Authentication failed (no/invalid credential). |
authz_err |
User authenticated but persona denied the tool. |
validation_err |
Bad arguments or the user declined an elicitation. |
upstream_err |
Tool reached the upstream and the upstream returned an error (Trino query failure, S3 4xx/5xx, API 4xx/5xx, etc.). |
internal_err |
Anything else — a platform bug. Watch this in dashboards; a healthy deployment is near zero. |
http_status_class for outbound calls buckets the response into 2xx,
3xx, 4xx, 5xx, or other. Transport-level failures (DNS, dial,
TLS, timeout) carry status 0 and surface as http_status_class="other"
with status_category="upstream_err".
Sample PromQL queries¶
P95 latency per tool (last 5 minutes):
Tool error rate per minute, excluding auth/authz (those signal client misuse, not platform health):
Outbound 5xx rate per upstream connection:
In-flight tool calls right now:
Cardinality budget¶
Counter cardinality is the product of label cardinalities. With
tool ≈ 40 tools, toolkit_kind ≈ 8, persona ≈ 5, and
status_category = 6, the upper bound for
mcp_tool_calls_total is 40 × 8 × 5 × 6 = 9,600 series. In practice
only a fraction of combinations occur (most tools belong to one
toolkit_kind, and status_category is heavily skewed toward ok).
For outbound: connection ≈ 10, http_status_class = 5,
status_category = 6 → 300 series upper bound for
apigateway_outbound_total.
Both are well under typical Prometheus limits and well within any
managed observability backend's per-metric series budget. If you add
labels, weigh the cardinality impact carefully — a user_id label
would multiply series by the number of users.
What metrics do NOT replace¶
- Audit logs answer "who called what entity, with what
result" and remain the source of truth for compliance and
user-level analytics. See
docs/server/audit.md. - Application logs (stderr / structured slog) remain the source of truth for free-text diagnostic detail and stack traces.
Metrics answer "how is the system performing" — they complement, they do not replace.
Disabling metrics¶
Setting OTEL_METRICS_ENABLED=false (or leaving it unset) skips
MeterProvider construction entirely, leaves the listener stopped,
and reduces the request-time cost of the metrics middleware to a
single nil-pointer compare per request. There is no "lightweight"
in-memory metrics mode — either the full Prometheus exporter is
running or nothing is.