Observability (Prometheus metrics)¶
mcp-data-platform exposes operational metrics in Prometheus format on a dedicated HTTP listener. Phase 1 instruments two chokepoints that, between them, cover every tool call through the platform:
MCPToolCallMiddlewarerecords request rate, latency, and outcome for every tool the platform serves (Trino, DataHub, S3, MCP gateway, REST shim, admin tools/call). One series per (tool, toolkit_kind, persona, status_category).- apigateway transport records outbound HTTP rate and latency for
every call made by the
apitoolkit —api_invoke_endpoint,api_export, and the REST gateway shim. One series per (connection, http_status_class, status_category).
Tracing (Phase 2) and per-toolkit instrumentation (Phase 3) are tracked on issue #428 and will land in follow-on PRs.
Configuration¶
Metrics are enabled by default. Configuration is environment-only for Phase 1.
| Variable | Default | Purpose |
|---|---|---|
OTEL_METRICS_ENABLED |
true |
Master switch. Set to false (or 0) to skip MeterProvider construction and not start the listener. |
OTEL_METRICS_ADDR |
:9090 |
Bind address for the /metrics HTTP listener. |
The listener is intentionally separate from the platform's main MCP/HTTP listener so:
- scrape traffic does not share the MCP/admin/portal auth path,
- the metrics port can sit behind a Kubernetes
NetworkPolicy(or be unreachable from outside the cluster) without affecting client-facing routes, - a slow or stuck scraper cannot starve the main accept loop.
To disable on a specific instance:
Kubernetes scrape config¶
Add a ServiceMonitor (Prometheus Operator) or a static scrape job:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: mcp-data-platform-metrics
spec:
selector:
matchLabels:
app: mcp-data-platform
endpoints:
- port: metrics
path: /metrics
interval: 30s
Expose port 9090 from the pod and surface it as a metrics service port.
Exposed metrics¶
| Name | Type | Labels |
|---|---|---|
mcp_tool_calls_total |
counter | tool, toolkit_kind, persona, status_category |
mcp_tool_call_duration_seconds |
histogram | tool, toolkit_kind, persona, status_category |
mcp_inflight_tool_calls |
gauge | (none) |
apigateway_outbound_total |
counter | connection, http_status_class, status_category |
apigateway_outbound_duration_seconds |
histogram | connection, http_status_class, status_category |
apigateway_inbound_requests_total |
counter | connection, operation_id, method, status_class, identity |
apigateway_inbound_duration_seconds |
histogram | connection, operation_id, method, status_class |
Plus the free Go runtime + process metrics (go_*, process_*).
The apigateway_inbound_* pair measures requests hitting the REST shim
(POST /api/v1/gateway/{connection}/invoke, the NiFi-class ETL path),
as opposed to apigateway_outbound_*, which measures the platform's own
calls to the upstream API. operation_id is the OpenAPI operationId
resolved from the connection's catalog by path-template matching (e.g.
GET /v1/users/123 resolves to getUser); it is unknown for
connections with no catalog or requests that match no spec path.
identity is the API key name or OIDC subject (unknown when
unauthenticated) and is recorded on the request counter only, never on
the duration histogram, to keep the histogram's bucket series from
multiplying by the identity dimension. The connection and method
labels are clamped to the registered-connection set and the supported
HTTP-method set respectively, so an arbitrary URL segment or request
body cannot mint unbounded label values (both fall back to unknown).
Resolving identity re-authenticates the request token at the metrics
layer (the REST shim does not surface the in-session identity back up to
the HTTP handler). For API-key callers this is a cheap lookup; for OIDC
it re-verifies the JWT per request. On very high-volume inbound traffic a
per-token identity cache is the planned optimization; until then the
extra verification is the cost of the identity label.
Label semantics¶
The label set is deliberately small and closed. High-cardinality fields (user id, request id, session id, raw upstream URLs, raw error messages, free-text tool arguments) are not recorded as Prometheus labels; they belong on trace spans (Phase 2) and on audit log rows.
The one deliberate exception is the identity label on
apigateway_inbound_requests_total, which is the API key name or OIDC
subject (and may therefore be an email). Its cardinality is bounded by
the count of real callers, which is small for the NiFi-class ETL clients
this metric targets, and it is recorded on the counter only, never on a
histogram.
status_category values:
| Value | Meaning |
|---|---|
ok |
Tool returned successfully. |
auth_err |
Authentication failed (no/invalid credential). |
authz_err |
User authenticated but persona denied the tool. |
validation_err |
Bad arguments or the user declined an elicitation. |
upstream_err |
Tool reached the upstream and the upstream returned an error (Trino query failure, S3 4xx/5xx, API 4xx/5xx, etc.). |
internal_err |
Anything else — a platform bug. Watch this in dashboards; a healthy deployment is near zero. |
http_status_class for outbound calls buckets the response into 2xx,
3xx, 4xx, 5xx, or other. Transport-level failures (DNS, dial,
TLS, timeout) carry status 0 and surface as http_status_class="other"
with status_category="upstream_err".
Sample PromQL queries¶
P95 latency per tool (last 5 minutes):
Tool error rate per minute, excluding auth/authz (those signal client misuse, not platform health):
Outbound 5xx rate per upstream connection:
In-flight tool calls right now:
Cardinality budget¶
Counter cardinality is the product of label cardinalities. With
tool ≈ 40 tools, toolkit_kind ≈ 8, persona ≈ 5, and
status_category = 6, the upper bound for
mcp_tool_calls_total is 40 × 8 × 5 × 6 = 9,600 series. In practice
only a fraction of combinations occur (most tools belong to one
toolkit_kind, and status_category is heavily skewed toward ok).
For outbound: connection ≈ 10, http_status_class = 5,
status_category = 6 → 300 series upper bound for
apigateway_outbound_total.
Both are well under typical Prometheus limits and well within any
managed observability backend's per-metric series budget. If you add
labels, weigh the cardinality impact carefully — a user_id label
would multiply series by the number of users.
What metrics do NOT replace¶
- Audit logs answer "who called what entity, with what
result" and remain the source of truth for compliance and
user-level analytics. See
docs/server/audit.md. - Application logs (stderr / structured slog) remain the source of truth for free-text diagnostic detail and stack traces.
Metrics answer "how is the system performing" — they complement, they do not replace.
Disabling metrics¶
Setting OTEL_METRICS_ENABLED=false skips MeterProvider construction
entirely, leaves the listener stopped, and reduces the request-time
cost of the metrics middleware to a single nil-pointer compare per
request. There is no "lightweight" in-memory metrics mode: either
the full Prometheus exporter is running or nothing is.
PromQL query proxy¶
The metrics above are scraped by Prometheus. To let the portal read them back without exposing Prometheus to the browser (CORS, a separate auth path, an internal service on the public edge), the platform serves a thin authenticated proxy:
| Endpoint | Forwards to |
|---|---|
GET /api/v1/observability/query?query=...&time=... |
Prometheus /api/v1/query |
GET /api/v1/observability/query_range?query=...&start=...&end=...&step=... |
Prometheus /api/v1/query_range |
The proxy reuses the platform auth and persona model and keeps Prometheus on the internal network. The upstream response body is returned unchanged, so the portal can use any PromQL client library.
Configuration¶
Unlike the metrics emitters (environment-only), the proxy is configured
in platform.yaml:
observability:
prometheus:
url: "http://prometheus.observability.svc.cluster.local:9090"
timeout: 30s
basic_auth:
username: "${PROM_USER}"
password: "${PROM_PASS}"
rate_limit_per_second: 10 # per persona; 0 selects the default (10)
When url is empty the proxy is unconfigured: its endpoints return
503 with body observability backend not configured so the portal
renders a clean empty state instead of erroring.
Access control¶
Each request must be authenticated and the caller's persona must grant
the observability:read capability. This capability is checked through
the same persona tool-allow filter that gates tools, so operators grant
it in the portal persona editor by adding observability:read to a
persona's allowed tools. Default-deny applies: a persona without it (and
without a matching wildcard) is denied with 403. Admin personas with
allow: ["*"] receive it automatically.
A per-persona rate limit (default 10 queries/second) returns 429 when
exceeded, so a runaway portal session for one persona cannot starve
others. Proxy queries are not written to the audit log: the dashboards
poll these endpoints on a refresh interval, so auditing each one flooded
the audit trail and the tool-usage analytics with dashboard-internal
reads that are not MCP tool calls. Responses are not cached on the
platform; Prometheus is the cache and the portal applies its own
client-side stale-time.