Skip to content

Observability (Prometheus metrics)

mcp-data-platform exposes operational metrics in Prometheus format on a dedicated HTTP listener. Phase 1 instruments two chokepoints that, between them, cover every tool call through the platform:

  1. MCPToolCallMiddleware records request rate, latency, and outcome for every tool the platform serves (Trino, DataHub, S3, MCP gateway, REST shim, admin tools/call). One series per (tool, toolkit_kind, persona, status_category).
  2. apigateway transport records outbound HTTP rate and latency for every call made by the api toolkit — api_invoke_endpoint, api_export, and the REST gateway shim. One series per (connection, http_status_class, status_category).

Tracing (Phase 2) and per-toolkit instrumentation (Phase 3) are tracked on issue #428 and will land in follow-on PRs.

Configuration

Metrics are enabled by default. Configuration is environment-only for Phase 1.

Variable Default Purpose
OTEL_METRICS_ENABLED true Master switch. Set to false (or 0) to skip MeterProvider construction and not start the listener.
OTEL_METRICS_ADDR :9090 Bind address for the /metrics HTTP listener.

The listener is intentionally separate from the platform's main MCP/HTTP listener so:

  • scrape traffic does not share the MCP/admin/portal auth path,
  • the metrics port can sit behind a Kubernetes NetworkPolicy (or be unreachable from outside the cluster) without affecting client-facing routes,
  • a slow or stuck scraper cannot starve the main accept loop.

To disable on a specific instance:

export OTEL_METRICS_ENABLED=false
mcp-data-platform --config /etc/mcp-data-platform/platform.yaml

Kubernetes scrape config

Add a ServiceMonitor (Prometheus Operator) or a static scrape job:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: mcp-data-platform-metrics
spec:
  selector:
    matchLabels:
      app: mcp-data-platform
  endpoints:
    - port: metrics
      path: /metrics
      interval: 30s

Expose port 9090 from the pod and surface it as a metrics service port.

Exposed metrics

Name Type Labels
mcp_tool_calls_total counter tool, toolkit_kind, persona, status_category
mcp_tool_call_duration_seconds histogram tool, toolkit_kind, persona, status_category
mcp_inflight_tool_calls gauge (none)
apigateway_outbound_total counter connection, http_status_class, status_category
apigateway_outbound_duration_seconds histogram connection, http_status_class, status_category
apigateway_inbound_requests_total counter connection, operation_id, method, status_class, identity
apigateway_inbound_duration_seconds histogram connection, operation_id, method, status_class

Plus the free Go runtime + process metrics (go_*, process_*).

The apigateway_inbound_* pair measures requests hitting the REST shim (POST /api/v1/gateway/{connection}/invoke, the NiFi-class ETL path), as opposed to apigateway_outbound_*, which measures the platform's own calls to the upstream API. operation_id is the OpenAPI operationId resolved from the connection's catalog by path-template matching (e.g. GET /v1/users/123 resolves to getUser); it is unknown for connections with no catalog or requests that match no spec path. identity is the API key name or OIDC subject (unknown when unauthenticated) and is recorded on the request counter only, never on the duration histogram, to keep the histogram's bucket series from multiplying by the identity dimension. The connection and method labels are clamped to the registered-connection set and the supported HTTP-method set respectively, so an arbitrary URL segment or request body cannot mint unbounded label values (both fall back to unknown).

Resolving identity re-authenticates the request token at the metrics layer (the REST shim does not surface the in-session identity back up to the HTTP handler). For API-key callers this is a cheap lookup; for OIDC it re-verifies the JWT per request. On very high-volume inbound traffic a per-token identity cache is the planned optimization; until then the extra verification is the cost of the identity label.

Label semantics

The label set is deliberately small and closed. High-cardinality fields (user id, request id, session id, raw upstream URLs, raw error messages, free-text tool arguments) are not recorded as Prometheus labels; they belong on trace spans (Phase 2) and on audit log rows.

The one deliberate exception is the identity label on apigateway_inbound_requests_total, which is the API key name or OIDC subject (and may therefore be an email). Its cardinality is bounded by the count of real callers, which is small for the NiFi-class ETL clients this metric targets, and it is recorded on the counter only, never on a histogram.

status_category values:

Value Meaning
ok Tool returned successfully.
auth_err Authentication failed (no/invalid credential).
authz_err User authenticated but persona denied the tool.
validation_err Bad arguments or the user declined an elicitation.
upstream_err Tool reached the upstream and the upstream returned an error (Trino query failure, S3 4xx/5xx, API 4xx/5xx, etc.).
internal_err Anything else — a platform bug. Watch this in dashboards; a healthy deployment is near zero.

http_status_class for outbound calls buckets the response into 2xx, 3xx, 4xx, 5xx, or other. Transport-level failures (DNS, dial, TLS, timeout) carry status 0 and surface as http_status_class="other" with status_category="upstream_err".

Sample PromQL queries

P95 latency per tool (last 5 minutes):

histogram_quantile(
  0.95,
  sum by (tool, le) (rate(mcp_tool_call_duration_seconds_bucket[5m]))
)

Tool error rate per minute, excluding auth/authz (those signal client misuse, not platform health):

sum by (tool) (
  rate(mcp_tool_calls_total{status_category=~"upstream_err|internal_err"}[1m])
)

Outbound 5xx rate per upstream connection:

sum by (connection) (
  rate(apigateway_outbound_total{http_status_class="5xx"}[1m])
)

In-flight tool calls right now:

mcp_inflight_tool_calls

Cardinality budget

Counter cardinality is the product of label cardinalities. With tool ≈ 40 tools, toolkit_kind ≈ 8, persona ≈ 5, and status_category = 6, the upper bound for mcp_tool_calls_total is 40 × 8 × 5 × 6 = 9,600 series. In practice only a fraction of combinations occur (most tools belong to one toolkit_kind, and status_category is heavily skewed toward ok).

For outbound: connection ≈ 10, http_status_class = 5, status_category = 6 → 300 series upper bound for apigateway_outbound_total.

Both are well under typical Prometheus limits and well within any managed observability backend's per-metric series budget. If you add labels, weigh the cardinality impact carefully — a user_id label would multiply series by the number of users.

What metrics do NOT replace

  • Audit logs answer "who called what entity, with what result" and remain the source of truth for compliance and user-level analytics. See docs/server/audit.md.
  • Application logs (stderr / structured slog) remain the source of truth for free-text diagnostic detail and stack traces.

Metrics answer "how is the system performing" — they complement, they do not replace.

Disabling metrics

Setting OTEL_METRICS_ENABLED=false skips MeterProvider construction entirely, leaves the listener stopped, and reduces the request-time cost of the metrics middleware to a single nil-pointer compare per request. There is no "lightweight" in-memory metrics mode: either the full Prometheus exporter is running or nothing is.

PromQL query proxy

The metrics above are scraped by Prometheus. To let the portal read them back without exposing Prometheus to the browser (CORS, a separate auth path, an internal service on the public edge), the platform serves a thin authenticated proxy:

Endpoint Forwards to
GET /api/v1/observability/query?query=...&time=... Prometheus /api/v1/query
GET /api/v1/observability/query_range?query=...&start=...&end=...&step=... Prometheus /api/v1/query_range

The proxy reuses the platform auth and persona model and keeps Prometheus on the internal network. The upstream response body is returned unchanged, so the portal can use any PromQL client library.

Configuration

Unlike the metrics emitters (environment-only), the proxy is configured in platform.yaml:

observability:
  prometheus:
    url: "http://prometheus.observability.svc.cluster.local:9090"
    timeout: 30s
    basic_auth:
      username: "${PROM_USER}"
      password: "${PROM_PASS}"
    rate_limit_per_second: 10   # per persona; 0 selects the default (10)

When url is empty the proxy is unconfigured: its endpoints return 503 with body observability backend not configured so the portal renders a clean empty state instead of erroring.

Access control

Each request must be authenticated and the caller's persona must grant the observability:read capability. This capability is checked through the same persona tool-allow filter that gates tools, so operators grant it in the portal persona editor by adding observability:read to a persona's allowed tools. Default-deny applies: a persona without it (and without a matching wildcard) is denied with 403. Admin personas with allow: ["*"] receive it automatically.

A per-persona rate limit (default 10 queries/second) returns 429 when exceeded, so a runaway portal session for one persona cannot starve others. Proxy queries are not written to the audit log: the dashboards poll these endpoints on a refresh interval, so auditing each one flooded the audit trail and the tool-usage analytics with dashboard-internal reads that are not MCP tool calls. Responses are not cached on the platform; Prometheus is the cache and the portal applies its own client-side stale-time.