Observability (Prometheus metrics)¶

mcp-data-platform exposes operational metrics in Prometheus format on a dedicated HTTP listener. Phase 1 instruments two chokepoints that, between them, cover every tool call through the platform:

MCPToolCallMiddleware records request rate, latency, and outcome for every tool the platform serves (Trino, DataHub, S3, MCP gateway, REST shim, admin tools/call). One series per (tool, toolkit_kind, persona, status_category).
apigateway transport records outbound HTTP rate and latency for every call made by the api toolkit — api_invoke_endpoint, api_export, and the REST gateway shim. One series per (connection, http_status_class, status_category).

Tracing (Phase 2) and per-toolkit instrumentation (Phase 3) are tracked on issue #428 and will land in follow-on PRs.

Enabling metrics¶

Configuration is environment-only for Phase 1.

Variable	Default	Purpose
`OTEL_METRICS_ENABLED`	`false`	Master switch. When `false` no MeterProvider is built and no listener is started.
`OTEL_METRICS_ADDR`	`:9090`	Bind address for the `/metrics` HTTP listener.

The listener is intentionally separate from the platform's main MCP/HTTP listener so:

scrape traffic does not share the MCP/admin/portal auth path,
the metrics port can sit behind a Kubernetes NetworkPolicy (or be unreachable from outside the cluster) without affecting client-facing routes,
a slow or stuck scraper cannot starve the main accept loop.

Example:

export OTEL_METRICS_ENABLED=true
export OTEL_METRICS_ADDR=:9090
mcp-data-platform --config /etc/mcp-data-platform/platform.yaml

Kubernetes scrape config¶

Add a ServiceMonitor (Prometheus Operator) or a static scrape job:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: mcp-data-platform-metrics
spec:
  selector:
    matchLabels:
      app: mcp-data-platform
  endpoints:
    - port: metrics
      path: /metrics
      interval: 30s

Expose port 9090 from the pod and surface it as a metrics service port.

Exposed metrics¶

Name	Type	Labels
`mcp_tool_calls_total`	counter	`tool`, `toolkit_kind`, `persona`, `status_category`
`mcp_tool_call_duration_seconds`	histogram	`tool`, `toolkit_kind`, `persona`, `status_category`
`mcp_inflight_tool_calls`	gauge	(none)
`apigateway_outbound_total`	counter	`connection`, `http_status_class`, `status_category`
`apigateway_outbound_duration_seconds`	histogram	`connection`, `http_status_class`, `status_category`

Plus the free Go runtime + process metrics (go_*, process_*).

Label semantics¶

The label set is deliberately small and closed. High-cardinality fields (user id, email, request id, session id, raw upstream URLs, raw error messages, free-text tool arguments) are not recorded as Prometheus labels — they belong on trace spans (Phase 2) and on audit log rows.

status_category values:

Value	Meaning
`ok`	Tool returned successfully.
`auth_err`	Authentication failed (no/invalid credential).
`authz_err`	User authenticated but persona denied the tool.
`validation_err`	Bad arguments or the user declined an elicitation.
`upstream_err`	Tool reached the upstream and the upstream returned an error (Trino query failure, S3 4xx/5xx, API 4xx/5xx, etc.).
`internal_err`	Anything else — a platform bug. Watch this in dashboards; a healthy deployment is near zero.

http_status_class for outbound calls buckets the response into 2xx, 3xx, 4xx, 5xx, or other. Transport-level failures (DNS, dial, TLS, timeout) carry status 0 and surface as http_status_class="other" with status_category="upstream_err".

Sample PromQL queries¶

P95 latency per tool (last 5 minutes):

histogram_quantile(
  0.95,
  sum by (tool, le) (rate(mcp_tool_call_duration_seconds_bucket[5m]))
)

Tool error rate per minute, excluding auth/authz (those signal client misuse, not platform health):

sum by (tool) (
  rate(mcp_tool_calls_total{status_category=~"upstream_err|internal_err"}[1m])
)

Outbound 5xx rate per upstream connection:

sum by (connection) (
  rate(apigateway_outbound_total{http_status_class="5xx"}[1m])
)

In-flight tool calls right now:

mcp_inflight_tool_calls

Cardinality budget¶

Counter cardinality is the product of label cardinalities. With tool ≈ 40 tools, toolkit_kind ≈ 8, persona ≈ 5, and status_category = 6, the upper bound for mcp_tool_calls_total is 40 × 8 × 5 × 6 = 9,600 series. In practice only a fraction of combinations occur (most tools belong to one toolkit_kind, and status_category is heavily skewed toward ok).

For outbound: connection ≈ 10, http_status_class = 5, status_category = 6 → 300 series upper bound for apigateway_outbound_total.

Both are well under typical Prometheus limits and well within any managed observability backend's per-metric series budget. If you add labels, weigh the cardinality impact carefully — a user_id label would multiply series by the number of users.

What metrics do NOT replace¶

Audit logs answer "who called what entity, with what result" and remain the source of truth for compliance and user-level analytics. See docs/server/audit.md.
Application logs (stderr / structured slog) remain the source of truth for free-text diagnostic detail and stack traces.

Metrics answer "how is the system performing" — they complement, they do not replace.

Disabling metrics¶

Setting OTEL_METRICS_ENABLED=false (or leaving it unset) skips MeterProvider construction entirely, leaves the listener stopped, and reduces the request-time cost of the metrics middleware to a single nil-pointer compare per request. There is no "lightweight" in-memory metrics mode — either the full Prometheus exporter is running or nothing is.