Tuning and Scaling¶
This page documents resource sizing, Go runtime tuning, and horizontal-scaling
characteristics for mcp-data-platform. Numbers are starting points for a
production deployment; measure your own workload with the built-in Prometheus
endpoint before locking limits.
1. Baseline measurements¶
Steady-state observations from a single-replica production install handling
roughly 1 request/second of API gateway traffic (NiFi-driven), default config,
LOG_LEVEL=debug, semantic enrichment enabled:
| Metric | Value |
|---|---|
| CPU (avg) | ~125m (range 100m to 160m) |
| CPU (peak observed) | 160m |
| Memory (RSS) | ~68 MiB, very stable |
| Pod uptime sampled | ~2.5 hours, 10 samples 10s apart |
The pod was running LOG_LEVEL=debug, which inflates CPU and allocations. A
production install should run at info. Memory is essentially flat; the Go
heap is bounded by short-lived per-request allocations plus a small set of
long-lived caches.
2. Resource requests and limits¶
The defaults shipped in configs/ are intentionally conservative. For higher
traffic, scale them as follows. The "high-traffic" column targets ~10 sustained
requests/second with bursty peaks (e.g., scheduled ETL jobs against the API
gateway).
| Field | Low (≤1 RPS) | Medium (1-5 RPS) | High (5-15 RPS) |
|---|---|---|---|
| requests.cpu | 100m | 250m | 500m |
| limits.cpu | 500m | 1500m | 3000m |
| requests.memory | 128Mi | 256Mi | 512Mi |
| limits.memory | 256Mi | 512Mi | 1Gi |
Set requests.cpu close to observed steady-state to give the scheduler an
honest picture; set limits.cpu 3-5x higher than steady-state to absorb burst
without throttling. CPU throttling under burst load is the most common cause
of latency spikes in this service.
3. Go runtime environment¶
The binary is a static Go program; the Go runtime is not cgroup-aware by default. Set these env vars on the container to match the runtime to the cgroup.
GOMEMLIMIT (required)¶
GOMEMLIMIT tells the Go GC the soft memory cap. Without it the GC defaults
to a heap-relative target (GOGC=100, double the live heap), which can push
allocations past the cgroup memory limit and trigger an OOM kill even though
the process could have GC'd more aggressively.
Set GOMEMLIMIT to roughly 90% of limits.memory. The 90% rule leaves headroom
for memory that GOMEMLIMIT does not bound (cgo allocations, kernel network
buffers, OS thread stacks). Pair it with a Kubernetes downward API reference
(a resourceFieldRef on limits.memory) if you want it to track the limit
automatically. The downward API exposes the full limit, so multiply or use a
percentage-based wrapper if you want headroom.
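As a hedged illustration of the percentage-based wrapper idea, the following Go sketch derives a soft limit from the cgroup at startup and applies it with runtime/debug.SetMemoryLimit, which has the same effect as setting GOMEMLIMIT. It assumes cgroup v2 mounted at /sys/fs/cgroup; the helper name softLimit is hypothetical and is not part of mcp-data-platform.

```go
package main

import (
	"fmt"
	"os"
	"runtime/debug"
	"strconv"
	"strings"
)

// softLimit parses the contents of a cgroup v2 memory.max file and returns
// the given fraction of it in bytes. It reports false when the cgroup has no
// limit ("max") or the contents cannot be parsed.
func softLimit(raw string, fraction float64) (int64, bool) {
	s := strings.TrimSpace(raw)
	if s == "" || s == "max" {
		return 0, false
	}
	n, err := strconv.ParseInt(s, 10, 64)
	if err != nil || n <= 0 {
		return 0, false
	}
	return int64(float64(n) * fraction), true
}

func main() {
	// Assumption: cgroup v2 at the conventional mount point.
	raw, err := os.ReadFile("/sys/fs/cgroup/memory.max")
	if err != nil {
		fmt.Println("could not read cgroup memory limit; leaving the Go limit unchanged")
		return
	}
	if limit, ok := softLimit(string(raw), 0.9); ok {
		debug.SetMemoryLimit(limit) // same effect as the GOMEMLIMIT env var
		fmt.Printf("soft memory limit set to %d bytes\n", limit)
	}
}
```

Setting the env var in the manifest remains the simpler option; a startup hook like this only pays off when the limit changes often between deployments.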
GOMAXPROCS (required)¶
GOMAXPROCS defaults to the number of host CPUs visible inside the
container, which in Kubernetes is the node's CPU count, not the cgroup's
quota. On a 64-core node with a 500m CPU limit, Go spawns 64 worker threads,
fights itself for the 0.5 CPU quota, and wastes cycles on context switches
and scheduler contention.
Two options:
- Static value matching limits.cpu rounded up, for example GOMAXPROCS=1 for a
  500m limit.
- go.uber.org/automaxprocs: pull the package into main.go; it reads the cgroup
  CPU quota at startup and sets GOMAXPROCS accordingly. This is the recommended
  approach for containers where the limit may change between deployments.
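The static rounding rule can be sketched in a few lines of Go. This is only an illustration of the arithmetic (millicores rounded up to whole CPUs, never below 1); the function name procsForMilliCPU and the 500m value are assumptions, and the sketch makes no claim about how go.uber.org/automaxprocs rounds internally.

```go
package main

import (
	"fmt"
	"runtime"
)

// procsForMilliCPU rounds a CPU limit in millicores up to whole CPUs and
// never returns less than 1.
func procsForMilliCPU(milli int) int {
	if milli <= 0 {
		return 1
	}
	return (milli + 999) / 1000
}

func main() {
	const limitMilli = 500 // hypothetical limits.cpu of 500m
	n := procsForMilliCPU(limitMilli)
	prev := runtime.GOMAXPROCS(n) // returns the previous setting
	fmt.Printf("GOMAXPROCS: %d -> %d\n", prev, n)
}
```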
GOGC (optional)¶
The default GOGC=100 is fine for typical workloads. Lower values (50, 75)
GC more aggressively, trading CPU for lower steady-state heap. Higher values
reduce GC CPU at the cost of more RSS. Tune only with measurements in hand;
do not lower GOGC to "save memory" without checking that GOMEMLIMIT is
already in place.
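To make the GOGC trade-off concrete, here is a rough Go sketch of the heap size at which the next collection starts. It is a simplified model, not the runtime's actual pacer: it ignores GC roots and non-heap runtime memory, and the 100 MiB live heap is a made-up figure.

```go
package main

import "fmt"

const mib = 1 << 20

// heapTarget approximates the heap size that triggers the next GC cycle:
// the live heap grown by GOGC percent, capped by the soft memory limit when
// one is set (memLimit > 0).
func heapTarget(liveHeap, gogc, memLimit int64) int64 {
	target := liveHeap + liveHeap*gogc/100
	if memLimit > 0 && target > memLimit {
		return memLimit
	}
	return target
}

func main() {
	live := int64(100 * mib) // hypothetical live heap
	for _, gogc := range []int64{50, 100, 200} {
		fmt.Printf("GOGC=%d -> ~%d MiB\n", gogc, heapTarget(live, gogc, 0)/mib)
	}
	// With a soft limit in place, a high GOGC cannot push the target past it.
	fmt.Printf("GOGC=200, 230 MiB limit -> ~%d MiB\n", heapTarget(live, 200, 230*mib)/mib)
}
```

The last line shows why GOMEMLIMIT should be in place before GOGC is tuned: the limit, not GOGC, is what keeps the heap under the cgroup ceiling.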
Putting it together¶
```yaml
env:
  - name: GOMEMLIMIT
    value: "900MiB" # limits.memory: 1Gi, ~88%
  - name: GOMAXPROCS
    value: "3" # limits.cpu: 3000m
  - name: GOGC
    value: "100" # default; document the intent
```
4. Horizontal scaling¶
The service is designed to run with multiple replicas behind a Kubernetes Service. The following components are HA-safe:
- OAuth 2.1 server: clients, authorization codes, refresh tokens, and PKCE
  verifiers are persisted to PostgreSQL when DATABASE_DSN is set
  (pkg/oauth/postgres/store.go). The in-memory store is a dev-only fallback
  used when DATABASE_DSN is unset.
- Audit log: writes go straight to PostgreSQL (pkg/audit/postgres/store.go).
  At 1M tool calls/day that is roughly 12 writes/second average, well within
  a single Postgres instance.
- Embedding jobs: the embed-jobs worker uses a PostgreSQL-backed queue with
  pg_try_advisory_lock for coordination
  (pkg/platform/apigateway_embed_jobs.go). Multiple replicas compete for work
  without duplicating jobs.
- Connection OAuth refresh: the upstream token refresher uses a PostgreSQL
  advisory lock so only one replica refreshes a given connection at a time
  (pkg/connoauth/refresher.go).
- API gateway REST shim: each REST request builds an ephemeral in-memory MCP
  session for the duration of the call (pkg/gatewayhttp/handler.go:203).
  There is no cross-request session state, so any replica can serve any
  request.
- Outbound HTTP: the API gateway toolkit maintains a per-connection
  http.Transport with MaxIdleConns and IdleConnTimeout
  (pkg/toolkits/apigateway/toolkit.go:1046). Connections to upstream APIs are
  pooled inside each replica.
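The try-lock coordination used by the embedding and token-refresh paths can be sketched as follows. This is not the platform's code: memLocker is an in-memory stand-in for PostgreSQL so the example runs without a database, and the names tryLocker, memLocker, and claimAndRun are hypothetical. In a real deployment TryLock would correspond to SELECT pg_try_advisory_lock(key), which returns false immediately instead of blocking.

```go
package main

import (
	"fmt"
	"sync"
)

// tryLocker models pg_try_advisory_lock / pg_advisory_unlock: TryLock returns
// false right away when the key is already held by someone else.
type tryLocker interface {
	TryLock(key int64) bool
	Unlock(key int64)
}

// memLocker is an in-memory stand-in for the database lock table.
type memLocker struct {
	mu   sync.Mutex
	held map[int64]bool
}

func newMemLocker() *memLocker { return &memLocker{held: make(map[int64]bool)} }

func (l *memLocker) TryLock(key int64) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.held[key] {
		return false
	}
	l.held[key] = true
	return true
}

func (l *memLocker) Unlock(key int64) {
	l.mu.Lock()
	defer l.mu.Unlock()
	delete(l.held, key)
}

// claimAndRun runs work only if this replica wins the lock for jobID and
// reports whether the work ran.
func claimAndRun(l tryLocker, jobID int64, work func()) bool {
	if !l.TryLock(jobID) {
		return false // another replica holds the job; skip it
	}
	defer l.Unlock(jobID)
	work()
	return true
}

func main() {
	locks := newMemLocker()
	ran := claimAndRun(locks, 42, func() {
		// While replica A holds job 42, replica B's attempt is refused.
		fmt.Println("replica B ran:", claimAndRun(locks, 42, func() {}))
	})
	fmt.Println("replica A ran:", ran)
}
```

Because a losing replica skips the job instead of waiting, adding replicas increases throughput without any replica blocking on another.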
Per-replica state to be aware of¶
These caches are per-replica. They affect behavior, not correctness:
- SessionEnrichmentCache: deduplicates semantic enrichment payloads within a
  long-running MCP session (pkg/middleware/session_cache.go). REST-shim calls
  (the high-volume HTTP-client path) get a fresh session per request, so this
  cache is effectively bypassed. For sticky MCP sessions (Claude Desktop,
  Cursor), routing the same session to a different replica costs a few extra
  enrichment payloads but does not affect correctness.
- Portal rate limiter: token bucket keyed by IP in pkg/portal/. This guards
  the public viewer page, not the API gateway. With N replicas a single client
  sees roughly N times its configured budget. If you depend on the portal rate
  limit for SLO enforcement, terminate at an ingress-level rate limiter
  instead.
Replica count and PostgreSQL connections¶
The DB pool defaults to MaxOpenConns = 25 per replica
(pkg/platform/config.go:63). Three replicas total 75 connections. Default
Postgres max_connections is 100; account for the migrate job, admin REST
handlers, and any other tenants of the same database.
Recommended:
- 1 replica: leave defaults.
- 2-3 replicas: drop database.max_open_conns to 15 in platform.yaml
  (3 × 15 = 45, comfortable margin).
- Beyond 3 replicas: run a separate read-replica or pgbouncer.
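The connection budget is simple arithmetic, sketched here in Go so it can be rerun for other replica counts. The reserved figure (connections held back for the migrate job, admin handlers, and other tenants) is an assumption you would replace with your own number; the function names are hypothetical.

```go
package main

import "fmt"

// totalConns is the worst-case number of connections the replicas can open.
func totalConns(replicas, maxOpenConns int) int {
	return replicas * maxOpenConns
}

// maxOpenConnsPerReplica returns the largest per-replica pool size that keeps
// replicas*pool + reserved within the server's max_connections.
func maxOpenConnsPerReplica(maxConnections, reserved, replicas int) int {
	if replicas <= 0 || maxConnections <= reserved {
		return 0
	}
	return (maxConnections - reserved) / replicas
}

func main() {
	// Default pool of 25 across three replicas, as in the text above.
	fmt.Println("3 replicas x 25 =", totalConns(3, 25))

	// Assumption: max_connections of 100 with 40 connections reserved.
	for _, replicas := range []int{1, 2, 3, 5} {
		fmt.Printf("%d replicas -> up to %d conns each\n",
			replicas, maxOpenConnsPerReplica(100, 40, replicas))
	}
}
```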
Liveness and readiness¶
The deployment exposes /healthz and /readyz on port 8080:
- readinessProbe: 5s initial, 10s period, 3s timeout
- livenessProbe: 10s initial, 30s period, 3s timeout
On rolling updates, set strategy.rollingUpdate.maxSurge: 1 and
maxUnavailable: 0 so at least one replica is always serving.
Graceful shutdown¶
On SIGTERM the platform runs a four-stage shutdown chain. Each stage has
its own timeout; the sum must fit inside the pod's
terminationGracePeriodSeconds or Kubernetes will SIGKILL whatever is
still running.
| Stage | What happens | Default | Configurable via |
|---|---|---|---|
| 1. Pre-shutdown delay | /readyz flips to draining (503). Sleep so the LoadBalancer/Ingress can deregister this pod and stop sending new requests. | 2s | server.shutdown.pre_shutdown_delay |
| 2. HTTP drain | http.Server.Shutdown waits for in-flight handlers (MCP tool calls, REST shim invokes) to return. Handlers that don't finish by the deadline have their request context canceled and are abandoned. | 25s | server.shutdown.grace_period |
| 3. Lifecycle stop | platform.Stop fires every OnStop callback: embed-jobs worker, reaper, reconciler, LISTEN/NOTIFY listener. Bounded so a hung worker (slow Postgres, stuck embedding call) cannot stall shutdown. Abandoned jobs are safe: their PostgreSQL leases expire and another replica reclaims them on the next poll tick. | 10s | hard-coded lifecycleStopTimeout in cmd/mcp-data-platform/main.go |
| 4. Platform close | Audit flush, OAuth refresher stop, session cache flush, DB pool close, metrics provider shutdown. | a few seconds | n/a |
The full budget for the defaults is 2 + 25 + 10 + ~3 ≈ 40s. Set
terminationGracePeriodSeconds accordingly. The default 30s in the
example manifest is too tight for the default platform configuration;
60s leaves comfortable headroom.
For deployments with long-running tool calls (large Trino queries, slow
upstream API gateway calls), raise server.shutdown.grace_period and
terminationGracePeriodSeconds together. Reasonable starting point:
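```yaml
# platform.yaml (values inferred from the budget comment below)
server:
  shutdown:
    pre_shutdown_delay: 3s
    grace_period: 45s

# Deployment manifest
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 70 # 3 + 45 + 10 + ~5 buffer
```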
In-flight tool calls that exceed the grace period are abandoned, not rolled back. If a tool has a side effect (write to DataHub, S3 PUT, external API mutation), the side effect may or may not have completed when the handler is canceled. For idempotent operations this is fine; for non-idempotent ones, design the upstream caller to retry safely.
Pod anti-affinity¶
For 2+ replicas, prefer scheduling on different nodes:
```yaml
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
              - key: app
                operator: In
                values:
                  - mcp-data-platform
          topologyKey: kubernetes.io/hostname
```
5. Observability¶
Prometheus metrics are exposed on :9090 by default (OTEL_METRICS_ADDR
overrides; OTEL_METRICS_ENABLED=false disables the listener). Metrics include
per-tool invocation counts and durations, API gateway upstream latency, and
the Go runtime collectors. With metrics on, the recommended HPA driver is the
rate of change of apigateway_invoke_duration_seconds_count (request rate) or
process_cpu_seconds_total (CPU saturation), not raw CPU utilization.
LOG_LEVEL=info is the production default. debug adds substantial
allocations on the hot path; only enable it temporarily.
6. Autoscaling¶
A horizontal pod autoscaler driven by CPU utilization works correctly once
GOMAXPROCS is set:
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: mcp-data-platform
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: mcp-data-platform
  minReplicas: 2
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
```
For traffic-shaped scaling, use the Prometheus adapter and target the API gateway request rate metric directly.