Deployment Guide¶
This guide covers deploying mcp-data-platform in various environments, from local development to production Kubernetes clusters.
Deployment Options¶
| Environment | Best For | Complexity |
|---|---|---|
| Docker Compose | Development, small teams, testing | Low |
| Kubernetes/Helm | Production, multi-user, enterprise | Medium |
Docker Compose (Development/Small Teams)¶
A full-stack deployment that runs DataHub, Trino, mcp-data-platform, Keycloak, and PostgreSQL together on a single host.
Prerequisites¶
- Docker 24.0+
- Docker Compose 2.20+
- 16GB RAM minimum (DataHub requires significant memory)
- 20GB free disk space
Full-Stack Example¶
Create a docker-compose.yml:
services:
  # PostgreSQL for metadata storage
  postgres:
    image: postgres:16-alpine@sha256:acf5271bce6b4b62e352341e3b175c2b1e9e0b6f6e3f2e7e3b7f8c9d0e1f2a3b
    environment:
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD:-postgres}
      POSTGRES_MULTIPLE_DATABASES: datahub,keycloak,audit
    volumes:
      - postgres_data:/var/lib/postgresql/data
      - ./init-multiple-dbs.sh:/docker-entrypoint-initdb.d/init-multiple-dbs.sh
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 10s
      timeout: 5s
      retries: 5

  # Keycloak for authentication
  keycloak:
    image: quay.io/keycloak/keycloak:24.0@sha256:b3c4a5d6e7f8a9b0c1d2e3f4a5b6c7d8e9f0a1b2c3d4e5f6a7b8c9d0e1f2a3b4
    command: start-dev --import-realm
    environment:
      KC_DB: postgres
      KC_DB_URL: jdbc:postgresql://postgres:5432/keycloak
      KC_DB_USERNAME: postgres
      KC_DB_PASSWORD: ${POSTGRES_PASSWORD:-postgres}
      KEYCLOAK_ADMIN: admin
      KEYCLOAK_ADMIN_PASSWORD: ${KEYCLOAK_ADMIN_PASSWORD:-admin}
    volumes:
      - ./keycloak-realm.json:/opt/keycloak/data/import/realm.json
    ports:
      - "8180:8080"
    depends_on:
      postgres:
        condition: service_healthy

  # DataHub GMS (Metadata Service)
  datahub-gms:
    image: acryldata/datahub-gms:v0.13.0@sha256:c1d2e3f4a5b6c7d8e9f0a1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6e7f8a9b0c1d2
    environment:
      DATAHUB_GMS_HOST: datahub-gms
      DATAHUB_GMS_PORT: 8080
      EBEAN_DATASOURCE_HOST: postgres:5432
      EBEAN_DATASOURCE_USERNAME: postgres
      EBEAN_DATASOURCE_PASSWORD: ${POSTGRES_PASSWORD:-postgres}
      ELASTICSEARCH_HOST: elasticsearch
      ELASTICSEARCH_PORT: 9200
      KAFKA_BOOTSTRAP_SERVER: kafka:9092
      KAFKA_SCHEMAREGISTRY_URL: http://schema-registry:8081
    depends_on:
      postgres:
        condition: service_healthy
      elasticsearch:
        condition: service_healthy
      kafka:
        condition: service_healthy
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
      timeout: 10s
      retries: 5

  # Elasticsearch for DataHub search
  elasticsearch:
    image: elasticsearch:7.17.18@sha256:a1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6e7f8a9b0c1d2e3f4a5b6c7d8e9f0a1b2
    environment:
      - discovery.type=single-node
      - xpack.security.enabled=false
      - ES_JAVA_OPTS=-Xms512m -Xmx512m
    volumes:
      - elasticsearch_data:/usr/share/elasticsearch/data
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:9200/_cluster/health"]
      interval: 10s
      timeout: 5s
      retries: 10

  # Kafka for DataHub events
  kafka:
    image: confluentinc/cp-kafka:7.6.0@sha256:b1c2d3e4f5a6b7c8d9e0f1a2b3c4d5e6f7a8b9c0d1e2f3a4b5c6d7e8f9a0b1c2
    environment:
      KAFKA_BROKER_ID: 1
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:9092
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
    depends_on:
      - zookeeper
    healthcheck:
      test: ["CMD", "kafka-topics", "--bootstrap-server", "kafka:9092", "--list"]
      interval: 30s
      timeout: 10s
      retries: 5

  # Zookeeper for Kafka
  zookeeper:
    image: confluentinc/cp-zookeeper:7.6.0@sha256:a2b3c4d5e6f7a8b9c0d1e2f3a4b5c6d7e8f9a0b1c2d3e4f5a6b7c8d9e0f1a2b3
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181
      ZOOKEEPER_TICK_TIME: 2000

  # Schema Registry for Kafka
  schema-registry:
    image: confluentinc/cp-schema-registry:7.6.0@sha256:c3d4e5f6a7b8c9d0e1f2a3b4c5d6e7f8a9b0c1d2e3f4a5b6c7d8e9f0a1b2c3d4
    environment:
      SCHEMA_REGISTRY_HOST_NAME: schema-registry
      SCHEMA_REGISTRY_KAFKASTORE_BOOTSTRAP_SERVERS: kafka:9092
    depends_on:
      kafka:
        condition: service_healthy

  # Trino for SQL queries
  trino:
    image: trinodb/trino:440@sha256:d4e5f6a7b8c9d0e1f2a3b4c5d6e7f8a9b0c1d2e3f4a5b6c7d8e9f0a1b2c3d4e5
    ports:
      - "8081:8080"
    volumes:
      - ./trino-catalog:/etc/trino/catalog
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/v1/info"]
      interval: 10s
      timeout: 5s
      retries: 10

  # MCP Data Platform
  mcp-data-platform:
    image: ghcr.io/txn2/mcp-data-platform:latest
    environment:
      DATAHUB_TOKEN: ${DATAHUB_TOKEN}
      DATABASE_URL: postgres://postgres:${POSTGRES_PASSWORD:-postgres}@postgres:5432/audit
      OAUTH_SIGNING_KEY: ${OAUTH_SIGNING_KEY}
      KEYCLOAK_CLIENT_SECRET: ${KEYCLOAK_CLIENT_SECRET}
    volumes:
      - ./platform.yaml:/etc/mcp/platform.yaml:ro
    command: ["--config", "/etc/mcp/platform.yaml", "--transport", "http", "--address", ":8080"]
    ports:
      - "8080:8080"
    depends_on:
      datahub-gms:
        condition: service_healthy
      trino:
        condition: service_healthy
      keycloak:
        condition: service_started

volumes:
  postgres_data:
  elasticsearch_data:
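The Compose file mounts ./init-multiple-dbs.sh into the Postgres container, but the guide does not show that script. The following is a minimal sketch of what it might contain, assuming its only job is to create one database per name listed in POSTGRES_MULTIPLE_DATABASES. The helper names split_db_list and create_db_sql are illustrative and not part of the Postgres image or of mcp-data-platform.

```shell
#!/bin/sh
# init-multiple-dbs.sh (hypothetical sketch): create every database named in
# the comma-separated POSTGRES_MULTIPLE_DATABASES variable.

# Print one database name per line from a comma-separated list,
# trimming surrounding spaces and dropping empty entries.
split_db_list() {
  printf '%s\n' "$1" | tr ',' '\n' | sed 's/^ *//; s/ *$//' | grep -v '^$' || true
}

# Print the SQL statement that creates a single database.
create_db_sql() {
  printf 'CREATE DATABASE "%s";\n' "$1"
}

# Only run against Postgres when the variable is set and psql is available,
# so the helpers above can also be sourced and checked outside the container.
if [ -n "${POSTGRES_MULTIPLE_DATABASES:-}" ] && command -v psql >/dev/null 2>&1; then
  for db in $(split_db_list "$POSTGRES_MULTIPLE_DATABASES"); do
    echo "Creating database: $db"
    create_db_sql "$db" | psql -v ON_ERROR_STOP=1 --username "$POSTGRES_USER"
  done
fi
```

The official Postgres image runs scripts in /docker-entrypoint-initdb.d only when the data directory is empty, so after changing the database list the postgres_data volume has to be recreated for the script to run again.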
Platform Configuration¶
Create platform.yaml:
server:
  name: mcp-data-platform
  transport: http
  address: ":8080"

toolkits:
  datahub:
    primary:
      url: http://datahub-gms:8080
      token: ${DATAHUB_TOKEN}
  trino:
    primary:
      host: trino
      port: 8080
      user: trino
      catalog: memory
      ssl: false

oauth:
  enabled: true
  issuer: "http://localhost:8080"
  signing_key: ${OAUTH_SIGNING_KEY}
  clients:
    - id: "claude-desktop"
      secret: "claude-secret"
      redirect_uris:
        - "http://localhost"
        - "http://127.0.0.1"
  upstream:
    issuer: "http://keycloak:8080/realms/mcp"
    client_id: "mcp-data-platform"
    client_secret: ${KEYCLOAK_CLIENT_SECRET}
    redirect_uri: "http://localhost:8080/oauth/callback"

personas:
  definitions:
    analyst:
      display_name: "Data Analyst"
      roles: ["analyst"]
      tools:
        allow: ["trino_*", "datahub_*"]
        deny: ["*_delete_*"]
    admin:
      display_name: "Administrator"
      roles: ["admin"]
      tools:
        allow: ["*"]
  default_persona: analyst

enrichment:
  trino_semantic_enrichment: true
  datahub_query_enrichment: true
  column_context_filtering: true  # Only enrich columns referenced in SQL (default: true)

audit:
  enabled: true
  log_tool_calls: true

database:
  dsn: ${DATABASE_URL}
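The analyst persona above combines an allow list with a deny list of wildcard patterns. The sketch below illustrates one plausible reading of that filter using shell glob matching. It assumes that the patterns behave like shell-style globs and that a deny match overrides an allow match; neither is confirmed by this guide. The tool names in the usage line are hypothetical placeholders.

```shell
#!/bin/sh
# Illustrative model of the "analyst" persona tool filter.
# Assumption: deny patterns are evaluated first and win over allow patterns.
analyst_tool_allowed() {
  # Deny list: ["*_delete_*"]
  case "$1" in
    *_delete_*) return 1 ;;
  esac
  # Allow list: ["trino_*", "datahub_*"]
  case "$1" in
    trino_*|datahub_*) return 0 ;;
  esac
  # Anything not explicitly allowed is rejected.
  return 1
}

# Usage (tool names are made up for the example):
for tool in trino_query datahub_delete_entity s3_list; do
  if analyst_tool_allowed "$tool"; then
    echo "$tool: allowed"
  else
    echo "$tool: denied"
  fi
done
```

Under these assumptions a tool that matches both lists is rejected, which is the conservative choice for a default persona.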
Start the Stack¶
# Generate secrets
export POSTGRES_PASSWORD=$(openssl rand -hex 32)  # hex output is safe to embed in the DATABASE_URL connection string
export OAUTH_SIGNING_KEY=$(openssl rand -base64 32)
export KEYCLOAK_CLIENT_SECRET=$(openssl rand -base64 32)
export DATAHUB_TOKEN="your-datahub-token"
# Start all services
docker compose up -d
# Check service status (repeat until every service reports healthy)
docker compose ps
# View logs
docker compose logs -f mcp-data-platform
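DataHub and its dependencies can take several minutes to become healthy, so scripted setups usually need to poll rather than check once. A minimal sketch of a retry helper follows; wait_until is a hypothetical name, and the /health path on port 8080 is taken from the health probe settings used elsewhere in this guide.

```shell
#!/bin/sh
# wait_until ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or ATTEMPTS tries have failed.
# Returns 0 on the first success and 1 if every attempt failed.
wait_until() {
  wu_attempts=$1
  wu_delay=$2
  shift 2
  wu_try=1
  while [ "$wu_try" -le "$wu_attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    # Sleep only when another attempt remains.
    if [ "$wu_try" -lt "$wu_attempts" ]; then
      sleep "$wu_delay"
    fi
    wu_try=$((wu_try + 1))
  done
  return 1
}

# Example (not executed here): poll the platform for up to five minutes.
#   wait_until 60 5 curl -fsS http://localhost:8080/health
```

The helper takes an arbitrary command, so the same function can poll Trino or Keycloak by swapping the URL.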
Local Development Workflow¶
For rapid iteration during development:
# Start dependencies only
docker compose up -d postgres elasticsearch kafka zookeeper schema-registry datahub-gms trino keycloak
# Run mcp-data-platform locally
go run ./cmd/mcp-data-platform --config platform.yaml --transport http --address :8080
Kubernetes/Helm (Production)¶
Production deployment using Helm charts with best practices for security, scaling, and monitoring.
Prerequisites¶
- Kubernetes 1.28+
- Helm 3.12+
- kubectl configured for your cluster
- TLS certificates (cert-manager recommended)
Helm Chart Structure¶
Create a Helm chart at charts/mcp-data-platform/:
charts/mcp-data-platform/
├── Chart.yaml
├── values.yaml
├── templates/
│   ├── _helpers.tpl
│   ├── deployment.yaml
│   ├── service.yaml
│   ├── configmap.yaml
│   ├── secret.yaml
│   ├── ingress.yaml
│   ├── hpa.yaml
│   ├── pdb.yaml
│   └── serviceaccount.yaml
Chart.yaml¶
apiVersion: v2
name: mcp-data-platform
description: Semantic data platform MCP server
type: application
version: 1.0.0
appVersion: "0.1.0"
values.yaml¶
replicaCount: 2

image:
  repository: ghcr.io/txn2/mcp-data-platform
  pullPolicy: IfNotPresent
  tag: "latest"

serviceAccount:
  create: true
  annotations: {}
  name: ""

podSecurityContext:
  runAsNonRoot: true
  runAsUser: 65534
  runAsGroup: 65534
  fsGroup: 65534

securityContext:
  allowPrivilegeEscalation: false
  readOnlyRootFilesystem: true
  capabilities:
    drop:
      - ALL

service:
  type: ClusterIP
  port: 8080

ingress:
  enabled: true
  className: nginx
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
    nginx.ingress.kubernetes.io/proxy-body-size: "10m"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
  hosts:
    - host: mcp.example.com
      paths:
        - path: /
          pathType: Prefix
  tls:
    - secretName: mcp-data-platform-tls
      hosts:
        - mcp.example.com

resources:
  limits:
    cpu: 1000m
    memory: 512Mi
  requests:
    cpu: 100m
    memory: 128Mi

autoscaling:
  enabled: true
  minReplicas: 2
  maxReplicas: 10
  targetCPUUtilizationPercentage: 70
  targetMemoryUtilizationPercentage: 80

pdb:
  enabled: true
  minAvailable: 1

# Platform configuration
config:
  server:
    name: mcp-data-platform
    transport: http
    address: ":8080"
    tls:
      enabled: false  # TLS terminates at ingress
  toolkits:
    datahub:
      primary:
        url: http://datahub-gms.datahub:8080
    trino:
      primary:
        host: trino.trino
        port: 8080
        user: mcp-platform
        catalog: hive
        ssl: false
  enrichment:
    trino_semantic_enrichment: true
    datahub_query_enrichment: true
    column_context_filtering: true  # Only enrich columns referenced in SQL (default: true)
  audit:
    enabled: true
    log_tool_calls: true

# External secrets (use external-secrets operator or sealed-secrets in production)
secrets:
  datahubToken: ""
  oauthSigningKey: ""
  keycloakClientSecret: ""
  databaseUrl: ""

# Prometheus metrics
metrics:
  enabled: true
  port: 9090
  path: /metrics

# Health checks
probes:
  liveness:
    httpGet:
      path: /health
      port: http
    initialDelaySeconds: 10
    periodSeconds: 10
  readiness:
    httpGet:
      path: /health
      port: http
    initialDelaySeconds: 5
    periodSeconds: 5
templates/deployment.yaml¶
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ include "mcp-data-platform.fullname" . }}
  labels:
    {{- include "mcp-data-platform.labels" . | nindent 4 }}
spec:
  {{- if not .Values.autoscaling.enabled }}
  replicas: {{ .Values.replicaCount }}
  {{- end }}
  selector:
    matchLabels:
      {{- include "mcp-data-platform.selectorLabels" . | nindent 6 }}
  template:
    metadata:
      annotations:
        checksum/config: {{ include (print $.Template.BasePath "/configmap.yaml") . | sha256sum }}
        checksum/secret: {{ include (print $.Template.BasePath "/secret.yaml") . | sha256sum }}
      labels:
        {{- include "mcp-data-platform.selectorLabels" . | nindent 8 }}
    spec:
      serviceAccountName: {{ include "mcp-data-platform.serviceAccountName" . }}
      securityContext:
        {{- toYaml .Values.podSecurityContext | nindent 8 }}
      containers:
        - name: {{ .Chart.Name }}
          securityContext:
            {{- toYaml .Values.securityContext | nindent 12 }}
          image: "{{ .Values.image.repository }}:{{ .Values.image.tag | default .Chart.AppVersion }}"
          imagePullPolicy: {{ .Values.image.pullPolicy }}
          args:
            - --config
            - /etc/mcp/platform.yaml
            - --transport
            - http
            - --address
            - :8080
          ports:
            - name: http
              containerPort: 8080
              protocol: TCP
            {{- if .Values.metrics.enabled }}
            - name: metrics
              containerPort: {{ .Values.metrics.port }}
              protocol: TCP
            {{- end }}
          livenessProbe:
            {{- toYaml .Values.probes.liveness | nindent 12 }}
          readinessProbe:
            {{- toYaml .Values.probes.readiness | nindent 12 }}
          resources:
            {{- toYaml .Values.resources | nindent 12 }}
          env:
            - name: DATAHUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: {{ include "mcp-data-platform.fullname" . }}
                  key: datahub-token
            - name: OAUTH_SIGNING_KEY
              valueFrom:
                secretKeyRef:
                  name: {{ include "mcp-data-platform.fullname" . }}
                  key: oauth-signing-key
            - name: KEYCLOAK_CLIENT_SECRET
              valueFrom:
                secretKeyRef:
                  name: {{ include "mcp-data-platform.fullname" . }}
                  key: keycloak-client-secret
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: {{ include "mcp-data-platform.fullname" . }}
                  key: database-url
          volumeMounts:
            - name: config
              mountPath: /etc/mcp
              readOnly: true
            - name: tmp
              mountPath: /tmp
      volumes:
        - name: config
          configMap:
            name: {{ include "mcp-data-platform.fullname" . }}
        - name: tmp
          emptyDir: {}
templates/hpa.yaml¶
{{- if .Values.autoscaling.enabled }}
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: {{ include "mcp-data-platform.fullname" . }}
  labels:
    {{- include "mcp-data-platform.labels" . | nindent 4 }}
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: {{ include "mcp-data-platform.fullname" . }}
  minReplicas: {{ .Values.autoscaling.minReplicas }}
  maxReplicas: {{ .Values.autoscaling.maxReplicas }}
  metrics:
    {{- if .Values.autoscaling.targetCPUUtilizationPercentage }}
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: {{ .Values.autoscaling.targetCPUUtilizationPercentage }}
    {{- end }}
    {{- if .Values.autoscaling.targetMemoryUtilizationPercentage }}
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: {{ .Values.autoscaling.targetMemoryUtilizationPercentage }}
    {{- end }}
{{- end }}
templates/pdb.yaml¶
{{- if .Values.pdb.enabled }}
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: {{ include "mcp-data-platform.fullname" . }}
  labels:
    {{- include "mcp-data-platform.labels" . | nindent 4 }}
spec:
  minAvailable: {{ .Values.pdb.minAvailable }}
  selector:
    matchLabels:
      {{- include "mcp-data-platform.selectorLabels" . | nindent 6 }}
{{- end }}
Deploy to Kubernetes¶
# Create namespace
kubectl create namespace mcp-data-platform
# Create secrets (use external-secrets or sealed-secrets in production).
# Note: templates/deployment.yaml reads these keys from the Secret named by the
# "mcp-data-platform.fullname" helper, so the Secret name used here must match
# that name (or the template must be pointed at this one).
kubectl create secret generic mcp-data-platform-secrets \
  --namespace mcp-data-platform \
  --from-literal=datahub-token="$DATAHUB_TOKEN" \
  --from-literal=oauth-signing-key="$OAUTH_SIGNING_KEY" \
  --from-literal=keycloak-client-secret="$KEYCLOAK_CLIENT_SECRET" \
  --from-literal=database-url="$DATABASE_URL"
# Install the chart
helm upgrade --install mcp-data-platform ./charts/mcp-data-platform \
  --namespace mcp-data-platform \
  --values values-production.yaml
# Verify deployment
kubectl get pods -n mcp-data-platform
kubectl get hpa -n mcp-data-platform
Production Checklist¶
Security¶
- TLS enabled for all external endpoints
- Secrets stored in external secrets manager (Vault, AWS Secrets Manager)
- Network policies restrict pod-to-pod communication
- Pod security context configured (non-root, read-only filesystem)
- Resource limits set for all containers
- OIDC configured with production identity provider
- API keys rotated regularly
High Availability¶
- Multiple replicas deployed (minimum 2)
- PodDisruptionBudget configured
- Anti-affinity rules spread pods across nodes
- Health checks configured for liveness and readiness
- HPA configured for automatic scaling
Monitoring¶
- Prometheus metrics enabled and scraped
- Grafana dashboards deployed
- Alerting rules configured
- Log aggregation set up (ELK, Loki)
- Distributed tracing enabled (Jaeger, Zipkin)
Operations¶
- Backup strategy for PostgreSQL audit logs
- Disaster recovery plan documented
- Runbooks for common issues
- On-call rotation established
MCP gateway (if enabled)¶
The gateway toolkit (kind mcp) has additional production requirements:
- ENCRYPTION_KEY is set (32 bytes of key material; accepted as 64 hex characters, 44-character base64, or 32 raw bytes). Required for at-rest encryption of stored credentials, OAuth access and refresh tokens (gateway_oauth_tokens), and PKCE state (oauth_pkce_states.code_verifier). Without it the platform logs a warning and stores those values in plaintext, which is not acceptable in production.
- PostgreSQL is reachable from every replica and shared. Multi-replica deployments rely on the Postgres-backed PKCE state store so an oauth-start on replica A and the redirect callback on replica B can find each other. The platform automatically uses Postgres when database.dsn is set.
- OAuth callback path (/api/v1/admin/oauth/callback) is reachable on the public-facing URL of the platform. The upstream OAuth provider redirects the operator's browser here after sign-in; the path is intentionally public (state token authenticates the callback) and must be allowed through any reverse-proxy auth.
- External Client App / OAuth client registration on each upstream lists the platform's /api/v1/admin/oauth/callback URL as an allowed redirect URI. Required for authorization_code grants (e.g. Salesforce Hosted MCP).
- ENCRYPTION_KEY rotation plan. Rotating the key invalidates every encrypted value in connection_instances, gateway_oauth_tokens, and oauth_pkce_states: gateway connections will lose their stored credentials, and authorization_code connections will need to be re-Connected through the portal. Plan accordingly.
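A pre-flight check can catch a malformed ENCRYPTION_KEY before rollout. The sketch below mirrors the three accepted encodings listed above using length and character-set checks only. It is not the platform's own parsing logic, and classify_encryption_key is a hypothetical helper name.

```shell
#!/bin/sh
# Classify a candidate ENCRYPTION_KEY by its apparent encoding.
# Prints one of: hex, base64, raw, invalid.
classify_encryption_key() {
  cek_key=$1
  cek_len=${#cek_key}
  if [ "$cek_len" -eq 64 ] && ! printf '%s' "$cek_key" | grep -q '[^0-9a-fA-F]'; then
    # 64 hex characters encode 32 bytes.
    echo hex
  elif [ "$cek_len" -eq 44 ] && printf '%s' "$cek_key" | grep -Eq '^[A-Za-z0-9+/]{43}=$'; then
    # Standard base64 of 32 bytes is 43 characters plus one '=' pad.
    echo base64
  elif [ "$cek_len" -eq 32 ]; then
    # Treated as 32 raw bytes (character count is only an approximation
    # of byte length for non-ASCII input).
    echo raw
  else
    echo invalid
  fi
}

# Example: fail a deploy script early when the key is unusable.
#   [ "$(classify_encryption_key "$ENCRYPTION_KEY")" != invalid ] || exit 1
```

A key in the hex form can be generated with openssl rand -hex 32, which yields exactly 64 hex characters.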
Upgrades and connected agents¶
The platform ships frequently, and each upgrade can change the tool contract (new
tools, new parameters, updated descriptions). How a connected agent picks up the new
contract depends on its client, because MCP delivers a changed tool list in-band only
on a live session; a binary upgrade is a new process, so the agent must reconnect to
re-handshake (initialize + tools/list) against the new build.
What the server does on shutdown¶
On SIGTERM (a rolling deploy), the server:
- Marks readiness draining so the load balancer stops routing new connections, then waits server.shutdown.pre_shutdown_delay for deregistration.
- Drains in-flight HTTP requests, and after a short settle closes live MCP sessions. Long-lived SSE and streamable-HTTP streams never go idle on their own, so until the session is closed the agent stays on the old build. Closing it drops the stream so the client reconnects to a new pod and re-fetches the tool list. The close is graceful: an idle session drops immediately, a session with an in-flight tool call is allowed to finish, bounded by the grace period (after which process exit drops what remains).
Relevant settings under server.shutdown are pre_shutdown_delay and
grace_period; size them so the full sequence fits inside the pod's
terminationGracePeriodSeconds.
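The sizing rule can be checked with simple arithmetic before a rollout. The sketch below assumes all values are expressed in whole seconds. The fourth argument is a hypothetical safety margin standing in for the settle time, whose actual length this guide does not state; shutdown_fits is an illustrative name.

```shell
#!/bin/sh
# shutdown_fits PRE_SHUTDOWN_DELAY GRACE_PERIOD TERMINATION_GRACE [MARGIN]
# Succeeds when the shutdown sequence fits inside the pod's
# terminationGracePeriodSeconds. MARGIN defaults to 5 seconds (an assumption).
shutdown_fits() {
  sf_pre=$1
  sf_grace=$2
  sf_term=$3
  sf_margin=${4:-5}
  [ $((sf_pre + sf_grace + sf_margin)) -le "$sf_term" ]
}

# Example: 10s deregistration delay, 30s grace period, 60s pod budget.
if shutdown_fits 10 30 60; then
  echo "shutdown sequence fits"
else
  echo "increase terminationGracePeriodSeconds or shorten the shutdown settings"
fi
```

If the check fails, Kubernetes sends SIGKILL when terminationGracePeriodSeconds expires, which cuts off any tool call still in flight.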
Per-client behavior¶
| Client | On upgrade |
|---|---|
| Claude Code | Automatic. It honors notifications/tools/list_changed and auto-reconnects HTTP/SSE servers (exponential backoff). When the old session is closed it reconnects to the new build and re-fetches the tool list with no user action. |
| Claude Desktop | Requires a full app restart to pick up a changed tool list; it has no in-session refresh or reconnect action today. |
| claude.ai managed web connector | Caches the tool schema at the connector level; a connector re-sync (remove and re-add, or the workspace refresh) is needed to pick up changes. |
Keep upgrades safe¶
Because a client may still be running on a cached contract briefly after a deploy, keep tool/schema changes additive: adding a new optional parameter or a new tool is safe (a cached client simply does not see it until it refreshes). Renaming or removing a parameter, or removing a tool, breaks a client mid-session; deprecate across a release before removing.
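One way to enforce the additive-only rule in CI is to compare the tool names exposed by the old and new builds and fail when any name disappears. The sketch below assumes each build's tool names have already been exported to a text file, one name per line; how that export happens is outside this example. The function name and the tool names in the usage comment are hypothetical.

```shell
#!/bin/sh
# removed_tools OLD_FILE NEW_FILE
# Prints every tool name present in OLD_FILE but missing from NEW_FILE.
# Empty output means the change is additive at the tool level.
removed_tools() {
  rt_old=$(mktemp)
  rt_new=$(mktemp)
  sort -u "$1" > "$rt_old"
  sort -u "$2" > "$rt_new"
  # comm -23 keeps lines unique to the first (old) file.
  comm -23 "$rt_old" "$rt_new"
  rm -f "$rt_old" "$rt_new"
}

# Example CI gate (file names are placeholders):
#   if [ -n "$(removed_tools tools-old.txt tools-new.txt)" ]; then
#     echo "breaking change: tools were removed" >&2
#     exit 1
#   fi
```

This covers removed or renamed tools only; parameter-level changes would need a comparison of the tool schemas themselves.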
Monitoring Setup¶
Prometheus ServiceMonitor¶
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: mcp-data-platform
  namespace: mcp-data-platform
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: mcp-data-platform
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics
Grafana Dashboard¶
Key metrics to monitor:
- Request rate: sum(rate(mcp_requests_total[5m]))
- Error rate: sum(rate(mcp_requests_total{status="error"}[5m]))
- Latency: histogram_quantile(0.99, rate(mcp_request_duration_seconds_bucket[5m]))
- Enrichment latency: histogram_quantile(0.99, rate(mcp_enrichment_duration_seconds_bucket[5m]))
- Active connections: mcp_active_connections
Scaling Considerations¶
Horizontal Scaling¶
mcp-data-platform is stateless and scales horizontally. Key considerations:
- Connection pooling: Each replica maintains its own connections to DataHub/Trino
- Cache coordination: Semantic cache is per-instance; consider Redis for shared caching at scale
- Load balancing: Use sticky sessions for SSE connections
Vertical Scaling¶
Increase resources for:
- High query volume: More CPU for request processing
- Large result sets: More memory for enrichment processing
- Many concurrent connections: More memory for connection state
Bottleneck Analysis¶
Common bottlenecks and solutions:
| Bottleneck | Symptom | Solution |
|---|---|---|
| DataHub API | High enrichment latency | Enable caching, increase DataHub resources |
| Trino queries | Timeout errors | Tune Trino cluster, add query limits |
| PostgreSQL audit | Write latency | Use async writes, add replicas |
| Network | Connection timeouts | Deploy closer to data sources |