Deployment Guide¶

This guide covers deploying mcp-data-platform in various environments, from local development to production Kubernetes clusters.

Deployment Options¶

Environment	Best For	Complexity
Docker Compose	Development, small teams, testing	Low
Kubernetes/Helm	Production, multi-user, enterprise	Medium

Docker Compose (Development/Small Teams)¶

A full-stack deployment that runs DataHub GMS (with Elasticsearch, Kafka, ZooKeeper, and Schema Registry), Trino, Keycloak, PostgreSQL, and mcp-data-platform.

Prerequisites¶

Docker 24.0+
Docker Compose 2.20+
16GB RAM minimum (DataHub requires significant memory)
20GB free disk space

Full-Stack Example¶

Create a docker-compose.yml:

services:
  # Images are pinned by digest; confirm each digest against its registry before use.
  # PostgreSQL for metadata storage
  postgres:
    image: postgres:16-alpine@sha256:acf5271bce6b4b62e352341e3b175c2b1e9e0b6f6e3f2e7e3b7f8c9d0e1f2a3b
    environment:
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD:-postgres}
      POSTGRES_MULTIPLE_DATABASES: datahub,keycloak,audit
    volumes:
      - postgres_data:/var/lib/postgresql/data
      - ./init-multiple-dbs.sh:/docker-entrypoint-initdb.d/init-multiple-dbs.sh
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 10s
      timeout: 5s
      retries: 5

  # Keycloak for authentication
  keycloak:
    image: quay.io/keycloak/keycloak:24.0@sha256:b3c4a5d6e7f8a9b0c1d2e3f4a5b6c7d8e9f0a1b2c3d4e5f6a7b8c9d0e1f2a3b4
    command: start-dev --import-realm
    environment:
      KC_DB: postgres
      KC_DB_URL: jdbc:postgresql://postgres:5432/keycloak
      KC_DB_USERNAME: postgres
      KC_DB_PASSWORD: ${POSTGRES_PASSWORD:-postgres}
      KEYCLOAK_ADMIN: admin
      KEYCLOAK_ADMIN_PASSWORD: ${KEYCLOAK_ADMIN_PASSWORD:-admin}
    volumes:
      - ./keycloak-realm.json:/opt/keycloak/data/import/realm.json
    ports:
      - "8180:8080"
    depends_on:
      postgres:
        condition: service_healthy

  # DataHub GMS (Metadata Service)
  datahub-gms:
    image: acryldata/datahub-gms:v0.13.0@sha256:c1d2e3f4a5b6c7d8e9f0a1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6e7f8a9b0c1d2
    environment:
      DATAHUB_GMS_HOST: datahub-gms
      DATAHUB_GMS_PORT: 8080
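      EBEAN_DATASOURCE_HOST: postgres:5432
      # DataHub assumes MySQL unless the JDBC URL and driver are set (verify against your DataHub version)
      EBEAN_DATASOURCE_URL: jdbc:postgresql://postgres:5432/datahub
      EBEAN_DATASOURCE_DRIVER: org.postgresql.Driver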
      EBEAN_DATASOURCE_USERNAME: postgres
      EBEAN_DATASOURCE_PASSWORD: ${POSTGRES_PASSWORD:-postgres}
      ELASTICSEARCH_HOST: elasticsearch
      ELASTICSEARCH_PORT: 9200
      KAFKA_BOOTSTRAP_SERVER: kafka:9092
      KAFKA_SCHEMAREGISTRY_URL: http://schema-registry:8081
    depends_on:
      postgres:
        condition: service_healthy
      elasticsearch:
        condition: service_healthy
      kafka:
        condition: service_healthy
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
      timeout: 10s
      retries: 5

  # Elasticsearch for DataHub search
  elasticsearch:
    image: elasticsearch:7.17.18@sha256:a1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6e7f8a9b0c1d2e3f4a5b6c7d8e9f0a1b2
    environment:
      - discovery.type=single-node
      - xpack.security.enabled=false
      - ES_JAVA_OPTS=-Xms512m -Xmx512m
    volumes:
      - elasticsearch_data:/usr/share/elasticsearch/data
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:9200/_cluster/health"]
      interval: 10s
      timeout: 5s
      retries: 10

  # Kafka for DataHub events
  kafka:
    image: confluentinc/cp-kafka:7.6.0@sha256:b1c2d3e4f5a6b7c8d9e0f1a2b3c4d5e6f7a8b9c0d1e2f3a4b5c6d7e8f9a0b1c2
    environment:
      KAFKA_BROKER_ID: 1
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:9092
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
    depends_on:
      - zookeeper
    healthcheck:
      test: ["CMD", "kafka-topics", "--bootstrap-server", "kafka:9092", "--list"]
      interval: 30s
      timeout: 10s
      retries: 5

  # Zookeeper for Kafka
  zookeeper:
    image: confluentinc/cp-zookeeper:7.6.0@sha256:a2b3c4d5e6f7a8b9c0d1e2f3a4b5c6d7e8f9a0b1c2d3e4f5a6b7c8d9e0f1a2b3
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181
      ZOOKEEPER_TICK_TIME: 2000

  # Schema Registry for Kafka
  schema-registry:
    image: confluentinc/cp-schema-registry:7.6.0@sha256:c3d4e5f6a7b8c9d0e1f2a3b4c5d6e7f8a9b0c1d2e3f4a5b6c7d8e9f0a1b2c3d4
    environment:
      SCHEMA_REGISTRY_HOST_NAME: schema-registry
      SCHEMA_REGISTRY_KAFKASTORE_BOOTSTRAP_SERVERS: kafka:9092
    depends_on:
      kafka:
        condition: service_healthy

  # Trino for SQL queries
  trino:
    image: trinodb/trino:440@sha256:d4e5f6a7b8c9d0e1f2a3b4c5d6e7f8a9b0c1d2e3f4a5b6c7d8e9f0a1b2c3d4e5
    ports:
      - "8081:8080"
    volumes:
      - ./trino-catalog:/etc/trino/catalog
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/v1/info"]
      interval: 10s
      timeout: 5s
      retries: 10

  # MCP Data Platform
  mcp-data-platform:
    image: ghcr.io/txn2/mcp-data-platform:latest
    environment:
      DATAHUB_TOKEN: ${DATAHUB_TOKEN}
      DATABASE_URL: postgres://postgres:${POSTGRES_PASSWORD:-postgres}@postgres:5432/audit
      OAUTH_SIGNING_KEY: ${OAUTH_SIGNING_KEY}
      KEYCLOAK_CLIENT_SECRET: ${KEYCLOAK_CLIENT_SECRET}
    volumes:
      - ./platform.yaml:/etc/mcp/platform.yaml:ro
    command: ["--config", "/etc/mcp/platform.yaml", "--transport", "http", "--address", ":8080"]
    ports:
      - "8080:8080"
    depends_on:
      datahub-gms:
        condition: service_healthy
      trino:
        condition: service_healthy
      keycloak:
        condition: service_started

volumes:
  postgres_data:
  elasticsearch_data:
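
The compose file mounts ./init-multiple-dbs.sh into the PostgreSQL container but the guide does not show its contents, and the official postgres image does not interpret POSTGRES_MULTIPLE_DATABASES on its own. A minimal sketch of such a script follows, assuming the comma-separated format used above; the function name create_db_sql is hypothetical, and the script does not create per-database users or grants.

```shell
#!/usr/bin/env bash
# init-multiple-dbs.sh: illustrative sketch, not the project's own script.
# Runs once, when the PostgreSQL data directory is first initialized.
set -e

# Print the SQL statement that creates one database.
create_db_sql() {
  printf 'CREATE DATABASE "%s";\n' "$1"
}

# Create every database named in the comma-separated POSTGRES_MULTIPLE_DATABASES.
if [ -n "${POSTGRES_MULTIPLE_DATABASES:-}" ]; then
  for db in $(printf '%s' "$POSTGRES_MULTIPLE_DATABASES" | tr ',' ' '); do
    echo "Creating database: $db"
    create_db_sql "$db" | psql -v ON_ERROR_STOP=1 --username "$POSTGRES_USER"
  done
fi
```

Scripts in /docker-entrypoint-initdb.d run only when the data volume is empty, so changing the database list later requires creating the new databases by hand or recreating the postgres_data volume.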

Platform Configuration¶

Create platform.yaml:

server:
  name: mcp-data-platform
  transport: http
  address: ":8080"

toolkits:
  datahub:
    primary:
      url: http://datahub-gms:8080
      token: ${DATAHUB_TOKEN}

  trino:
    primary:
      host: trino
      port: 8080
      user: trino
      catalog: memory
      ssl: false

oauth:
  enabled: true
  issuer: "http://localhost:8080"
  signing_key: ${OAUTH_SIGNING_KEY}
  clients:
    - id: "claude-desktop"
      secret: "claude-secret"  # Example value for local use only; generate a real secret elsewhere
      redirect_uris:
        - "http://localhost"
        - "http://127.0.0.1"
  upstream:
    issuer: "http://keycloak:8080/realms/mcp"
    client_id: "mcp-data-platform"
    client_secret: ${KEYCLOAK_CLIENT_SECRET}
    redirect_uri: "http://localhost:8080/oauth/callback"

personas:
  definitions:
    analyst:
      display_name: "Data Analyst"
      roles: ["analyst"]
      tools:
        allow: ["trino_*", "datahub_*"]
        deny: ["*_delete_*"]
    admin:
      display_name: "Administrator"
      roles: ["admin"]
      tools:
        allow: ["*"]
  default_persona: analyst

enrichment:
  trino_semantic_enrichment: true
  datahub_query_enrichment: true
  column_context_filtering: true   # Only enrich columns referenced in SQL (default: true)

audit:
  enabled: true
  log_tool_calls: true

database:
  dsn: ${DATABASE_URL}

Start the Stack¶

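# Generate secrets
# The PostgreSQL password is embedded in DATABASE_URL, so use hex output:
# base64 can contain /, + and =, which break URL parsing.
export POSTGRES_PASSWORD=$(openssl rand -hex 32)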
export OAUTH_SIGNING_KEY=$(openssl rand -base64 32)
export KEYCLOAK_CLIENT_SECRET=$(openssl rand -base64 32)
export DATAHUB_TOKEN="your-datahub-token"

# Start all services
docker compose up -d

# Check service health (re-run until every service reports healthy)
docker compose ps

# View logs
docker compose logs -f mcp-data-platform
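
DATAHUB_TOKEN, OAUTH_SIGNING_KEY, and KEYCLOAK_CLIENT_SECRET have no defaults in the compose file, so a variable that is unset in the current shell reaches the container as an empty string. The following pre-flight check is a small sketch of one way to catch that before starting the stack; the helper name missing_env is hypothetical.

```shell
#!/usr/bin/env bash
# missing_env NAME...: print each named variable that is unset or empty.
missing_env() {
  local name
  for name in "$@"; do
    # ${!name} is bash indirect expansion: the value of the variable called $name.
    if [ -z "${!name:-}" ]; then
      echo "$name"
    fi
  done
}

missing=$(missing_env DATAHUB_TOKEN OAUTH_SIGNING_KEY KEYCLOAK_CLIENT_SECRET)
if [ -n "$missing" ]; then
  echo "Set these variables before running docker compose up:" $missing >&2
fi
```

Because the secrets are only exported into the current shell, run the check (and docker compose) from the same session that generated them.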

Local Development Workflow¶

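For rapid iteration, run the dependencies in Docker and the server from source. Compose service names such as datahub-gms, trino, and postgres resolve only inside the Compose network, so a server running on the host needs a platform.yaml that targets published host ports (for example, Trino on localhost:8081 and Keycloak on localhost:8180) and a port mapping for any service that does not publish one: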

# Start dependencies only
docker compose up -d postgres elasticsearch kafka zookeeper schema-registry datahub-gms trino keycloak

# Run mcp-data-platform locally
go run ./cmd/mcp-data-platform --config platform.yaml --transport http --address :8080

Kubernetes/Helm (Production)¶

A production deployment using a Helm chart, with a hardened pod security context, horizontal autoscaling, a PodDisruptionBudget, and Prometheus metrics.

Prerequisites¶

Kubernetes 1.28+
Helm 3.12+
kubectl configured for your cluster
TLS certificates (cert-manager recommended)

Helm Chart Structure¶

Create a Helm chart at charts/mcp-data-platform/:

charts/mcp-data-platform/
├── Chart.yaml
├── values.yaml
├── templates/
│   ├── _helpers.tpl
│   ├── deployment.yaml
│   ├── service.yaml
│   ├── configmap.yaml
│   ├── secret.yaml
│   ├── ingress.yaml
│   ├── hpa.yaml
│   ├── pdb.yaml
│   └── serviceaccount.yaml

Chart.yaml¶

apiVersion: v2
name: mcp-data-platform
description: Semantic data platform MCP server
type: application
version: 1.0.0
appVersion: "0.1.0"

values.yaml¶

replicaCount: 2

image:
  repository: ghcr.io/txn2/mcp-data-platform
  pullPolicy: IfNotPresent
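  tag: "latest"  # Pin a release tag in production: with IfNotPresent, nodes that already cached "latest" never re-pull it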

serviceAccount:
  create: true
  annotations: {}
  name: ""

podSecurityContext:
  runAsNonRoot: true
  runAsUser: 65534
  runAsGroup: 65534
  fsGroup: 65534

securityContext:
  allowPrivilegeEscalation: false
  readOnlyRootFilesystem: true
  capabilities:
    drop:
      - ALL

service:
  type: ClusterIP
  port: 8080

ingress:
  enabled: true
  className: nginx
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
    nginx.ingress.kubernetes.io/proxy-body-size: "10m"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
  hosts:
    - host: mcp.example.com
      paths:
        - path: /
          pathType: Prefix
  tls:
    - secretName: mcp-data-platform-tls
      hosts:
        - mcp.example.com

resources:
  limits:
    cpu: 1000m
    memory: 512Mi
  requests:
    cpu: 100m
    memory: 128Mi

autoscaling:
  enabled: true
  minReplicas: 2
  maxReplicas: 10
  targetCPUUtilizationPercentage: 70
  targetMemoryUtilizationPercentage: 80

pdb:
  enabled: true
  minAvailable: 1

# Platform configuration
config:
  server:
    name: mcp-data-platform
    transport: http
    address: ":8080"
    tls:
      enabled: false  # TLS terminates at ingress

  toolkits:
    datahub:
      primary:
        url: http://datahub-gms.datahub:8080
    trino:
      primary:
        host: trino.trino
        port: 8080
        user: mcp-platform
        catalog: hive
        ssl: false

  enrichment:
    trino_semantic_enrichment: true
    datahub_query_enrichment: true
    column_context_filtering: true   # Only enrich columns referenced in SQL (default: true)

  audit:
    enabled: true
    log_tool_calls: true

# External secrets (use external-secrets operator or sealed-secrets in production)
secrets:
  datahubToken: ""
  oauthSigningKey: ""
  keycloakClientSecret: ""
  databaseUrl: ""

# Prometheus metrics
metrics:
  enabled: true
  port: 9090
  path: /metrics

# Health checks
probes:
  liveness:
    httpGet:
      path: /health
      port: http
    initialDelaySeconds: 10
    periodSeconds: 10
  readiness:
    httpGet:
      path: /health
      port: http
    initialDelaySeconds: 5
    periodSeconds: 5

templates/deployment.yaml¶

apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ include "mcp-data-platform.fullname" . }}
  labels:
    {{- include "mcp-data-platform.labels" . | nindent 4 }}
spec:
  {{- if not .Values.autoscaling.enabled }}
  replicas: {{ .Values.replicaCount }}
  {{- end }}
  selector:
    matchLabels:
      {{- include "mcp-data-platform.selectorLabels" . | nindent 6 }}
  template:
    metadata:
      annotations:
        checksum/config: {{ include (print $.Template.BasePath "/configmap.yaml") . | sha256sum }}
        checksum/secret: {{ include (print $.Template.BasePath "/secret.yaml") . | sha256sum }}
      labels:
        {{- include "mcp-data-platform.selectorLabels" . | nindent 8 }}
    spec:
      serviceAccountName: {{ include "mcp-data-platform.serviceAccountName" . }}
      securityContext:
        {{- toYaml .Values.podSecurityContext | nindent 8 }}
      containers:
        - name: {{ .Chart.Name }}
          securityContext:
            {{- toYaml .Values.securityContext | nindent 12 }}
          image: "{{ .Values.image.repository }}:{{ .Values.image.tag | default .Chart.AppVersion }}"
          imagePullPolicy: {{ .Values.image.pullPolicy }}
          args:
            - --config
            - /etc/mcp/platform.yaml
            - --transport
            - http
            - --address
            - :8080
          ports:
            - name: http
              containerPort: 8080
              protocol: TCP
            {{- if .Values.metrics.enabled }}
            - name: metrics
              containerPort: {{ .Values.metrics.port }}
              protocol: TCP
            {{- end }}
          livenessProbe:
            {{- toYaml .Values.probes.liveness | nindent 12 }}
          readinessProbe:
            {{- toYaml .Values.probes.readiness | nindent 12 }}
          resources:
            {{- toYaml .Values.resources | nindent 12 }}
          env:
            - name: DATAHUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: {{ include "mcp-data-platform.fullname" . }}
                  key: datahub-token
            - name: OAUTH_SIGNING_KEY
              valueFrom:
                secretKeyRef:
                  name: {{ include "mcp-data-platform.fullname" . }}
                  key: oauth-signing-key
            - name: KEYCLOAK_CLIENT_SECRET
              valueFrom:
                secretKeyRef:
                  name: {{ include "mcp-data-platform.fullname" . }}
                  key: keycloak-client-secret
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: {{ include "mcp-data-platform.fullname" . }}
                  key: database-url
          volumeMounts:
            - name: config
              mountPath: /etc/mcp
              readOnly: true
            - name: tmp
              mountPath: /tmp
      volumes:
        - name: config
          configMap:
            name: {{ include "mcp-data-platform.fullname" . }}
        - name: tmp
          emptyDir: {}

templates/hpa.yaml¶

{{- if .Values.autoscaling.enabled }}
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: {{ include "mcp-data-platform.fullname" . }}
  labels:
    {{- include "mcp-data-platform.labels" . | nindent 4 }}
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: {{ include "mcp-data-platform.fullname" . }}
  minReplicas: {{ .Values.autoscaling.minReplicas }}
  maxReplicas: {{ .Values.autoscaling.maxReplicas }}
  metrics:
    {{- if .Values.autoscaling.targetCPUUtilizationPercentage }}
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: {{ .Values.autoscaling.targetCPUUtilizationPercentage }}
    {{- end }}
    {{- if .Values.autoscaling.targetMemoryUtilizationPercentage }}
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: {{ .Values.autoscaling.targetMemoryUtilizationPercentage }}
    {{- end }}
{{- end }}
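
To make the autoscaling targets concrete, the sketch below applies the replica calculation described in the Kubernetes HPA documentation, desiredReplicas = ceil(currentReplicas × currentUtilization / targetUtilization), to the values from values.yaml. The load figure is made up for illustration, and the real controller additionally applies a tolerance band and clamps the result to minReplicas and maxReplicas.

```shell
#!/usr/bin/env bash
# desired_replicas CURRENT_REPLICAS CURRENT_UTILIZATION_PCT TARGET_UTILIZATION_PCT
# Prints ceil(current * utilization / target) using integer arithmetic.
desired_replicas() {
  local current=$1 utilization=$2 target=$3
  echo $(( (current * utilization + target - 1) / target ))
}

# Hypothetical load: 2 replicas averaging 105% of their CPU request, target 70%.
desired_replicas 2 105 70

# Utilization is measured against requests, not limits. With requests.memory at
# 128Mi, an 80% target corresponds to about this many Mi of average usage per pod.
echo $(( 128 * 80 / 100 ))
```

This prints 3 and 102. Since the memory request (128Mi) is well below the limit (512Mi), the memory target can trigger scale-out long before any pod approaches its limit; raise the request if that is not the intent.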

templates/pdb.yaml¶

{{- if .Values.pdb.enabled }}
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: {{ include "mcp-data-platform.fullname" . }}
  labels:
    {{- include "mcp-data-platform.labels" . | nindent 4 }}
spec:
  minAvailable: {{ .Values.pdb.minAvailable }}
  selector:
    matchLabels:
      {{- include "mcp-data-platform.selectorLabels" . | nindent 6 }}
{{- end }}

Deploy to Kubernetes¶

# Create namespace
kubectl create namespace mcp-data-platform

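# Create secrets (use external-secrets or sealed-secrets in production).
# Note: templates/deployment.yaml reads these keys from the Secret named by the
# chart's fullname helper, not from mcp-data-platform-secrets. Either supply the
# values through the chart's secrets block or point the secretKeyRef entries at this Secret.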
kubectl create secret generic mcp-data-platform-secrets \
  --namespace mcp-data-platform \
  --from-literal=datahub-token="$DATAHUB_TOKEN" \
  --from-literal=oauth-signing-key="$OAUTH_SIGNING_KEY" \
  --from-literal=keycloak-client-secret="$KEYCLOAK_CLIENT_SECRET" \
  --from-literal=database-url="$DATABASE_URL"

# Install the chart
helm upgrade --install mcp-data-platform ./charts/mcp-data-platform \
  --namespace mcp-data-platform \
  --values values-production.yaml

# Verify deployment
kubectl get pods -n mcp-data-platform
kubectl get hpa -n mcp-data-platform

Production Checklist¶

Security¶

TLS enabled for all external endpoints
Secrets stored in external secrets manager (Vault, AWS Secrets Manager)
Network policies restrict pod-to-pod communication
Pod security context configured (non-root, read-only filesystem)
Resource limits set for all containers
OIDC configured with production identity provider
API keys rotated regularly

High Availability¶

Multiple replicas deployed (minimum 2)
PodDisruptionBudget configured
Anti-affinity rules spread pods across nodes
Health checks configured for liveness and readiness
HPA configured for automatic scaling

Monitoring¶

Prometheus metrics enabled and scraped
Grafana dashboards deployed
Alerting rules configured
Log aggregation set up (ELK, Loki)
Distributed tracing enabled (Jaeger, Zipkin)

Operations¶

Backup strategy for PostgreSQL audit logs
Disaster recovery plan documented
Runbooks for common issues
On-call rotation established

MCP gateway (if enabled)¶

The gateway toolkit (kind mcp) has additional production requirements:

ENCRYPTION_KEY is set (32 bytes of key material; accepted as 64 hex characters, 44-character base64, or 32 raw bytes). Required for at-rest encryption of stored credentials, OAuth access and refresh tokens (gateway_oauth_tokens), and PKCE state (oauth_pkce_states.code_verifier). Without it the platform logs a warning and stores those values in plaintext — not acceptable in production.
PostgreSQL is reachable from every replica and shared. Multi-replica deployments rely on the Postgres-backed PKCE state store so an oauth-start on replica A and the redirect callback on replica B can find each other. The platform automatically uses Postgres when database.dsn is set.
OAuth callback path (/api/v1/admin/oauth/callback) is reachable on the public-facing URL of the platform. The upstream OAuth provider redirects the operator's browser here after sign-in; the path is intentionally public (state token authenticates the callback) and must be allowed through any reverse-proxy auth.
External Client App / OAuth client registration on each upstream lists the platform's /api/v1/admin/oauth/callback URL as an allowed redirect URI. Required for authorization_code grants (e.g. Salesforce Hosted MCP).
ENCRYPTION_KEY rotation plan. Rotating the key invalidates every encrypted value in connection_instances, gateway_oauth_tokens, and oauth_pkce_states — gateway connections will lose their stored credentials and authorization_code connections will need to be re-Connected through the portal. Plan accordingly.
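
The accepted ENCRYPTION_KEY encodings can be checked before deployment. The sketch below classifies a candidate key by the lengths listed above; the helper name key_format is hypothetical, the base64 check assumes the standard alphabet with padding, and the raw check counts characters, which matches bytes only for single-byte content. A key could be generated with, for example, openssl rand -hex 32, as the guide does for other secrets.

```shell
#!/usr/bin/env bash
# key_format KEY: print hex, base64, or raw for a 32-byte key; invalid otherwise.
key_format() {
  local key=$1
  if [[ ${#key} -eq 64 && $key =~ ^[0-9a-fA-F]+$ ]]; then
    echo hex       # 32 bytes as 64 hex characters
  elif [[ ${#key} -eq 44 && $key =~ ^[A-Za-z0-9+/]{43}=$ ]]; then
    echo base64    # 32 bytes as 43 base64 characters plus one padding '='
  elif [[ ${#key} -eq 32 ]]; then
    echo raw       # 32 characters taken as raw key material
  else
    echo invalid
    return 1
  fi
}

# Example: key_format "$ENCRYPTION_KEY" || echo "ENCRYPTION_KEY is malformed" >&2
```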

Upgrades and connected agents¶

The platform ships frequently, and each upgrade can change the tool contract (new tools, new parameters, updated descriptions). How a connected agent picks up the new contract depends on its client. MCP delivers a changed tool list in-band only on a live session, and a binary upgrade starts a new process, so the agent must reconnect and re-handshake (initialize + tools/list) against the new build.

What the server does on shutdown¶

On SIGTERM (sent to each pod during a rolling deploy), the server:

Marks readiness draining so the load balancer stops routing new connections, then waits server.shutdown.pre_shutdown_delay for deregistration.
Drains in-flight HTTP requests and, after a short settle period, closes live MCP sessions. Long-lived SSE and streamable-HTTP streams never go idle on their own, so until the session is closed the agent stays on the old build. Closing the session drops the stream, so the client reconnects to a new pod and re-fetches the tool list. The close is graceful: an idle session drops immediately, while a session with an in-flight tool call is allowed to finish, bounded by the grace period (after which process exit drops whatever remains).

Relevant settings under server.shutdown are pre_shutdown_delay and grace_period; size them so the full sequence fits inside the pod's terminationGracePeriodSeconds.
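
The sizing rule can be checked with simple arithmetic. The sketch below uses made-up values in seconds and the Kubernetes default terminationGracePeriodSeconds of 30; how durations are written in platform.yaml is not shown in this guide, so treat the variable names and units here as assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical settings, all in seconds.
pre_shutdown_delay=5            # server.shutdown.pre_shutdown_delay
grace_period=20                 # server.shutdown.grace_period
termination_grace_period=30     # pod spec terminationGracePeriodSeconds (Kubernetes default)

# shutdown_fits DELAY GRACE BUDGET: succeed when the sequence ends before the
# kubelet's deadline, leaving at least one second of headroom.
shutdown_fits() {
  local delay=$1 grace=$2 budget=$3
  (( delay + grace < budget ))
}

if shutdown_fits "$pre_shutdown_delay" "$grace_period" "$termination_grace_period"; then
  echo "ok: $(( termination_grace_period - pre_shutdown_delay - grace_period ))s headroom"
else
  echo "shutdown sequence exceeds terminationGracePeriodSeconds" >&2
fi
```

With these values the script prints "ok: 5s headroom". Headroom matters because the settle period and process exit also consume time, and the kubelet sends SIGKILL once the termination grace period elapses.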

Per-client behavior¶

Client	On upgrade
Claude Code	Automatic. It honors notifications/tools/list_changed and auto-reconnects HTTP/SSE servers (exponential backoff). When the old session is closed it reconnects to the new build and re-fetches the tool list with no user action.
Claude Desktop	Requires a full app restart to pick up a changed tool list; it has no in-session refresh or reconnect action today.
claude.ai managed web connector	Caches the tool schema at the connector level; a connector re-sync (remove and re-add, or the workspace refresh) is needed to pick up changes.

Keep upgrades safe¶

Because a client may still be running on a cached contract briefly after a deploy, keep tool/schema changes additive: adding a new optional parameter or a new tool is safe (a cached client simply does not see it until it refreshes). Renaming or removing a parameter, or removing a tool, breaks a client mid-session; deprecate across a release before removing.
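
One way to enforce the additive rule for tools is to diff the tool names of two builds before release. The sketch below assumes each build's tools/list names have already been captured to a text file, one name per line; the helper name removed_tools and the tool names are hypothetical, and the check covers removed tools only, not renamed or removed parameters.

```shell
#!/usr/bin/env bash
# removed_tools OLD_FILE NEW_FILE
# Print tool names present in OLD_FILE but missing from NEW_FILE.
removed_tools() {
  comm -23 <(sort -u "$1") <(sort -u "$2")
}

old=$(mktemp)
new=$(mktemp)
printf '%s\n' trino_query datahub_search > "$old"
printf '%s\n' trino_query datahub_search trino_explain > "$new"

# Adding trino_explain is additive, so nothing is reported as removed.
if [ -z "$(removed_tools "$old" "$new")" ]; then
  echo "no tools removed"
fi
```

Swapping the arguments reports trino_explain as removed, which is the breaking direction the deprecation window is meant to cover.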

Monitoring Setup¶

Prometheus ServiceMonitor¶

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: mcp-data-platform
  namespace: mcp-data-platform
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: mcp-data-platform
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics

Grafana Dashboard¶

Key metrics to monitor:

Request rate: sum(rate(mcp_requests_total[5m]))
Error rate: sum(rate(mcp_requests_total{status="error"}[5m]))
Latency (p99): histogram_quantile(0.99, rate(mcp_request_duration_seconds_bucket[5m]))
Enrichment latency (p99): histogram_quantile(0.99, rate(mcp_enrichment_duration_seconds_bucket[5m]))
Active connections: mcp_active_connections

Scaling Considerations¶

Horizontal Scaling¶

mcp-data-platform keeps no persistent state on the pod, so it scales horizontally; live MCP sessions and the semantic cache remain per-replica. Key considerations:

Connection pooling: Each replica maintains its own connections to DataHub/Trino
Cache coordination: Semantic cache is per-instance; consider Redis for shared caching at scale
Load balancing: Use sticky sessions for SSE connections

Vertical Scaling¶

Increase resources for:

High query volume: More CPU for request processing
Large result sets: More memory for enrichment processing
Many concurrent connections: More memory for connection state

Bottleneck Analysis¶

Common bottlenecks and solutions:

Bottleneck	Symptom	Solution
DataHub API	High enrichment latency	Enable caching, increase DataHub resources
Trino queries	Timeout errors	Tune Trino cluster, add query limits
PostgreSQL audit	Write latency	Use async writes, add replicas
Network	Connection timeouts	Deploy closer to data sources