# Examples Gallery
Practical configurations and patterns for common use cases. Each example includes complete YAML configurations and explanations of the design decisions.
## Enterprise Data Governance

### Compliance-Ready Audit Configuration
Full audit logging for regulatory compliance (SOC 2, HIPAA, GDPR).
```yaml
server:
  name: mcp-data-platform
  transport: http
  address: ":8443"
  tls:
    enabled: true
    cert_file: /etc/tls/server.crt
    key_file: /etc/tls/server.key

auth:
  allow_anonymous: false
  oidc:
    enabled: true
    issuer: "https://auth.example.com/realms/enterprise"
    client_id: "mcp-data-platform"
    audience: "mcp-data-platform"
    role_claim_path: "realm_access.roles"
    required_claims:
      - sub
      - exp
      - email  # Required for audit attribution

audit:
  enabled: true
  log_tool_calls: true
  log_parameters: true  # Log query parameters (redact PII)
  log_results: false    # Don't log result data
  retention_days: 2555  # 7 years for compliance
  # Async writes avoid blocking requests
  async_writes: true
  buffer_size: 1000
  flush_interval: 5s

database:
  dsn: ${DATABASE_URL}
  max_open_conns: 25
  max_idle_conns: 5
  conn_max_lifetime: 5m
```
Key decisions:
- TLS required for all connections
- Email claim required for audit attribution
- Parameters logged (for query reconstruction) but results not logged (privacy)
- 7-year retention matches common compliance requirements
- Async writes prevent audit logging from blocking user requests
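The async write path is worth spelling out. Below is a minimal sketch of how `buffer_size` and `flush_interval` could interact, assuming a buffered channel and a background flusher; the `Event` and `Writer` types are illustrative, not the platform's actual audit types.

```go
package audit

import "time"

// Event is a single audit record queued for persistence.
type Event struct {
	Tool   string
	Params map[string]any
	At     time.Time
}

// Writer buffers events and flushes them in batches so tool calls
// never block on the audit store.
type Writer struct {
	events chan Event
	flush  func(batch []Event) // persists a batch; must not retain the slice
}

// NewWriter starts a background goroutine that flushes when the buffer
// fills or the interval elapses, mirroring buffer_size/flush_interval.
func NewWriter(bufferSize int, interval time.Duration, flush func([]Event)) *Writer {
	w := &Writer{events: make(chan Event, bufferSize), flush: flush}
	go func() {
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		batch := make([]Event, 0, bufferSize)
		for {
			select {
			case ev := <-w.events:
				batch = append(batch, ev)
				if len(batch) == bufferSize {
					w.flush(batch)
					batch = batch[:0]
				}
			case <-ticker.C:
				if len(batch) > 0 {
					w.flush(batch)
					batch = batch[:0]
				}
			}
		}
	}()
	return w
}

// Record enqueues an event without blocking the request path. If the
// buffer is full the event is dropped here; a production implementation
// might block or spill to disk instead.
func (w *Writer) Record(ev Event) {
	select {
	case w.events <- ev:
	default:
	}
}
```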
### PII Detection and Acknowledgment
Configure the platform to surface PII warnings prominently.
```yaml
toolkits:
  datahub:
    primary:
      url: https://datahub.example.com
      token: ${DATAHUB_TOKEN}
  trino:
    primary:
      host: trino.example.com
      port: 443
      ssl: true
      read_only: true  # No write operations

injection:
  trino_semantic_enrichment: true
  # PII handling configuration
  enrichment:
    pii_tags:
      - pii
      - pii-email
      - pii-phone
      - pii-ssn
      - gdpr-personal-data
    # Require acknowledgment for PII queries
    pii_acknowledgment:
      enabled: true
      message: |
        This dataset contains PII. By proceeding, you acknowledge:
        - Data will be handled per company privacy policy
        - Access is logged for compliance
        - Data must not be exported without approval

personas:
  definitions:
    analyst:
      display_name: "Data Analyst"
      roles: ["analyst"]
      tools:
        allow: ["trino_query", "trino_describe_*", "datahub_*"]
        deny: ["*_delete_*", "*_drop_*"]
      # Additional PII handling
      restrictions:
        max_rows: 10000
        allowed_schemas:
          - analytics
          - aggregated
        denied_tags:
          - restricted
          - executive-only
```
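For orientation, here is a minimal sketch of how the `pii_tags` list and acknowledgment message could combine, assuming the enricher sees a dataset's tags as plain strings; the function and variable names are illustrative, not the platform's API.

```go
package enrichment

// piiTags mirrors the enrichment.pii_tags list in the config above.
var piiTags = map[string]bool{
	"pii":                true,
	"pii-email":          true,
	"pii-phone":          true,
	"pii-ssn":            true,
	"gdpr-personal-data": true,
}

// PIIWarning returns the acknowledgment message when any of the
// dataset's tags is configured as PII, and "" otherwise.
func PIIWarning(datasetTags []string, message string) string {
	for _, tag := range datasetTags {
		if piiTags[tag] {
			return message
		}
	}
	return ""
}
```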
### Role-Based Access with Keycloak
Complete Keycloak integration with multiple persona tiers.
```yaml
auth:
  oidc:
    enabled: true
    issuer: "https://keycloak.example.com/realms/data-platform"
    client_id: "mcp-data-platform"
    audience: "mcp-data-platform"
    # Keycloak-specific configuration
    role_claim_path: "realm_access.roles"
    role_prefix: "dp_"  # Only roles starting with dp_ are considered
    # Security settings
    clock_skew_seconds: 30
    max_token_age: 8h
    refresh_enabled: true

personas:
  definitions:
    # Tier 1: Read-only analysts
    viewer:
      display_name: "Data Viewer"
      roles: ["dp_viewer"]
      tools:
        allow: ["datahub_search", "datahub_get_*"]
        deny: ["*"]
    # Tier 2: Query-capable analysts
    analyst:
      display_name: "Data Analyst"
      roles: ["dp_analyst"]
      tools:
        allow: ["trino_query", "trino_list_*", "trino_describe_*", "datahub_*"]
        deny: ["*_delete_*", "*_drop_*", "*_put_*"]
    # Tier 3: Data engineers with write access
    engineer:
      display_name: "Data Engineer"
      roles: ["dp_engineer"]
      tools:
        allow: ["trino_*", "datahub_*", "s3_*"]
        deny: ["*_delete_*"]
    # Tier 4: Administrators
    admin:
      display_name: "Administrator"
      roles: ["dp_admin"]
      tools:
        allow: ["*"]

  # Mapping for legacy role names
  role_mapping:
    oidc_to_persona:
      "legacy_readonly": "viewer"
      "legacy_analyst": "analyst"

  default_persona: viewer
```
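The allow/deny entries are plain glob patterns. The tiers above use both a catch-all deny (`["*"]` under `viewer`) and targeted denies (`"*_delete_*"` under `engineer`); one rule that reconciles both styles is to let the most specific matching pattern win. The sketch below shows that rule; it is illustrative, not the platform's actual precedence logic.

```go
package personas

import "path"

// mostSpecific returns the longest pattern matching tool, or "" when
// none match. Longer patterns are treated as more specific.
func mostSpecific(tool string, patterns []string) string {
	best := ""
	for _, p := range patterns {
		if ok, _ := path.Match(p, tool); ok && len(p) > len(best) {
			best = p
		}
	}
	return best
}

// Allowed permits a tool when some allow pattern matches and no deny
// pattern matches more specifically. Under this rule the viewer's
// deny: ["*"] acts as a default-deny, while the engineer's
// "*_delete_*" still overrides the broader "trino_*" allow.
func Allowed(tool string, allow, deny []string) bool {
	a, d := mostSpecific(tool, allow), mostSpecific(tool, deny)
	return len(a) > 0 && len(a) >= len(d)
}
```

For example, `Allowed("trino_delete_table", []string{"trino_*"}, []string{"*_delete_*"})` returns false because the ten-character deny pattern beats the seven-character allow.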
### Read-Only Mode Enforcement
Lock down production data access to read-only operations.
```yaml
toolkits:
  trino:
    production:
      host: trino-prod.example.com
      port: 443
      ssl: true
      ssl_verify: true
      # Enforce read-only at the toolkit level
      read_only: true
      # Additional query restrictions
      default_limit: 1000
      max_limit: 50000
      timeout: 300s
      # Blocked SQL patterns (defense in depth)
      blocked_patterns:
        - "INSERT"
        - "UPDATE"
        - "DELETE"
        - "DROP"
        - "CREATE"
        - "ALTER"
        - "TRUNCATE"

  s3:
    data_lake:
      region: us-east-1
      # Read-only S3 access
      read_only: true
      # Restrict to specific buckets
      allowed_buckets:
        - data-lake-prod
        - analytics-exports
```
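`blocked_patterns` is a backstop, not the primary control: `read_only` should reject writes first. Below is a sketch of a whole-word keyword check that avoids false positives on identifiers like `created_at`; it is illustrative, not the shipped implementation, and a keyword inside a string literal would still trip it, which is acceptable for defense in depth.

```go
package trino

import (
	"fmt"
	"regexp"
	"strings"
)

// checkBlocked rejects SQL containing any blocked keyword as a whole
// word, case-insensitively, so a column named "created_at" does not
// trip the CREATE rule.
func checkBlocked(sql string, blocked []string) error {
	for _, kw := range blocked {
		re := regexp.MustCompile(`(?i)\b` + regexp.QuoteMeta(kw) + `\b`)
		if re.MatchString(sql) {
			return fmt.Errorf("blocked statement: query contains %s", strings.ToUpper(kw))
		}
	}
	return nil
}
```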
## Data Democratization

### Self-Service Setup for Business Analysts
Configuration optimized for business users exploring data through AI.
```yaml
server:
  name: analytics-assistant
  transport: http
  address: ":8080"

toolkits:
  datahub:
    primary:
      url: https://datahub.example.com
      token: ${DATAHUB_TOKEN}
      # Higher limits for exploration
      default_limit: 25
      max_limit: 100
  trino:
    analytics:
      host: trino-analytics.example.com
      port: 443
      ssl: true
      catalog: analytics
      schema: curated
      # Reasonable limits for interactive use
      default_limit: 100
      max_limit: 10000
      timeout: 60s
      read_only: true

# Enable all enrichment for maximum context
injection:
  trino_semantic_enrichment: true
  datahub_query_enrichment: true

semantic:
  provider: datahub
  instance: primary
  # Aggressive caching for faster exploration
  cache:
    enabled: true
    ttl: 15m
    max_entries: 10000

personas:
  definitions:
    business_analyst:
      display_name: "Business Analyst"
      roles: ["analyst", "business"]
      tools:
        allow:
          - "datahub_search"
          - "datahub_get_entity"
          - "datahub_get_schema"
          - "datahub_get_lineage"
          - "datahub_get_glossary_term"
          - "datahub_list_domains"
          - "trino_query"
          - "trino_list_*"
          - "trino_describe_*"
        deny:
          - "*_delete_*"
          - "*_drop_*"
      # User-friendly system prompt
      prompts:
        system_prefix: |
          You are helping a business analyst explore and understand data.
          Always explain what tables contain in business terms.
          When showing query results, explain what the data means.
          If data quality is below 80%, mention this to the user.
          If a table is deprecated, always suggest the replacement.

  default_persona: business_analyst
```
### Cross-Team Data Discovery
Multi-Trino cluster setup for organization-wide data discovery.
```yaml
toolkits:
  datahub:
    # Central metadata catalog
    central:
      url: https://datahub.example.com
      token: ${DATAHUB_TOKEN}
  trino:
    # Marketing team's cluster
    marketing:
      host: trino-marketing.example.com
      port: 443
      ssl: true
      catalog: marketing
      default_limit: 1000
      read_only: true
    # Sales team's cluster
    sales:
      host: trino-sales.example.com
      port: 443
      ssl: true
      catalog: sales
      default_limit: 1000
      read_only: true
    # Finance team's cluster (restricted)
    finance:
      host: trino-finance.example.com
      port: 443
      ssl: true
      catalog: finance
      default_limit: 500
      read_only: true

# Cross-injection from central DataHub to all Trino clusters
injection:
  trino_semantic_enrichment: true
  datahub_query_enrichment: true

semantic:
  provider: datahub
  instance: central

# Query provider maps DataHub URNs to the correct Trino cluster
query:
  provider: trino
  mappings:
    "urn:li:dataset:(urn:li:dataPlatform:trino,marketing.*,PROD)": marketing
    "urn:li:dataset:(urn:li:dataPlatform:trino,sales.*,PROD)": sales
    "urn:li:dataset:(urn:li:dataPlatform:trino,finance.*,PROD)": finance

personas:
  definitions:
    cross_team_analyst:
      display_name: "Cross-Team Analyst"
      roles: ["cross_team"]
      tools:
        allow:
          - "datahub_*"
          - "trino_query:marketing"  # Explicit cluster access
          - "trino_query:sales"
          - "trino_list_*"
          - "trino_describe_*"
        deny:
          - "trino_query:finance"  # No finance access
          - "*_delete_*"
    finance_analyst:
      display_name: "Finance Analyst"
      roles: ["finance"]
      tools:
        allow:
          - "datahub_*"
          - "trino_*"  # All clusters, including finance
        deny:
          - "*_delete_*"
```
### New Employee Onboarding Workflow
Configuration with helpful prompts for new team members.
```yaml
personas:
  definitions:
    new_hire:
      display_name: "New Team Member"
      roles: ["new_hire", "onboarding"]
      tools:
        allow:
          - "datahub_search"
          - "datahub_get_entity"
          - "datahub_get_schema"
          - "datahub_get_lineage"
          - "datahub_get_glossary_term"
          - "datahub_list_domains"
          - "datahub_list_tags"
          - "trino_list_*"
          - "trino_describe_*"
        # No direct query access yet
        deny:
          - "trino_query"
          - "*_delete_*"
      prompts:
        system_prefix: |
          You are onboarding a new team member to our data platform.
          When they ask about data:
          1. Start with the domain (Sales, Marketing, Finance, etc.)
          2. Explain what the domain contains
          3. Show key tables and their purposes
          4. Point out data owners they can contact
          5. Highlight any data quality concerns
          Always recommend they review the glossary terms for unfamiliar concepts.
          If they want to query data, explain they need to complete onboarding first.
        onboarding_resources: |
          Useful resources for new team members:
          - Data Glossary: /glossary
          - Domain Owners: /domains
          - Data Quality Dashboard: /quality
          - Request Access: /access-request
```
## AI/ML Workflows

### AI Agent Exploring Unfamiliar Datasets
Configuration for autonomous AI agents discovering data.
```yaml
server:
  name: ml-data-explorer
  transport: http
  address: ":8080"

toolkits:
  datahub:
    primary:
      url: https://datahub.example.com
      token: ${DATAHUB_TOKEN}
      default_limit: 50  # More results for exploration
  trino:
    ml_cluster:
      host: trino-ml.example.com
      port: 443
      ssl: true
      catalog: feature_store
      read_only: true
      default_limit: 100
      max_limit: 10000

injection:
  trino_semantic_enrichment: true
  datahub_query_enrichment: true

# Full lineage depth for understanding data provenance
semantic:
  provider: datahub
  instance: primary
  lineage:
    max_depth: 5
    include_column_lineage: true

personas:
  definitions:
    ml_agent:
      display_name: "ML Data Agent"
      roles: ["ml_agent", "automated"]
      tools:
        allow:
          - "datahub_search"
          - "datahub_get_entity"
          - "datahub_get_schema"
          - "datahub_get_lineage"
          - "datahub_get_queries"  # See popular queries
          - "datahub_list_data_products"
          - "trino_query"
          - "trino_list_*"
          - "trino_describe_*"
          - "trino_explain"  # Query planning
        deny:
          - "*_delete_*"
          - "*_put_*"
      # Rate limiting for automated access
      rate_limit:
        requests_per_minute: 60
        requests_per_hour: 1000
      prompts:
        system_prefix: |
          You are an ML data exploration agent. Your goal is to discover
          and evaluate datasets for machine learning use cases.
          When exploring:
          1. Check data quality scores (reject < 70%)
          2. Verify data freshness (check last_updated)
          3. Trace lineage to understand transformations
          4. Look for feature-ready columns (numeric, categorical)
          5. Note any PII tags that require handling
          Always document your findings with URNs for reproducibility.
```
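The `rate_limit` block can be pictured as two token buckets, one per window. Below is a sketch using `golang.org/x/time/rate`; the wiring and burst choices are illustrative, and the platform's actual enforcement may differ.

```go
package personas

import (
	"context"

	"golang.org/x/time/rate"
)

// Two token buckets mirror the persona's rate_limit block. The burst
// sizes here allow a short backlog to drain at once; a stricter policy
// would use smaller bursts.
var (
	perMinute = rate.NewLimiter(rate.Limit(60.0/60.0), 10)      // requests_per_minute: 60
	perHour   = rate.NewLimiter(rate.Limit(1000.0/3600.0), 50) // requests_per_hour: 1000
)

// waitTurn blocks until the agent may issue another tool call, or
// returns early if the context is cancelled. Both budgets must allow
// the call.
func waitTurn(ctx context.Context) error {
	if err := perMinute.Wait(ctx); err != nil {
		return err
	}
	return perHour.Wait(ctx)
}
```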
### Feature Store Integration
Connecting to a feature store with quality gates.
```yaml
toolkits:
  trino:
    feature_store:
      host: trino-features.example.com
      port: 443
      ssl: true
      catalog: feature_store
      schema: production
      read_only: true

injection:
  trino_semantic_enrichment: true

# Quality gates for feature selection
quality_gates:
  enabled: true
  minimum_quality_score: 0.75
  maximum_null_percentage: 10
  require_documentation: true
  # Warn but don't block
  soft_gates:
    - name: freshness
      max_age_days: 7
      message: "Feature data is over 7 days old"
    - name: coverage
      min_coverage: 0.90
      message: "Feature coverage is below 90%"
  # Block features that don't meet criteria
  hard_gates:
    - name: deprecated
      message: "Cannot use deprecated features"
    - name: pii
      message: "PII features require explicit approval"

personas:
  definitions:
    ml_engineer:
      display_name: "ML Engineer"
      roles: ["ml_engineer"]
      tools:
        allow:
          - "datahub_*"
          - "trino_*"
        deny:
          - "*_delete_*"
      prompts:
        system_prefix: |
          You are helping an ML engineer select features from the feature store.
          For each feature, report:
          - Quality score and null percentage
          - Last update time
          - Upstream dependencies (lineage)
          - Any quality gate warnings
          Reject features that fail hard quality gates.
          Flag features with soft gate warnings but allow their use.
```
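To make the soft/hard distinction concrete, here is a sketch of a gate evaluator: hard gates return an error that blocks the feature, while soft gates only accumulate warnings. The `Feature` fields are assumptions mapped from the thresholds above, not the platform's actual metadata model.

```go
package gates

import "fmt"

// Feature carries the metadata the gates inspect; field names are
// illustrative.
type Feature struct {
	QualityScore   float64
	NullPercentage float64
	Documented     bool
	Deprecated     bool
	HasPII         bool
	AgeDays        int
	Coverage       float64
}

// Evaluate returns an error for any failed hard gate (blocking) and a
// list of soft-gate warnings otherwise, mirroring the config above.
func Evaluate(f Feature) (warnings []string, err error) {
	switch {
	case f.Deprecated:
		return nil, fmt.Errorf("cannot use deprecated features")
	case f.HasPII:
		return nil, fmt.Errorf("PII features require explicit approval")
	case f.QualityScore < 0.75:
		return nil, fmt.Errorf("quality score %.2f below minimum 0.75", f.QualityScore)
	case f.NullPercentage > 10:
		return nil, fmt.Errorf("null percentage %.1f%% above maximum 10%%", f.NullPercentage)
	case !f.Documented:
		return nil, fmt.Errorf("feature is undocumented")
	}
	if f.AgeDays > 7 {
		warnings = append(warnings, "Feature data is over 7 days old")
	}
	if f.Coverage < 0.90 {
		warnings = append(warnings, "Feature coverage is below 90%")
	}
	return warnings, nil
}
```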
### Pipeline Lineage Exploration
Configuration for understanding data pipeline provenance.
```yaml
semantic:
  provider: datahub
  instance: primary
  lineage:
    max_depth: 10                  # Deep lineage traversal
    include_column_lineage: true   # Column-level lineage
    include_process_lineage: true  # Show transformation steps
  cache:
    enabled: true
    ttl: 5m
    # Separate cache for lineage queries
    lineage_ttl: 1m  # Shorter TTL for lineage

injection:
  trino_semantic_enrichment: true
  # Include full lineage in enrichment
  enrichment:
    include_lineage: true
    lineage_depth: 3
    lineage_direction: both

personas:
  definitions:
    data_engineer:
      display_name: "Data Engineer"
      roles: ["data_engineer"]
      tools:
        allow:
          - "datahub_get_lineage"
          - "datahub_get_entity"
          - "datahub_search"
          - "trino_*"
        deny:
          - "*_delete_*"
      prompts:
        system_prefix: |
          You are helping a data engineer understand data lineage.
          When showing lineage:
          1. Start with the requested entity
          2. Show immediate upstream sources
          3. Show immediate downstream consumers
          4. Highlight any transformation steps
          5. Note data quality changes through the pipeline
          Use URNs consistently for reference.
```
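`max_depth` bounds a breadth-first walk over the lineage graph. Below is a sketch of such a traversal with the lineage lookup abstracted as a function; the types are illustrative, standing in for a DataHub lineage call.

```go
package lineage

// Edges returns the direct upstream (or downstream) URNs of an entity.
type Edges func(urn string) []string

// Traverse walks lineage breadth-first up to maxDepth hops, matching
// the max_depth setting above, and returns each reachable URN once.
func Traverse(start string, edges Edges, maxDepth int) []string {
	seen := map[string]bool{start: true}
	frontier := []string{start}
	var out []string
	for depth := 0; depth < maxDepth && len(frontier) > 0; depth++ {
		var next []string
		for _, urn := range frontier {
			for _, n := range edges(urn) {
				if !seen[n] {
					seen[n] = true
					out = append(out, n)
					next = append(next, n)
				}
			}
		}
		frontier = next
	}
	return out
}
```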
## Integration Patterns

### Multi-Provider Configuration
Connect multiple instances of each service type.
```yaml
toolkits:
  datahub:
    # Production DataHub
    production:
      url: https://datahub.example.com
      token: ${DATAHUB_TOKEN_PROD}
    # Development DataHub
    development:
      url: https://datahub-dev.example.com
      token: ${DATAHUB_TOKEN_DEV}
  trino:
    # Production Trino (read-only)
    production:
      host: trino.example.com
      port: 443
      ssl: true
      read_only: true
    # Development Trino (read-write)
    development:
      host: trino-dev.example.com
      port: 8080
      ssl: false
      read_only: false
    # Analytics Trino
    analytics:
      host: trino-analytics.example.com
      port: 443
      ssl: true
      read_only: true
  s3:
    # AWS S3
    aws:
      region: us-east-1
      access_key_id: ${AWS_ACCESS_KEY_ID}
      secret_access_key: ${AWS_SECRET_ACCESS_KEY}
    # MinIO (on-prem)
    minio:
      endpoint: http://minio.local:9000
      use_path_style: true
      disable_ssl: true
      access_key_id: ${MINIO_ACCESS_KEY}
      secret_access_key: ${MINIO_SECRET_KEY}

# Configure which instances to use for enrichment
semantic:
  provider: datahub
  instance: production  # Use production DataHub for metadata

query:
  provider: trino
  instance: production  # Use production Trino for availability checks

storage:
  provider: s3
  instance: aws  # Use AWS S3 for storage checks
```
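Instance selection reduces to a named lookup over the constructed providers. Below is a sketch, assuming instances are held in maps keyed by the names used in the config; the helper is hypothetical, not the platform's registry API.

```go
package platform

import "fmt"

// instanceOf returns the provider instance selected by the semantic,
// query, or storage block, e.g. instanceOf(datahubInstances, "production").
func instanceOf[T any](instances map[string]T, name string) (T, error) {
	p, ok := instances[name]
	if !ok {
		var zero T
		return zero, fmt.Errorf("no instance named %q configured", name)
	}
	return p, nil
}
```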
### Custom Toolkit Development
Example structure for adding a custom toolkit.
```go
package custom

import (
	"context"

	"github.com/modelcontextprotocol/go-sdk/mcp"
	"github.com/txn2/mcp-data-platform/pkg/query"
	"github.com/txn2/mcp-data-platform/pkg/semantic"
)

// Config carries the instance settings from the toolkits.custom block
// shown below.
type Config struct {
	APIEndpoint string `yaml:"api_endpoint"`
	APIKey      string `yaml:"api_key"`
}

// Toolkit implements the registry.Toolkit interface.
type Toolkit struct {
	name             string
	config           Config
	semanticProvider semantic.Provider
	queryProvider    query.Provider
}

// Kind identifies the toolkit type referenced in configuration.
func (t *Toolkit) Kind() string { return "custom" }

// Name returns the configured instance name.
func (t *Toolkit) Name() string { return t.name }

// Tools lists the tool names this toolkit provides.
func (t *Toolkit) Tools() []string {
	return []string{
		"custom_operation_one",
		"custom_operation_two",
	}
}

// RegisterTools adds this toolkit's tools to the MCP server.
// t.handleOperationOne (not shown) is the handler that executes the
// tool call with a context.Context.
func (t *Toolkit) RegisterTools(s *mcp.Server) {
	s.AddTool(mcp.Tool{
		Name:        "custom_operation_one",
		Description: "Perform custom operation one",
		InputSchema: mcp.ToolInputSchema{
			Type: "object",
			Properties: map[string]any{
				"input": map[string]any{
					"type":        "string",
					"description": "Input value",
				},
			},
			Required: []string{"input"},
		},
	}, t.handleOperationOne)
}

// SetSemanticProvider wires in the semantic (metadata) provider.
func (t *Toolkit) SetSemanticProvider(p semantic.Provider) {
	t.semanticProvider = p
}

// SetQueryProvider wires in the query provider.
func (t *Toolkit) SetQueryProvider(p query.Provider) {
	t.queryProvider = p
}

// Close releases any resources held by the toolkit.
func (t *Toolkit) Close() error {
	return nil
}
```
Register the custom toolkit in configuration:
```yaml
toolkits:
  custom:
    my_custom:
      # Custom toolkit configuration
      api_endpoint: https://custom-api.example.com
      api_key: ${CUSTOM_API_KEY}
```
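For the `custom` kind above to resolve at startup, the toolkit also has to be registered with the platform's registry. Below is a hedged sketch of what that hook could look like; `registry.Register` and its factory signature are assumptions for illustration, so consult `pkg/registry` for the actual extension point.

```go
package custom

import "github.com/txn2/mcp-data-platform/pkg/registry"

// init registers a factory so each configured instance of the "custom"
// kind can be constructed. The Register function shown here is
// hypothetical; only the registry.Toolkit interface is documented above.
func init() {
	registry.Register("custom", func(name string, cfg Config) (registry.Toolkit, error) {
		return &Toolkit{name: name, config: cfg}, nil
	})
}
```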
## Next Steps
- Configuration Reference - Full configuration schema
- Tools API - Complete tool documentation
- Troubleshooting - Common issues and solutions