mcp-data-platform¶
Your AI assistant can run SQL. But it doesn't know that cust_id contains PII, that the table was deprecated last month, or who to ask when something breaks.
mcp-data-platform fixes that. Connect AI assistants to your data infrastructure with automatic business context from your semantic layer. Query a table and get its meaning, owners, quality scores, and deprecation warnings in the same response.
The only requirement is DataHub. Add Trino for SQL queries and S3 for object storage when you're ready. Learn why this stack →
- Semantic-First: DataHub is the foundation. Query a table, get its business context automatically: owners, tags, quality scores, deprecation warnings. No separate lookups needed.
- Cross-Injection: Trino results include DataHub metadata. DataHub searches show which datasets are queryable. Context flows between services automatically.
- Enterprise Security: Fail-closed authentication, TLS enforcement, prompt injection protection, and read-only mode. Built for production environments.
- Personas: Define who can use which tools. Analysts get read access. Admins get everything. Map roles from your identity provider.
- Audit Logging: Every tool call logged to PostgreSQL: who called what, when, how long, success or failure. Automatic retention cleanup and parameter sanitization.
- MCP Apps: Interactive UI components for tool results. Tables, charts, and filters rendered in iframes alongside AI responses.
The Problem We Solve¶
When AI assistants query your data, they work blind. They see column names and types, but they don't know what the data means.
Consider this scenario: An AI assistant is helping a new analyst understand customer data. Without semantic context:
- "What tables have customer information?" → Returns table names
- "Describe the customers table" → Returns columns and types
- "Is this data up to date?" → Requires separate DataHub query
- "Who should I contact about issues?" → Requires another DataHub query
- "Are there data quality concerns?" → Yet another query
- "Wait, is this table deprecated?" → Finally discovers the table was deprecated months ago
With mcp-data-platform, the first response includes everything:
Table: customers (DEPRECATED - use customers_v2)
─────────────────────────────────────────────────
Owners: Data Platform Team
Domain: Customer Analytics
Tags: pii, gdpr-relevant
Quality Score: 67% (degraded)
Last Updated: 2 weeks ago
Columns:
  customer_id   BIGINT      Unique customer identifier
  email         VARCHAR     Contact email (PII - handle per GDPR policy)
  created_at    TIMESTAMP   Account creation timestamp
  segment       VARCHAR     Marketing segment (see glossary: customer-segment)
One call. Complete context. Warnings front and center.
See It In Action¶
Before: Raw Trino Response¶
{
  "columns": [
    {"name": "customer_id", "type": "BIGINT"},
    {"name": "email", "type": "VARCHAR"},
    {"name": "segment", "type": "VARCHAR"}
  ]
}
Just structure. No meaning. No warnings. No ownership.
After: Enriched Response with mcp-data-platform¶
{
  "columns": [
    {"name": "customer_id", "type": "BIGINT"},
    {"name": "email", "type": "VARCHAR"},
    {"name": "segment", "type": "VARCHAR"}
  ],
  "semantic_context": {
    "description": "Core customer records with PII data",
    "deprecation": {
      "deprecated": true,
      "note": "Use customers_v2 for GDPR compliance",
      "replacement": "urn:li:dataset:customers_v2"
    },
    "owners": [
      {"name": "Data Platform Team", "type": "group", "email": "[email protected]"}
    ],
    "tags": ["pii", "gdpr-relevant"],
    "domain": {"name": "Customer Analytics", "urn": "urn:li:domain:customer"},
    "quality_score": 0.67,
    "columns": {
      "customer_id": {"description": "Unique customer identifier", "tags": []},
      "email": {"description": "Contact email", "tags": ["pii"], "glossary_term": "customer-email"},
      "segment": {"description": "Marketing segment", "glossary_term": "customer-segment"}
    }
  }
}
The AI assistant now knows:
- This table is deprecated with a clear migration path
- It contains PII requiring special handling
- Who to contact when something is wrong
- The data quality has degraded (67%)
- What each column actually means
How It Works¶
sequenceDiagram
participant AI as AI Assistant
participant P as mcp-data-platform
participant T as Trino
participant D as DataHub
AI->>P: trino_describe_table "customers"
P->>T: DESCRIBE customers
T-->>P: columns, types
P->>D: Get semantic context
D-->>P: description, owners, tags, quality, deprecation
P-->>AI: Schema + Full Business Context
The enrichment middleware intercepts tool responses and adds semantic context before returning them to the client. This cross-injection works in every direction:
| When you use | You also get |
|---|---|
| Trino | DataHub metadata (owners, tags, quality, deprecation) |
| DataHub search | Which datasets are queryable in Trino |
| S3 | DataHub metadata for matching datasets |
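To make the middleware idea concrete, here is a minimal sketch of the enrichment pattern in Go. Everything in it is illustrative rather than the platform's actual API: ToolResult, ToolHandler, SemanticProvider, and WithEnrichment are hypothetical names standing in for the real types.

package enrichment

import "context"

// ToolResult is a hypothetical response shape: the raw tool payload plus
// the semantic context attached by the middleware.
type ToolResult struct {
    Data            map[string]any // raw tool payload
    SemanticContext map[string]any // attached by the middleware, empty if no lookup happened
}

// ToolHandler is a hypothetical signature for an MCP tool implementation.
type ToolHandler func(ctx context.Context, args map[string]any) (*ToolResult, error)

// SemanticProvider abstracts the DataHub lookup used for enrichment.
type SemanticProvider interface {
    ContextFor(ctx context.Context, datasetURN string) (map[string]any, error)
}

// WithEnrichment wraps a tool handler so every successful response carries
// semantic context for the dataset the call touched.
func WithEnrichment(next ToolHandler, sem SemanticProvider, urnOf func(args map[string]any) string) ToolHandler {
    return func(ctx context.Context, args map[string]any) (*ToolResult, error) {
        res, err := next(ctx, args)
        if err != nil {
            return nil, err
        }
        if urn := urnOf(args); urn != "" {
            if sc, lookupErr := sem.ContextFor(ctx, urn); lookupErr == nil {
                res.SemanticContext = sc
            }
        }
        return res, nil
    }
}

In this sketch a failed metadata lookup simply falls back to the raw tool result; how strictly to gate on metadata is a policy choice, separate from authentication, which stays fail-closed (see Security below).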
Lineage-Aware Inheritance¶
Downstream datasets often lack documentation even when their upstream sources are well-documented. The platform automatically inherits column metadata from upstream tables via DataHub lineage.
sequenceDiagram
participant AI as AI Assistant
participant P as mcp-data-platform
participant D as DataHub
AI->>P: trino_describe_table "elasticsearch.sales"
P->>D: Get schema (no descriptions)
P->>D: Get upstream lineage
D-->>P: cassandra.system_sale (1 hop)
P->>D: Get upstream schema (has descriptions)
P-->>AI: Schema + Inherited Column Context
What Gets Inherited¶
| Metadata | Example |
|---|---|
| Descriptions | "Net sale amount before adjustments" |
| Glossary Terms | urn:li:glossaryTerm:NetSaleAmount |
| Tags | pii, financial |
Provenance Tracking¶
Every inherited field includes its source:
{
  "column_context": {
    "amount": {
      "description": "Net sale amount before adjustments",
      "inherited_from": {
        "source_dataset": "urn:li:dataset:cassandra.system_sale",
        "source_column": "initial_net",
        "hops": 1,
        "match_method": "name_transformed"
      }
    }
  },
  "inheritance_sources": ["urn:li:dataset:cassandra.system_sale"]
}
Configuration¶
semantic:
  provider: datahub
  instance: primary
  lineage:
    enabled: true
    max_hops: 2
    inherit:
      - glossary_terms
      - descriptions
      - tags
    prefer_column_lineage: true
    # Strip prefixes for nested JSON paths
    column_transforms:
      - strip_prefix: "rxtxmsg.payload."
      - strip_prefix: "rxtxmsg.header."
    # Explicit mappings when lineage isn't in DataHub
    aliases:
      - source: "cassandra.prod_fuse.system_sale"
        targets:
          - "elasticsearch.default.jakes-sale-*"
Quick Start¶
Run locally with your own credentials:
# Install
go install github.com/txn2/mcp-data-platform/cmd/mcp-data-platform@latest
# Add to Claude Code
claude mcp add mcp-data-platform -- mcp-data-platform --config platform.yaml
No MCP authentication needed. Uses your configured DataHub/Trino/S3 credentials.
Deploy as a shared service with Keycloak authentication: users connect via Claude Desktop and authenticate through your identity provider.
Choose Your Path¶
- Deploy the Server: Configure via YAML. Connect DataHub, add Trino and S3 if you have them. Works out of the box.
- Build Your Own: Import the Go library. Add custom tools, swap providers, write middleware. Make it yours.
What's Included¶
| Toolkit | Tools | Purpose |
|---|---|---|
| DataHub | 11 tools | Search, metadata, lineage, glossary, domains |
| Trino | 7 tools | SQL queries, schema exploration, catalogs |
| S3 | 6-9 tools | Bucket/object operations, presigned URLs |
DataHub is the foundation and serves as your semantic layer. Add Trino for SQL queries, S3 for object storage. Use what you have.
Use Cases¶
Enterprise Data Governance¶
- Audit Trails: Every query logged with user identity
- PII Protection: Tag-based warnings for sensitive data
- Access Control: Persona system enforces who can query what
- Deprecation Enforcement: Warnings surface before using stale data
Data Democratization¶
- Self-Service: Business users explore data with context
- Cross-Team Discovery: Find datasets across systems
- Onboarding: New team members understand data immediately
AI/ML Workflows¶
- Autonomous Exploration: AI agents discover datasets without guidance
- Feature Discovery: Find ML features with quality scores
- Quality Gates: Avoid problematic datasets automatically
Security¶
Built on a fail-closed security model: missing or invalid credentials mean access is denied, never bypassed.
| Feature | Description |
|---|---|
| Fail-Closed Authentication | Invalid credentials = denied (never bypass) |
| Required JWT Claims | Tokens must include sub and exp |
| TLS for HTTP | Configurable TLS with plaintext warnings |
| Prompt Injection Protection | Metadata sanitization |
| Read-Only Mode | Enforced at query level |
| Default-Deny Personas | No implicit tool access |
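The fail-closed idea is easy to sketch: every missing or malformed piece of the credential resolves to a denial, and there is no default-allow path. The types and function below are hypothetical simplifications; a real JWT validation layer would also verify the token signature, which is omitted here.

package authz

import (
    "errors"
    "time"
)

// Claims is a simplified view of the JWT claims the platform requires.
type Claims struct {
    Subject   string    // "sub": who is calling
    ExpiresAt time.Time // "exp": when the token stops being valid
}

var errDenied = errors.New("access denied")

// authorize is fail-closed: every unexpected condition returns a denial,
// and no branch grants access by default.
func authorize(claims *Claims, now time.Time) error {
    if claims == nil {
        return errDenied // no credentials at all
    }
    if claims.Subject == "" {
        return errDenied // required "sub" claim missing
    }
    if claims.ExpiresAt.IsZero() || !now.Before(claims.ExpiresAt) {
        return errDenied // required "exp" claim missing, or token expired
    }
    return nil
}

Combined with default-deny personas, a request only runs when it carries a valid identity and maps to an explicitly allowed tool.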
Runs With¶
- Claude Desktop (HTTP endpoint or stdio)
- Claude Code (stdio or HTTP)
- Any MCP client
Built On¶
| Project | What it does |
|---|---|
| mcp-trino | Trino queries |
| mcp-datahub | DataHub metadata |
| mcp-s3 | S3 storage |
These work standalone. This platform wires them together with cross-injection, auth, and personas.