mcp-data-platform¶
Your AI assistant can run SQL. But it doesn't know that cust_id contains PII, that the table was deprecated last month, or who to ask when something breaks.
mcp-data-platform fixes that. Connect AI assistants to your data infrastructure with automatic business context from your semantic layer. Query a table and get its meaning, owners, quality scores, and deprecation warnings in the same response.
The only requirement is DataHub. Add Trino for SQL queries and S3 for object storage when you're ready. Learn why this stack →
- Semantic-First: DataHub is the foundation. Query a table, get its business context automatically: owners, tags, quality scores, deprecation warnings. No separate lookups needed.
- Cross-Injection: Trino results include DataHub metadata. DataHub searches show which datasets are queryable. Context flows between services automatically.
- Enterprise Security: Fail-closed authentication, TLS enforcement, prompt injection protection, and read-only mode. Built for production environments.
- Personas: Define who can use which tools. Analysts get read access. Admins get everything. Map roles from your identity provider.
- Audit Logging: Every tool call logged to PostgreSQL: who called what, when, how long, success or failure. Automatic retention cleanup and parameter sanitization.
- MCP Apps: Interactive UI components for tool results. Tables, charts, and filters rendered in iframes alongside AI responses.
The Problem We Solve¶
When AI assistants query your data, they work blind. They see column names and types, but they don't know what the data means.
Consider this scenario: An AI assistant is helping a new analyst understand customer data. Without semantic context:
- "What tables have customer information?" → Returns table names
- "Describe the customers table" → Returns columns and types
- "Is this data up to date?" → Requires separate DataHub query
- "Who should I contact about issues?" → Requires another DataHub query
- "Are there data quality concerns?" → Yet another query
- "Wait, is this table deprecated?" → Finally discovers the table was deprecated months ago
With mcp-data-platform, the first response includes everything:
Table: customers (DEPRECATED - use customers_v2)
─────────────────────────────────────────────────
Owners: Data Platform Team
Domain: Customer Analytics
Tags: pii, gdpr-relevant
Quality Score: 67% (degraded)
Last Updated: 2 weeks ago
Columns:
  customer_id   BIGINT      Unique customer identifier
  email         VARCHAR     Contact email (PII - handle per GDPR policy)
  created_at    TIMESTAMP   Account creation timestamp
  segment       VARCHAR     Marketing segment (see glossary: customer-segment)
One call. Complete context. Warnings front and center.
See It In Action¶
Before: Raw Trino Response¶
{
  "columns": [
    {"name": "customer_id", "type": "BIGINT"},
    {"name": "email", "type": "VARCHAR"},
    {"name": "segment", "type": "VARCHAR"}
  ]
}
Just structure. No meaning. No warnings. No ownership.
After: Enriched Response with mcp-data-platform¶
{
  "columns": [
    {"name": "customer_id", "type": "BIGINT"},
    {"name": "email", "type": "VARCHAR"},
    {"name": "segment", "type": "VARCHAR"}
  ],
  "semantic_context": {
    "description": "Core customer records with PII data",
    "deprecation": {
      "deprecated": true,
      "note": "Use customers_v2 for GDPR compliance",
      "replacement": "urn:li:dataset:customers_v2"
    },
    "owners": [
      {"name": "Data Platform Team", "type": "group", "email": "[email protected]"}
    ],
    "tags": ["pii", "gdpr-relevant"],
    "domain": {"name": "Customer Analytics", "urn": "urn:li:domain:customer"},
    "quality_score": 0.67,
    "columns": {
      "customer_id": {"description": "Unique customer identifier", "tags": []},
      "email": {"description": "Contact email", "tags": ["pii"], "glossary_term": "customer-email"},
      "segment": {"description": "Marketing segment", "glossary_term": "customer-segment"}
    }
  }
}
The AI assistant now knows:
- This table is deprecated with a clear migration path
- It contains PII requiring special handling
- Who to contact when something is wrong
- The data quality has degraded (67%)
- What each column actually means
How It Works¶
sequenceDiagram
participant AI as AI Assistant
participant P as mcp-data-platform
participant T as Trino
participant D as DataHub
AI->>P: trino_describe_table "customers"
P->>T: DESCRIBE customers
T-->>P: columns, types
P->>D: Get semantic context
D-->>P: description, owners, tags, quality, deprecation
P-->>AI: Schema + Full Business Context
The enrichment middleware intercepts tool responses and adds semantic context before returning them to the client. This cross-injection works in every direction:
| When you use | You also get |
|---|---|
| Trino | DataHub metadata (owners, tags, quality, deprecation) |
| DataHub search | Which datasets are queryable in Trino |
| S3 | DataHub metadata for matching datasets |
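To make the middleware idea concrete, here is a minimal sketch of the enrichment pattern in Go. Everything in it is illustrative rather than the platform's actual API: ToolResult, ToolHandler, SemanticProvider, and WithEnrichment are hypothetical names standing in for the real types.

package enrichment

import "context"

// ToolResult is a hypothetical response shape: the raw tool payload plus
// the semantic context attached by the middleware.
type ToolResult struct {
    Data            map[string]any // raw tool payload
    SemanticContext map[string]any // attached by the middleware, empty if no lookup happened
}

// ToolHandler is a hypothetical signature for an MCP tool implementation.
type ToolHandler func(ctx context.Context, args map[string]any) (*ToolResult, error)

// SemanticProvider abstracts the DataHub lookup used for enrichment.
type SemanticProvider interface {
    ContextFor(ctx context.Context, datasetURN string) (map[string]any, error)
}

// WithEnrichment wraps a tool handler so every successful response carries
// semantic context for the dataset the call touched.
func WithEnrichment(next ToolHandler, sem SemanticProvider, urnOf func(args map[string]any) string) ToolHandler {
    return func(ctx context.Context, args map[string]any) (*ToolResult, error) {
        res, err := next(ctx, args)
        if err != nil {
            return nil, err
        }
        if urn := urnOf(args); urn != "" {
            if sc, lookupErr := sem.ContextFor(ctx, urn); lookupErr == nil {
                res.SemanticContext = sc
            }
        }
        return res, nil
    }
}

In this sketch a failed metadata lookup simply falls back to the raw tool result; how strictly to gate on metadata is a policy choice, separate from authentication, which stays fail-closed (see Security below).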
Lineage-Aware Inheritance¶
Downstream datasets often lack documentation even when their upstream sources are well-documented. The platform automatically inherits column metadata from upstream tables via DataHub lineage.
sequenceDiagram
participant AI as AI Assistant
participant P as mcp-data-platform
participant D as DataHub
AI->>P: trino_describe_table "elasticsearch.sales"
P->>D: Get schema (no descriptions)
P->>D: Get upstream lineage
D-->>P: cassandra.system_sale (1 hop)
P->>D: Get upstream schema (has descriptions)
P-->>AI: Schema + Inherited Column Context
What Gets Inherited¶
| Metadata | Example |
|---|---|
| Descriptions | "Net sale amount before adjustments" |
| Glossary Terms | urn:li:glossaryTerm:NetSaleAmount |
| Tags | pii, financial |
Provenance Tracking¶
Every inherited field includes its source:
{
  "column_context": {
    "amount": {
      "description": "Net sale amount before adjustments",
      "inherited_from": {
        "source_dataset": "urn:li:dataset:cassandra.system_sale",
        "source_column": "initial_net",
        "hops": 1,
        "match_method": "name_transformed"
      }
    }
  },
  "inheritance_sources": ["urn:li:dataset:cassandra.system_sale"]
}
Configuration¶
semantic:
  provider: datahub
  instance: primary
  lineage:
    enabled: true
    max_hops: 2
    inherit:
      - glossary_terms
      - descriptions
      - tags
    prefer_column_lineage: true
    # Strip prefixes for nested JSON paths
    column_transforms:
      - strip_prefix: "rxtxmsg.payload."
      - strip_prefix: "rxtxmsg.header."
    # Explicit mappings when lineage isn't in DataHub
    aliases:
      - source: "cassandra.prod_fuse.system_sale"
        targets:
          - "elasticsearch.default.jakes-sale-*"
Quick Start¶
Run locally with your own credentials:
# Install
go install github.com/txn2/mcp-data-platform/cmd/mcp-data-platform@latest
# Add to Claude Code
claude mcp add mcp-data-platform -- mcp-data-platform --config platform.yaml
No MCP authentication needed. Uses your configured DataHub/Trino/S3 credentials.
Deploy as a shared service with Keycloak authentication: users connect via Claude Desktop and authenticate through your identity provider.
Choose Your Path¶
- Deploy the Server: Configure via YAML. Connect DataHub, add Trino and S3 if you have them. Works out of the box.
- Build Your Own: Import the Go library. Add custom tools, swap providers, write middleware. Make it yours.
What's Included¶
| Toolkit | Tools | Purpose |
|---|---|---|
| DataHub | 11 tools | Search, metadata, lineage, glossary, domains |
| Trino | 7 tools | SQL queries, schema exploration, catalogs |
| S3 | 6-9 tools | Bucket/object operations, presigned URLs |
DataHub is the foundation and serves as your semantic layer. Add Trino for SQL queries, S3 for object storage. Use what you have.
Use Cases¶
Enterprise Data Governance¶
- Audit Trails: Every query logged with user identity
- PII Protection: Tag-based warnings for sensitive data
- Access Control: Persona system enforces who can query what
- Deprecation Enforcement: Warnings surface before using stale data
Data Democratization¶
- Self-Service: Business users explore data with context
- Cross-Team Discovery: Find datasets across systems
- Onboarding: New team members understand data immediately
AI/ML Workflows¶
- Autonomous Exploration: AI agents discover datasets without guidance
- Feature Discovery: Find ML features with quality scores
- Quality Gates: Avoid problematic datasets automatically
Security¶
Built on a fail-closed security model: missing or invalid credentials mean access is denied, never bypassed.
| Feature | Description |
|---|---|
| Fail-Closed Authentication | Invalid credentials = denied (never bypass) |
| Required JWT Claims | Tokens must include sub and exp |
| TLS for HTTP | Configurable TLS with plaintext warnings |
| Prompt Injection Protection | Metadata sanitization |
| Read-Only Mode | Enforced at query level |
| Default-Deny Personas | No implicit tool access |
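The fail-closed idea is easy to sketch: every missing or malformed piece of the credential resolves to a denial, and there is no default-allow path. The types and function below are hypothetical simplifications; a real JWT validation layer would also verify the token signature, which is omitted here.

package authz

import (
    "errors"
    "time"
)

// Claims is a simplified view of the JWT claims the platform requires.
type Claims struct {
    Subject   string    // "sub": who is calling
    ExpiresAt time.Time // "exp": when the token stops being valid
}

var errDenied = errors.New("access denied")

// authorize is fail-closed: every unexpected condition returns a denial,
// and no branch grants access by default.
func authorize(claims *Claims, now time.Time) error {
    if claims == nil {
        return errDenied // no credentials at all
    }
    if claims.Subject == "" {
        return errDenied // required "sub" claim missing
    }
    if claims.ExpiresAt.IsZero() || !now.Before(claims.ExpiresAt) {
        return errDenied // required "exp" claim missing, or token expired
    }
    return nil
}

Combined with default-deny personas, a request only runs when it carries a valid identity and maps to an explicitly allowed tool.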
Runs With¶
- Claude Desktop (HTTP endpoint or stdio)
- Claude Code (stdio or HTTP)
- Any MCP client
Built On¶
| Project | What it does |
|---|---|
| mcp-trino | Trino queries |
| mcp-datahub | DataHub metadata |
| mcp-s3 | S3 storage |
These work standalone. This platform wires them together with cross-injection, auth, and personas.