mcp-data-platform

Your AI assistant can run SQL. But it doesn't know that cust_id contains PII, that the table was deprecated last month, or who to ask when something breaks.

mcp-data-platform fixes that. Connect AI assistants to your data infrastructure with automatic business context from your semantic layer. Query a table and get its meaning, owners, quality scores, and deprecation warnings in the same response.

The only requirement is DataHub. Add Trino for SQL queries and S3 for object storage when you're ready. Learn why this stack →

  • Semantic-First


    DataHub is the foundation. Query a table, get its business context automatically: owners, tags, quality scores, deprecation warnings. No separate lookups needed.

  • Cross-Injection


    Trino results include DataHub metadata. DataHub searches show which datasets are queryable. Context flows between services automatically.

  • Enterprise Security


    Fail-closed authentication, TLS enforcement, prompt injection protection, and read-only mode. Built for production environments.

  • Personas


    Define who can use which tools. Analysts get read access. Admins get everything. Map from your identity provider's roles.

  • Audit Logging


    Every tool call logged to PostgreSQL: who called what, when, how long, success or failure. Automatic retention cleanup and parameter sanitization.

  • MCP Apps


    Interactive UI components for tool results. Tables, charts, and filters rendered in iframes alongside AI responses.


The Problem We Solve

When AI assistants query your data, they work blind. They see column names and types, but they don't know what the data means.

Consider this scenario: An AI assistant is helping a new analyst understand customer data. Without semantic context:

  1. "What tables have customer information?" → Returns table names
  2. "Describe the customers table" → Returns columns and types
  3. "Is this data up to date?" → Requires separate DataHub query
  4. "Who should I contact about issues?" → Requires another DataHub query
  5. "Are there data quality concerns?" → Yet another query
  6. "Wait, is this table deprecated?" → Finally discovers the table was deprecated months ago

With mcp-data-platform, the first response includes everything:

Table: customers (DEPRECATED - use customers_v2)
─────────────────────────────────────────────────
Owners: Data Platform Team
Domain: Customer Analytics
Tags: pii, gdpr-relevant
Quality Score: 67% (degraded)
Last Updated: 2 weeks ago

Columns:
  customer_id    BIGINT      Unique customer identifier
  email          VARCHAR     Contact email (PII - handle per GDPR policy)
  created_at     TIMESTAMP   Account creation timestamp
  segment        VARCHAR     Marketing segment (see glossary: customer-segment)

One call. Complete context. Warnings front and center.


See It In Action

Before: Raw Trino Response

{
  "columns": [
    {"name": "customer_id", "type": "BIGINT"},
    {"name": "email", "type": "VARCHAR"},
    {"name": "segment", "type": "VARCHAR"}
  ]
}

Just structure. No meaning. No warnings. No ownership.

After: Enriched Response with mcp-data-platform

{
  "columns": [
    {"name": "customer_id", "type": "BIGINT"},
    {"name": "email", "type": "VARCHAR"},
    {"name": "segment", "type": "VARCHAR"}
  ],
  "semantic_context": {
    "description": "Core customer records with PII data",
    "deprecation": {
      "deprecated": true,
      "note": "Use customers_v2 for GDPR compliance",
      "replacement": "urn:li:dataset:customers_v2"
    },
    "owners": [
      {"name": "Data Platform Team", "type": "group", "email": "[email protected]"}
    ],
    "tags": ["pii", "gdpr-relevant"],
    "domain": {"name": "Customer Analytics", "urn": "urn:li:domain:customer"},
    "quality_score": 0.67,
    "columns": {
      "customer_id": {"description": "Unique customer identifier", "tags": []},
      "email": {"description": "Contact email", "tags": ["pii"], "glossary_term": "customer-email"},
      "segment": {"description": "Marketing segment", "glossary_term": "customer-segment"}
    }
  }
}

The AI assistant now knows:

  • This table is deprecated with a clear migration path
  • It contains PII requiring special handling
  • Who to contact when something is wrong
  • The data quality has degraded (67%)
  • What each column actually means

How It Works

sequenceDiagram
    participant AI as AI Assistant
    participant P as mcp-data-platform
    participant T as Trino
    participant D as DataHub

    AI->>P: trino_describe_table "customers"
    P->>T: DESCRIBE customers
    T-->>P: columns, types
    P->>D: Get semantic context
    D-->>P: description, owners, tags, quality, deprecation
    P-->>AI: Schema + Full Business Context

The enrichment middleware intercepts tool responses and adds semantic context before returning to the client. This cross-injection works in both directions:

When you use     You also get
Trino            DataHub metadata (owners, tags, quality, deprecation)
DataHub search   Which datasets are queryable in Trino
S3               DataHub metadata for matching datasets
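The middleware pattern behind this enrichment can be sketched as a handler wrapper. The types and functions below (`ToolResult`, `withEnrichment`, etc.) are simplified stand-ins, not the platform's actual API:

```go
package main

import "fmt"

// SemanticContext is a simplified stand-in for DataHub metadata.
type SemanticContext struct {
	Owners     []string
	Deprecated bool
}

// ToolResult is a simplified tool response: raw payload plus optional context.
type ToolResult struct {
	Dataset string
	Payload string
	Context *SemanticContext
}

// Handler processes one tool call for a dataset.
type Handler func(dataset string) ToolResult

// ContextLookup resolves a dataset name to its semantic context.
type ContextLookup func(dataset string) (SemanticContext, bool)

// withEnrichment wraps a handler so each response carries semantic context
// when the lookup finds a match -- the cross-injection step described above.
func withEnrichment(next Handler, lookup ContextLookup) Handler {
	return func(dataset string) ToolResult {
		res := next(dataset)
		if ctx, ok := lookup(dataset); ok {
			res.Context = &ctx
		}
		return res
	}
}

func main() {
	describe := func(dataset string) ToolResult {
		return ToolResult{Dataset: dataset, Payload: "columns: customer_id, email"}
	}
	lookup := func(dataset string) (SemanticContext, bool) {
		if dataset == "customers" {
			return SemanticContext{Owners: []string{"Data Platform Team"}, Deprecated: true}, true
		}
		return SemanticContext{}, false
	}
	res := withEnrichment(describe, lookup)("customers")
	fmt.Println(res.Context.Deprecated) // true -- deprecation surfaced with the schema
}
```

Because the wrapper sits between the toolkits and the client, neither Trino nor DataHub needs to know about the other.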

Cross-injection details


Lineage-Aware Inheritance

Downstream datasets often lack documentation even when their upstream sources are well-documented. The platform automatically inherits column metadata from upstream tables via DataHub lineage.

sequenceDiagram
    participant AI as AI Assistant
    participant P as mcp-data-platform
    participant D as DataHub

    AI->>P: trino_describe_table "elasticsearch.sales"
    P->>D: Get schema (no descriptions)
    P->>D: Get upstream lineage
    D-->>P: cassandra.system_sale (1 hop)
    P->>D: Get upstream schema (has descriptions)
    P-->>AI: Schema + Inherited Column Context

What Gets Inherited

Metadata         Example
Descriptions     "Net sale amount before adjustments"
Glossary Terms   urn:li:glossaryTerm:NetSaleAmount
Tags             pii, financial

Provenance Tracking

Every inherited field includes its source:

{
  "column_context": {
    "amount": {
      "description": "Net sale amount before adjustments",
      "inherited_from": {
        "source_dataset": "urn:li:dataset:cassandra.system_sale",
        "source_column": "initial_net",
        "hops": 1,
        "match_method": "name_transformed"
      }
    }
  },
  "inheritance_sources": ["urn:li:dataset:cassandra.system_sale"]
}

Configuration

semantic:
  provider: datahub
  instance: primary

  lineage:
    enabled: true
    max_hops: 2
    inherit:
      - glossary_terms
      - descriptions
      - tags
    prefer_column_lineage: true

    # Strip prefixes for nested JSON paths
    column_transforms:
      - strip_prefix: "rxtxmsg.payload."
      - strip_prefix: "rxtxmsg.header."

    # Explicit mappings when lineage isn't in DataHub
    aliases:
      - source: "cassandra.prod_fuse.system_sale"
        targets:
          - "elasticsearch.default.jakes-sale-*"
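The `column_transforms` rules above can be understood as prefix stripping before column-name matching, so a nested JSON path like `rxtxmsg.payload.amount` can match the upstream column `amount`. A minimal sketch of that behavior (the platform's real matching logic may be more involved):

```go
package main

import (
	"fmt"
	"strings"
)

// stripPrefixes applies the first matching strip_prefix rule to a column
// name, mirroring the column_transforms configuration shown above.
func stripPrefixes(column string, prefixes []string) string {
	for _, p := range prefixes {
		if strings.HasPrefix(column, p) {
			return strings.TrimPrefix(column, p)
		}
	}
	return column // no rule matched; keep the name as-is
}

func main() {
	prefixes := []string{"rxtxmsg.payload.", "rxtxmsg.header."}
	fmt.Println(stripPrefixes("rxtxmsg.payload.amount", prefixes)) // amount
	fmt.Println(stripPrefixes("customer_id", prefixes))            // customer_id
}
```

A transformed match like this is what the provenance field `"match_method": "name_transformed"` records.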

Lineage inheritance details


Quick Start

Run locally with your own credentials:

# Install
go install github.com/txn2/mcp-data-platform/cmd/mcp-data-platform@latest

# Add to Claude Code
claude mcp add mcp-data-platform -- mcp-data-platform --config platform.yaml

No MCP authentication needed. Uses your configured DataHub/Trino/S3 credentials.

Deploy as a shared service with Keycloak authentication:

mcp-data-platform --config platform.yaml --transport http --address :8443

Users connect via Claude Desktop and authenticate through your identity provider.

Or run it with Docker:

docker run -v /path/to/platform.yaml:/etc/mcp/platform.yaml \
  ghcr.io/txn2/mcp-data-platform:latest \
  --config /etc/mcp/platform.yaml

Choose Your Path

  • Deploy the Server


    Configure via YAML. Connect DataHub, add Trino and S3 if you have them. Works out of the box.

    Server Guide

  • Build Your Own


    Import the Go library. Add custom tools, swap providers, write middleware. Make it yours.

    Library Guide


What's Included

Toolkit   Tools       Purpose
DataHub   11 tools    Search, metadata, lineage, glossary, domains
Trino     7 tools     SQL queries, schema exploration, catalogs
S3        6-9 tools   Bucket/object operations, presigned URLs

DataHub is the foundation and serves as your semantic layer. Add Trino for SQL queries, S3 for object storage. Use what you have.

Tools reference


Use Cases

Enterprise Data Governance

  • Audit Trails: Every query logged with user identity
  • PII Protection: Tag-based warnings for sensitive data
  • Access Control: Persona system enforces who can query what
  • Deprecation Enforcement: Warnings surface before using stale data

Data Democratization

  • Self-Service: Business users explore data with context
  • Cross-Team Discovery: Find datasets across systems
  • Onboarding: New team members understand data immediately

AI/ML Workflows

  • Autonomous Exploration: AI agents discover datasets without guidance
  • Feature Discovery: Find ML features with quality scores
  • Quality Gates: Avoid problematic datasets automatically

Examples and patterns


Security

Built on a fail-closed security model: missing or invalid credentials always result in denied access, never a silent bypass.

Feature                       Description
Fail-Closed Authentication    Invalid credentials = denied (never bypass)
Required JWT Claims           Tokens must include sub and exp
TLS for HTTP                  Configurable TLS with plaintext warnings
Prompt Injection Protection   Metadata sanitization
Read-Only Mode                Enforced at query level
Default-Deny Personas         No implicit tool access
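The fail-closed decision can be sketched as a validation chain in which every branch except full success denies. This is illustrative only (the `Claims` type and `authorize` function are hypothetical); real validation also verifies token signatures:

```go
package main

import (
	"errors"
	"fmt"
)

// Claims is a minimal view of the JWT claims this sketch checks.
type Claims struct {
	Sub string // subject (required)
	Exp int64  // expiry as a Unix timestamp (required)
}

// authorize denies unless every check passes: a missing or malformed token
// is never treated as anonymous access.
func authorize(claims *Claims, now int64) error {
	if claims == nil {
		return errors.New("denied: no token")
	}
	if claims.Sub == "" {
		return errors.New("denied: missing required claim sub")
	}
	if claims.Exp == 0 || claims.Exp <= now {
		return errors.New("denied: missing or expired exp")
	}
	return nil // all checks passed
}

func main() {
	now := int64(1700000000)
	fmt.Println(authorize(nil, now))                              // denied: no token
	fmt.Println(authorize(&Claims{Sub: "alice", Exp: now + 3600}, now)) // <nil>
}
```

The default-deny persona model follows the same shape: a tool call proceeds only when an explicit grant exists.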

Security documentation


Runs With

  • Claude Desktop (HTTP endpoint or stdio)
  • Claude Code (stdio or HTTP)
  • Any MCP client

Built On

Project       What it does
mcp-trino     Trino queries
mcp-datahub   DataHub metadata
mcp-s3        S3 storage

These work standalone. This platform wires them together with cross-injection, auth, and personas.