Cross-Injection¶
Cross-injection is the key differentiator of mcp-data-platform. It automatically enriches tool responses with context from other services, without additional calls from the AI assistant.
The Problem It Solves¶
Without cross-injection, understanding a single table requires multiple round-trips:
sequenceDiagram
participant AI as AI Assistant
participant T as Trino
participant D as DataHub
AI->>T: DESCRIBE orders
T-->>AI: columns, types
Note over AI: Need more context...
AI->>D: Search for "orders"
D-->>AI: URN, basic info
AI->>D: Get entity details
D-->>AI: owners, tags
AI->>D: Check deprecation
D-->>AI: deprecated: true
AI->>D: Get quality score
D-->>AI: 87%
Five API calls to understand one table. Each call adds latency and requires the AI to remember to ask.
With Cross-Injection¶
The same workflow becomes a single call:
sequenceDiagram
participant AI as AI Assistant
participant P as Platform
participant T as Trino
participant D as DataHub
AI->>P: trino_describe_table "orders"
P->>T: DESCRIBE orders
T-->>P: columns, types
P->>D: GetTableContext
D-->>P: owners, tags, quality, deprecation
Note over P: Combine into single response
P-->>AI: Schema + Complete Business Context
One call. Complete context. Warnings front and center.
How It Works¶
The enrichment middleware intercepts tool responses and adds relevant context:
flowchart TB
subgraph "Request Processing"
Req[Tool Request]
Toolkit[Toolkit Handler]
Result[Raw Result]
end
subgraph "Enrichment Layer"
Check{Check toolkit kind}
TrinoEnrich[Trino Enrichment]
DataHubEnrich[DataHub Enrichment]
S3Enrich[S3 Enrichment]
end
subgraph "Providers"
Semantic[Semantic Provider]
Query[Query Provider]
Storage[Storage Provider]
end
subgraph "Response"
Enriched[Enriched Result]
end
Req --> Toolkit
Toolkit --> Result
Result --> Check
Check -->|trino| TrinoEnrich
Check -->|datahub| DataHubEnrich
Check -->|s3| S3Enrich
TrinoEnrich --> Semantic
DataHubEnrich --> Query
DataHubEnrich --> Storage
S3Enrich --> Semantic
Semantic --> Enriched
Query --> Enriched
Storage --> Enriched
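The branch on toolkit kind in the flowchart can be sketched as a small dispatch middleware. All names here (`enrichment_middleware`, the per-toolkit enrichers) are illustrative, not the platform's actual API:

```python
from typing import Callable

# Hypothetical enrichers keyed by toolkit kind, mirroring the
# flowchart's "Check toolkit kind" branch.
def enrich_trino(result: dict) -> dict:
    # Trino results get semantic context from the DataHub provider.
    result["semantic_context"] = {"source": "datahub"}
    return result

def enrich_datahub(result: dict) -> dict:
    # DataHub results get query/storage availability context.
    result["query_context"] = {"source": "trino"}
    return result

def enrich_s3(result: dict) -> dict:
    # S3 results get semantic context from the DataHub provider.
    result["semantic_context"] = {"source": "datahub"}
    return result

ENRICHERS: dict[str, Callable[[dict], dict]] = {
    "trino": enrich_trino,
    "datahub": enrich_datahub,
    "s3": enrich_s3,
}

def enrichment_middleware(toolkit_kind: str, raw_result: dict) -> dict:
    """Intercept a raw tool result and attach provider context, if any."""
    enricher = ENRICHERS.get(toolkit_kind)
    return enricher(raw_result) if enricher else raw_result
```

An unrecognized toolkit kind simply passes the raw result through, which matches the opt-in nature of each enrichment path.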
Enrichment Flow by Toolkit¶
| Toolkit | Provider Used | Context Added |
|---|---|---|
| Trino | Semantic (DataHub) | Owners, tags, quality, deprecation, glossary terms |
| DataHub | Query (Trino) + Storage (S3) | Query availability, sample SQL, storage availability |
| S3 | Semantic (DataHub) | Matching dataset metadata from DataHub |
What Gets Injected¶
Semantic Context (Trino → DataHub)¶
When you query or describe a Trino table, the response includes:
{
"columns": [
{"name": "order_id", "type": "BIGINT"},
{"name": "customer_id", "type": "BIGINT"},
{"name": "total_amount", "type": "DECIMAL(10,2)"},
{"name": "created_at", "type": "TIMESTAMP"}
],
"semantic_context": {
"urn": "urn:li:dataset:(urn:li:dataPlatform:trino,hive.sales.orders,PROD)",
"description": "Customer orders with line items and payment information",
"owners": [
{
"name": "Data Platform Team",
"type": "group",
"email": "[email protected]"
},
{
"name": "Jane Smith",
"type": "user",
"email": "[email protected]"
}
],
"tags": ["pii", "financial", "tier-1"],
"domain": {
"name": "Sales",
"urn": "urn:li:domain:sales"
},
"glossary_terms": [
{
"name": "Order",
"urn": "urn:li:glossaryTerm:order"
}
],
"quality_score": 0.92,
"deprecation": null,
"columns": {
"order_id": {
"description": "Unique order identifier",
"tags": [],
"glossary_term": null
},
"customer_id": {
"description": "Reference to customer record",
"tags": ["pii"],
"glossary_term": "customer-id"
},
"total_amount": {
"description": "Order total including tax",
"tags": ["financial"],
"glossary_term": "order-total"
}
},
"custom_properties": {
"data_retention_days": "365",
"pii_classification": "internal"
}
}
}
Deprecation Warning¶
If a table is deprecated, it appears prominently:
{
"semantic_context": {
"description": "Legacy customer orders - DO NOT USE",
"deprecation": {
"deprecated": true,
"decommission_time": "2024-06-01T00:00:00Z",
"note": "Migrated to orders_v2 with improved schema",
"replacement": {
"urn": "urn:li:dataset:(urn:li:dataPlatform:trino,hive.sales.orders_v2,PROD)",
"name": "orders_v2"
}
},
"quality_score": 0.45
}
}
Query Context (DataHub → Trino)¶
When you search DataHub, the response shows which datasets can be queried:
{
"results": [
{
"urn": "urn:li:dataset:(urn:li:dataPlatform:trino,hive.sales.orders,PROD)",
"name": "orders",
"platform": "trino",
"description": "Customer orders"
}
],
"query_context": {
"urn:li:dataset:(urn:li:dataPlatform:trino,hive.sales.orders,PROD)": {
"queryable": true,
"connection": "production",
"table_identifier": {
"catalog": "hive",
"schema": "sales",
"table": "orders"
},
"sample_query": "SELECT * FROM hive.sales.orders LIMIT 10",
"row_count": 1500000,
"last_modified": "2024-01-15T10:30:00Z"
}
}
}
Non-Queryable Datasets¶
Not all DataHub datasets have query capability:
{
"query_context": {
"urn:li:dataset:(urn:li:dataPlatform:snowflake,prod.analytics.customers,PROD)": {
"queryable": false,
"reason": "No Trino connection configured for Snowflake"
}
}
}
Storage Context (DataHub → S3)¶
When you search for S3 datasets, the response shows storage availability:
{
"results": [
{
"urn": "urn:li:dataset:(urn:li:dataPlatform:s3,data-lake-prod/sales/orders/,PROD)",
"name": "orders",
"platform": "s3"
}
],
"storage_context": {
"urn:li:dataset:(urn:li:dataPlatform:s3,data-lake-prod/sales/orders/,PROD)": {
"available": true,
"connection": "data_lake",
"bucket": "data-lake-prod",
"prefix": "sales/orders/",
"format": "parquet",
"size_bytes": 1073741824,
"last_modified": "2024-01-15T06:00:00Z",
"partition_keys": ["year", "month"]
}
}
}
S3 → DataHub Context¶
When you list S3 objects, the response includes matching DataHub metadata:
{
"objects": [
{
"key": "sales/orders/year=2024/month=01/data.parquet",
"size": 1048576,
"last_modified": "2024-01-15T06:00:00Z"
}
],
"semantic_context": {
"s3://data-lake-prod/sales/orders/": {
"urn": "urn:li:dataset:(urn:li:dataPlatform:s3,data-lake-prod/sales/orders/,PROD)",
"description": "Customer orders in Parquet format",
"owners": [
{"name": "Data Engineering", "type": "group"}
],
"tags": ["pii", "partitioned"],
"quality_score": 0.95
}
}
}
Configuration¶
Enable cross-injection in your configuration:
# Enable/disable specific enrichment paths
injection:
trino_semantic_enrichment: true # Trino results get DataHub context
datahub_query_enrichment: true # DataHub results show Trino availability
datahub_storage_enrichment: true # DataHub results show S3 availability
s3_semantic_enrichment: true # S3 results get DataHub context
column_context_filtering: true # Only include SQL-referenced columns (default)
# Configure the semantic provider (for Trino/S3 enrichment)
semantic:
provider: datahub
instance: primary # Must match a configured DataHub toolkit
# Caching improves enrichment performance
cache:
enabled: true
ttl: 5m
max_entries: 10000
# Configure the query provider (for DataHub enrichment)
query:
provider: trino
instance: primary # Must match a configured Trino toolkit
# Configure the storage provider (for DataHub enrichment)
storage:
provider: s3
instance: data_lake # Must match a configured S3 toolkit
Minimal Configuration¶
If you only have DataHub:
injection:
trino_semantic_enrichment: false
datahub_query_enrichment: false
s3_semantic_enrichment: false
# Only DataHub toolkit, no cross-injection
toolkits:
datahub:
primary:
url: https://datahub.example.com
token: ${DATAHUB_TOKEN}
Full Cross-Injection¶
Complete configuration with all services:
injection:
trino_semantic_enrichment: true
datahub_query_enrichment: true
datahub_storage_enrichment: true
s3_semantic_enrichment: true
column_context_filtering: true # Only include SQL-referenced columns (default)
semantic:
provider: datahub
instance: primary
cache:
enabled: true
ttl: 5m
query:
provider: trino
instance: production
storage:
provider: s3
instance: data_lake
toolkits:
datahub:
primary:
url: https://datahub.example.com
token: ${DATAHUB_TOKEN}
trino:
production:
host: trino.example.com
port: 443
ssl: true
catalog: hive
s3:
data_lake:
region: us-east-1
access_key_id: ${AWS_ACCESS_KEY_ID}
secret_access_key: ${AWS_SECRET_ACCESS_KEY}
Failure Handling¶
Enrichment is designed to fail gracefully:
DataHub Unavailable¶
sequenceDiagram
participant AI as AI Assistant
participant P as Platform
participant T as Trino
participant D as DataHub
AI->>P: trino_query
P->>T: Execute query
T-->>P: Query results
P->>D: GetTableContext
D-->>P: Connection refused
Note over P: Log warning, continue without enrichment
P-->>AI: Query results (no semantic_context)
The query still succeeds; you just don't get the enrichment.
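That "log a warning, continue without enrichment" behavior can be sketched as a wrapper around the provider call. `fetch_semantic_context` is a hypothetical stand-in for the real DataHub lookup:

```python
import logging

logger = logging.getLogger("enrichment")

def fetch_semantic_context(table: str) -> dict:
    # Stand-in for the real DataHub call; here it simulates an outage.
    raise ConnectionRefusedError("connection refused")

def enrich_safely(result: dict, table: str) -> dict:
    """Attach semantic context, but never fail the underlying query."""
    try:
        result["semantic_context"] = fetch_semantic_context(table)
    except Exception as exc:
        # Log a warning and return the raw tool result untouched.
        logger.warning("enrichment skipped for %s: %s", table, exc)
    return result
```

Any provider error leaves the original result intact, so the AI assistant always gets its query results back.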
Entity Not Found¶
If a table exists in Trino but not in DataHub, the result is returned without `semantic_context`; no error is raised.
Partial Enrichment¶
Some fields may be missing if not configured in DataHub:
{
"semantic_context": {
"description": "Orders table",
"owners": [],
"tags": [],
"quality_score": null,
"deprecation": null
}
}
Performance Considerations¶
Latency Impact¶
Enrichment adds 50-200ms per request depending on:
- DataHub API response time
- Network latency
- Cache hit rate
Caching Strategy¶
semantic:
cache:
enabled: true
ttl: 5m # How long to cache entries
max_entries: 10000 # Maximum cache size
Recommended settings by use case:
| Use Case | TTL | Max Entries |
|---|---|---|
| Development | 1m | 1000 |
| Production | 5m | 10000 |
| High-traffic | 15m | 50000 |
| Real-time requirements | 30s | 5000 |
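A minimal sketch of the time-based cache the settings above describe, assuming `ttl` expiry on read and oldest-first eviction at `max_entries` (the actual eviction policy is not documented here):

```python
import time
from collections import OrderedDict

class TTLCache:
    """Time-based cache with a maximum entry count."""

    def __init__(self, ttl_seconds: float, max_entries: int):
        self.ttl = ttl_seconds
        self.max_entries = max_entries
        # key -> (stored_at, value), in insertion order
        self._entries: OrderedDict[str, tuple[float, object]] = OrderedDict()

    def get(self, key: str):
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if time.monotonic() - stored_at > self.ttl:
            del self._entries[key]  # expired: drop and miss
            return None
        return value

    def set(self, key: str, value: object) -> None:
        if key not in self._entries and len(self._entries) >= self.max_entries:
            self._entries.popitem(last=False)  # evict the oldest entry
        self._entries[key] = (time.monotonic(), value)
```

With `ttl: 5m` and `max_entries: 10000`, a repeated `trino_describe_table` within five minutes is served from this cache instead of hitting DataHub again.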
Cache Invalidation¶
The cache is time-based. For immediate updates:
- Reduce TTL
- Restart the server (clears cache)
- Use the API to clear specific entries (if implemented)
Session Metadata Deduplication¶
The Problem¶
When Trino enrichment is enabled, every tool call targeting a table receives ~2KB of semantic metadata (owners, tags, columns, glossary terms). In a typical AI session querying the same table 5-10 times, this repeats 10-20KB of identical metadata, consuming LLM context tokens with no new information.
How It Works¶
Session dedup tracks which tables have been enriched per client session. On the first call, full metadata is sent. On repeat calls within the TTL, reduced content is sent based on the configured mode.
sequenceDiagram
participant Client
participant Platform
participant Trino
participant DataHub
Note over Client,DataHub: First call - full enrichment
Client->>Platform: trino_describe_table(orders)
Platform->>Trino: Describe table
Trino-->>Platform: Columns, types
Platform->>DataHub: Get semantic context
DataHub-->>Platform: Owners, tags, glossary terms
Platform-->>Client: Result + semantic_context + column_context
Note over Client,DataHub: Repeat call - dedup active
Client->>Platform: trino_query(SELECT * FROM orders)
Platform->>Trino: Execute query
Trino-->>Platform: Query results
Note over Platform: Session cache hit for "orders"
Platform-->>Client: Result + metadata_reference
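The "session cache hit" in the diagram can be sketched as a per-session tracker: an in-memory map keyed by session ID, with per-table timestamps checked against `entry_ttl`. Names and structure are assumptions:

```python
import time

class SessionDedup:
    """Track which tables have already received full metadata per session."""

    def __init__(self, entry_ttl: float = 300.0):
        self.entry_ttl = entry_ttl
        # session_id -> {table_name: time full metadata was sent}
        self._seen: dict[str, dict[str, float]] = {}

    def should_send_full(self, session_id: str, table: str) -> bool:
        """True on first call (or after TTL expiry); False on repeats."""
        tables = self._seen.setdefault(session_id, {})
        sent_at = tables.get(table)
        now = time.monotonic()
        if sent_at is None or now - sent_at > self.entry_ttl:
            tables[table] = now  # (re)record the full send
            return True
        return False  # within TTL: send reduced content per the mode
```

When `should_send_full` returns `False`, the platform substitutes the reduced payload for the configured mode (`reference`, `summary`, or `none`).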
Dedup Modes¶
| Mode | First Call | Repeat Calls | Use Case |
|---|---|---|---|
| `reference` (default) | Full `semantic_context` + `column_context` | `metadata_reference` with table names and a note | Most scenarios - minimal tokens, LLM can refer back |
| `summary` | Full | Table-level `semantic_context` only, no column details | When LLM needs a reminder of table-level context |
| `none` | Full | No enrichment appended | Maximum token savings |
Mode: reference (default)¶
Repeat calls return a compact reference:
{
"metadata_reference": {
"tables": ["hive.sales.orders"],
"note": "Full semantic metadata was provided earlier in this session. Refer to previous responses for column descriptions, tags, owners, and glossary terms."
}
}
Mode: summary¶
Repeat calls return table-level context without column details:
{
"semantic_context": {
"description": "Customer orders with line items",
"owners": [{"name": "Data Team", "type": "group"}],
"tags": ["pii", "financial"],
"domain": {"name": "Sales"},
"quality_score": 0.92
},
"note": "Summary only. Full column metadata was provided earlier in this session."
}
Mode: none¶
Repeat calls return the raw tool result with no enrichment appended.
Configuration¶
injection:
trino_semantic_enrichment: true
column_context_filtering: true # Only include SQL-referenced columns (default)
session_dedup:
enabled: true # Default: true
mode: reference # reference (default), summary, none
entry_ttl: 5m # Defaults to semantic.cache.ttl
session_timeout: 30m # Defaults to server.streamable.session_timeout
| Field | Type | Default | Description |
|---|---|---|---|
| `enabled` | bool | `true` | Whether session dedup is active |
| `mode` | string | `reference` | What to send for repeat queries: `reference`, `summary`, `none` |
| `entry_ttl` | duration | semantic cache TTL | How long a table is considered "already sent" |
| `session_timeout` | duration | streamable session timeout | Idle time before a session's dedup state is cleaned up |
Behavior Notes¶
- Enabled by default: Session dedup activates automatically when `trino_semantic_enrichment: true`. Set `session_dedup.enabled: false` to disable.
- Trino-only: Dedup applies only to Trino tool calls. DataHub and S3 enrichment is not deduplicated.
- In-memory state: Session state is stored in memory and lost on restart. This is by design - after a restart, the LLM gets fresh metadata.
- Session isolation: Each client session has independent dedup state. Two sessions querying the same table both get full metadata on their first call.
- TTL defaults: `entry_ttl` defaults to the semantic cache TTL (typically 5m); `session_timeout` defaults to the streamable HTTP session timeout (typically 30m).
- SQL parsing: The dedup logic extracts table names from SQL queries, so `SELECT * FROM orders JOIN products` correctly tracks both tables independently.
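A rough sketch of that table extraction, using a regex over `FROM`/`JOIN` clauses (the real parser is likely more robust about subqueries, aliases, and quoting):

```python
import re

def extract_tables(sql: str) -> set[str]:
    """Pull table names that follow FROM or JOIN keywords."""
    pattern = re.compile(r"\b(?:FROM|JOIN)\s+([\w.]+)", re.IGNORECASE)
    return set(pattern.findall(sql))
```

Each extracted name gets its own dedup entry, so joining a new table into an already-seen query still yields full metadata for the new table only.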
Advanced Patterns¶
Multi-Cluster Enrichment¶
When you have multiple Trino clusters:
toolkits:
trino:
production:
host: trino-prod.example.com
analytics:
host: trino-analytics.example.com
query:
provider: trino
# Map DataHub URNs to specific clusters
mappings:
"urn:li:dataset:(urn:li:dataPlatform:trino,hive.*,PROD)": production
"urn:li:dataset:(urn:li:dataPlatform:trino,analytics.*,PROD)": analytics
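The glob-style mappings above could be resolved with shell-style pattern matching. This is a sketch under the assumption that `*` wildcards match any URN segment; the platform's actual matching semantics may differ:

```python
from fnmatch import fnmatch

# Patterns from the config above, checked in declaration order.
MAPPINGS = {
    "urn:li:dataset:(urn:li:dataPlatform:trino,hive.*,PROD)": "production",
    "urn:li:dataset:(urn:li:dataPlatform:trino,analytics.*,PROD)": "analytics",
}

def resolve_cluster(urn: str, default: str = "production") -> str:
    """Return the Trino cluster instance that should serve this dataset."""
    for pattern, cluster in MAPPINGS.items():
        if fnmatch(urn, pattern):
            return cluster
    return default
```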
Custom Enrichment Fields¶
Custom properties set on a dataset in DataHub are passed through unchanged and appear in the response:
{
"semantic_context": {
"custom_properties": {
"data_owner_team": "platform",
"pii_classification": "internal",
"retention_policy": "365d"
}
}
}
Debugging Enrichment¶
Enable Debug Logging¶
With debug logging enabled, look for entries such as:
DEBUG enrichment: fetching semantic context table=orders
DEBUG enrichment: cache miss key=trino:hive.sales.orders
DEBUG enrichment: DataHub response status=200 duration=145ms
DEBUG enrichment: added semantic_context to result
Check Provider Connectivity¶
# Test DataHub
curl -H "Authorization: Bearer $DATAHUB_TOKEN" \
"https://datahub.example.com/openapi/v2/entity/dataset"
# Test Trino
curl "https://trino.example.com:443/v1/info"
Verify URN Matching¶
DataHub URNs must be constructed correctly for enrichment to find the entity.
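For a Trino table, the dataset URN is assembled from the platform, the fully qualified table name, and the environment, as in the examples throughout this page:

```python
def trino_dataset_urn(catalog: str, schema: str, table: str, env: str = "PROD") -> str:
    """Build the DataHub dataset URN for a Trino table."""
    return f"urn:li:dataset:(urn:li:dataPlatform:trino,{catalog}.{schema}.{table},{env})"
```

A mismatch in any component (wrong platform name, missing catalog, wrong environment) makes the lookup miss even when the dataset exists in DataHub.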
Verify your table exists in DataHub:
curl -H "Authorization: Bearer $DATAHUB_TOKEN" \
"https://datahub.example.com/openapi/v2/search?query=your_table&entity=dataset"
Next Steps¶
- Trino → DataHub Enrichment - Deep dive into Trino enrichment
- DataHub → Trino Enrichment - Deep dive into DataHub enrichment
- S3 Enrichment - S3-specific patterns
- Examples Gallery - Real-world configurations