Knowledge Capture¶
The Problem¶
Every organization has tribal knowledge about its data: what columns actually mean, which tables are reliable, when timestamps shifted time zones, how business metrics are calculated. This knowledge lives in the heads of experienced team members and surfaces in conversations, but it rarely makes it back into the data catalog.
AI-assisted data exploration makes this worse. Users share corrections, business context, and quality observations during sessions. The AI assistant uses that context for the current conversation, and then it's gone. The next session starts from scratch.
Knowledge capture fixes this. When a user shares domain knowledge during a session, the platform records it, routes it through a governance workflow, and writes approved changes back to DataHub.
How It Works¶
The system has three components: two MCP tools and an Admin REST API.
```mermaid
flowchart LR
    subgraph "During AI Session"
        A[User shares<br/>domain knowledge] --> B[capture_insight<br/>tool]
    end
    subgraph "PostgreSQL"
        B --> C[(knowledge_insights<br/>status: pending)]
    end
    subgraph "Admin Review"
        C --> D[Admin reviews<br/>via apply_knowledge<br/>or REST API]
        D -->|approve| E[status: approved]
        D -->|reject| F[status: rejected]
    end
    subgraph "Catalog Write-Back"
        E --> G[synthesize<br/>change proposals]
        G --> H[apply changes<br/>to DataHub]
        H --> I[(knowledge_changesets<br/>for rollback)]
    end
```
- `capture_insight` records domain knowledge during sessions. Available to all personas when enabled. Creates insights with status `pending`.
- `apply_knowledge` is an admin-only tool for reviewing, approving, synthesizing, and applying insights to DataHub.
- Admin REST API provides HTTP endpoints for managing insights and changesets outside the MCP protocol.
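For illustration, a `capture_insight` invocation travels as a standard MCP `tools/call` request. The argument names below (`entity_urn`, `insight_text`) are assumptions, not confirmed parameter names:

```python
# Hypothetical MCP tools/call payload for capture_insight.
# Argument names (entity_urn, insight_text) are illustrative assumptions.
request = {
    "method": "tools/call",
    "params": {
        "name": "capture_insight",
        "arguments": {
            "category": "correction",
            "entity_urn": "urn:li:dataset:(urn:li:dataPlatform:trino,sales.orders,PROD)",
            "insight_text": "The amount column is gross margin, not revenue",
        },
    },
}

print(request["params"]["name"])  # capture_insight
```

The server would record this as a `knowledge_insights` row with status `pending` and return an `insight_id`.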
Insight Categories¶
Insights have six categories:
| Category | Description | Example |
|---|---|---|
| `correction` | Fixes wrong metadata in the catalog | "The amount column is gross margin, not revenue" |
| `business_context` | Explains what data means in business terms | "MRR counts active subscriptions only, not trials" |
| `data_quality` | Reports quality issues or known limitations | "Timestamps before March 2024 are UTC; after that, America/Chicago" |
| `usage_guidance` | Tips for querying or interpreting data correctly | "Always filter status='active' to avoid soft-delete duplicates" |
| `relationship` | Connections between datasets not captured in lineage | "The customer_id in orders joins to the legacy CRM export" |
| `enhancement` | Suggested improvements to documentation or metadata | "Tag sales_daily with its 6 AM CT refresh schedule" |
Insight Lifecycle¶
Insights have these statuses:
```mermaid
stateDiagram-v2
    [*] --> pending: capture_insight
    pending --> approved: admin approves
    pending --> rejected: admin rejects
    pending --> superseded: newer insight replaces
    approved --> applied: changes written to DataHub
    applied --> rolled_back: changeset reverted
```
| Status | Description |
|---|---|
| `pending` | Newly captured, awaiting admin review |
| `approved` | Reviewed and approved, ready for synthesis and application |
| `rejected` | Reviewed and rejected by admin |
| `applied` | Changes have been written to DataHub |
| `superseded` | Replaced by a newer insight for the same entity |
| `rolled_back` | Applied changes were reverted via changeset rollback |
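The lifecycle is a small state machine. As a sketch, the transition table below is derived from the diagram above; the `can_transition` helper is hypothetical, not part of the platform:

```python
# Allowed insight status transitions, per the lifecycle diagram above.
# Terminal statuses (rejected, superseded, rolled_back) have no outgoing edges.
TRANSITIONS = {
    "pending": {"approved", "rejected", "superseded"},
    "approved": {"applied"},
    "applied": {"rolled_back"},
    "rejected": set(),
    "superseded": set(),
    "rolled_back": set(),
}

def can_transition(current: str, target: str) -> bool:
    """Return True if an insight may move from `current` to `target`."""
    return target in TRANSITIONS.get(current, set())

print(can_transition("pending", "approved"))   # True
print(can_transition("rejected", "applied"))   # False
```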
Governance Workflow¶
The full workflow, from capture to catalog update:
```mermaid
sequenceDiagram
    participant Analyst
    participant AI as AI Assistant
    participant MCP as mcp-data-platform
    participant PG as PostgreSQL
    participant Admin
    participant DH as DataHub

    Analyst->>AI: "The amount column is actually gross margin"
    AI->>MCP: capture_insight(category: correction, ...)
    MCP->>PG: INSERT knowledge_insights (status: pending)
    MCP-->>AI: insight_id: a1b2c3...
    Note over Admin: Later, during review
    Admin->>MCP: apply_knowledge(action: bulk_review)
    MCP->>PG: SELECT pending insights
    MCP-->>Admin: 3 pending insights for orders table
    Admin->>MCP: apply_knowledge(action: approve, insight_ids: [...])
    MCP->>PG: UPDATE status = approved
    Admin->>MCP: apply_knowledge(action: synthesize, entity_urn: ...)
    MCP->>DH: Get current metadata
    MCP-->>Admin: Proposed changes with current vs suggested values
    Admin->>MCP: apply_knowledge(action: apply, changes: [...], confirm: true)
    MCP->>DH: Update description, add tags
    MCP->>PG: INSERT knowledge_changesets (previous_value for rollback)
    MCP->>PG: UPDATE insights status = applied
    MCP-->>Admin: Changeset cs_x1y2z3 recorded
```
This is a human-in-the-loop metadata curation workflow. Insights captured by any user go through admin review before modifying the catalog. Every change is tracked with a changeset that records previous values for rollback.
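The admin half of this sequence can be sketched as successive `apply_knowledge` argument payloads. The action names come from the diagram above; the exact argument shapes (`insight_ids`, `changes`, the field names inside a change) are illustrative assumptions:

```python
# Hypothetical apply_knowledge argument payloads mirroring the sequence above.
review = {"action": "bulk_review"}
approve = {"action": "approve", "insight_ids": ["a1b2c3"]}
synthesize = {
    "action": "synthesize",
    "entity_urn": "urn:li:dataset:(urn:li:dataPlatform:trino,sales.orders,PROD)",
}
apply_changes = {
    "action": "apply",
    "changes": [{"field": "description",
                 "suggested": "Gross margin per order line"}],
    "confirm": True,  # required when knowledge.apply.require_confirmation is true
}

workflow = [review, approve, synthesize, apply_changes]
print([step["action"] for step in workflow])
# ['bulk_review', 'approve', 'synthesize', 'apply']
```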
Configuration¶
```yaml
knowledge:
  enabled: true
  apply:
    enabled: true
    datahub_connection: primary
    require_confirmation: true
```
| Field | Type | Default | Description |
|---|---|---|---|
| `knowledge.enabled` | bool | `false` | Enable the knowledge capture toolkit and `capture_insight` tool |
| `knowledge.apply.enabled` | bool | `false` | Enable the `apply_knowledge` tool for admin review and catalog write-back |
| `knowledge.apply.datahub_connection` | string | - | DataHub instance name for write-back operations |
| `knowledge.apply.require_confirmation` | bool | `false` | When `true`, the `apply` action requires `confirm: true` in the request |
Prerequisites
Knowledge capture requires `database.dsn` to be configured for PostgreSQL storage. The `apply_knowledge` tool requires the `admin` persona.
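Assuming `database.dsn` sits at the top level of the same configuration file, a minimal end-to-end setup might look like this (the DSN value is a placeholder):

```yaml
database:
  dsn: postgresql://mcp:secret@localhost:5432/mcp  # placeholder credentials
knowledge:
  enabled: true
  apply:
    enabled: true
    datahub_connection: primary
    require_confirmation: true
```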
Persona Integration¶
Control who can capture and apply knowledge through persona tool filtering:
```yaml
personas:
  definitions:
    analyst:
      display_name: "Data Analyst"
      roles: ["analyst"]
      tools:
        allow:
          - "trino_*"
          - "datahub_*"
          - "capture_insight"   # Can capture knowledge
        deny:
          - "apply_knowledge"   # Cannot apply changes
    admin:
      display_name: "Administrator"
      roles: ["admin"]
      tools:
        allow: ["*"]            # Full access including apply_knowledge
    etl_service:
      display_name: "ETL Service"
      roles: ["service"]
      tools:
        allow:
          - "trino_*"
        deny:
          - "capture_insight"   # Automated processes should not capture
          - "apply_knowledge"
```
Insight Sources¶
Insights track where the knowledge came from via the `source` field:
| Source | Description | Example |
|---|---|---|
| `user` | Knowledge shared by the user during conversation (default) | User says "The amount column is gross margin, not revenue" |
| `agent_discovery` | Knowledge the AI agent figured out independently | Agent samples data and discovers a column contains ISO country codes |
| `enrichment_gap` | Metadata gap flagged for admin attention | Table has no description and the agent cannot determine its purpose from the data |
The `source` field is optional when calling `capture_insight`. When omitted, it defaults to `user`.
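For example, an agent-discovered insight sets the field explicitly; apart from `category` and `source`, the argument names here are illustrative assumptions:

```python
# Hypothetical capture_insight arguments for an agent-discovered insight.
arguments = {
    "category": "business_context",
    "source": "agent_discovery",  # omit to default to "user"
    "entity_urn": "urn:li:dataset:(urn:li:dataPlatform:trino,sales.customers,PROD)",
    "insight_text": "country_code contains ISO 3166-1 alpha-2 values",
}

print(arguments.get("source", "user"))  # agent_discovery
```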
AI Agent Guidance¶
The toolkit registers an MCP prompt called `knowledge_capture_guidance` that tells AI assistants when to capture insights. The prompt covers:
When to capture (user-provided):
- User corrects a column description, table purpose, or data interpretation
- User explains what data means in business terms not captured in metadata
- User reports data quality issues or known limitations
- User shares tips on how to query or interpret data correctly
- User explains connections between datasets not captured in lineage
- User suggests improvements to existing documentation or metadata
When to capture (agent-discovered):
- Agent discovers what a column means by sampling actual data (set `source: "agent_discovery"`)
- Agent finds join relationships not documented in lineage metadata
- Agent identifies data quality patterns (nulls, outliers, encoding issues)
- Agent resolves ambiguous column names by examining values
- Agent encounters metadata that is missing or clearly wrong and cannot resolve it from the data (set `source: "enrichment_gap"`)
When to ask the user instead:
- Enrichment is insufficient and the agent cannot resolve it from the data alone
- Multiple interpretations are equally plausible
- The insight would have high impact (e.g., PII classification, deprecation status)
When not to capture:
- Transient questions or debugging ("why is my query slow?")
- Personal preferences ("I prefer using CTEs")
- Information already present in the catalog metadata
- Vague or unverifiable claims without specific context
- Trivially obvious gaps without adding what the data actually means
- Speculative interpretations without evidence from querying
- The same gap repeatedly within a session
The prompt is available via `prompts/list` and `prompts/get` in the MCP protocol.
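Per the MCP protocol, a client would fetch the guidance with a standard `prompts/get` request:

```python
# Standard MCP prompts/get request for the guidance prompt.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "prompts/get",
    "params": {"name": "knowledge_capture_guidance"},
}

print(request["method"])  # prompts/get
```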
Next Steps¶
- Governance Workflow -- review process, synthesis, applying changes, changeset tracking, and rollback
- Admin API -- REST endpoints for managing insights and changesets
- Audit Logging -- all knowledge tool calls are audit logged
- Personas -- control access to knowledge tools via personas