Knowledge Capture¶

The Problem¶

Every organization has tribal knowledge about its data: what columns actually mean, which tables are reliable, when timestamps shifted time zones, how business metrics are calculated. This knowledge lives in the heads of experienced team members and surfaces in conversations, but it rarely makes it back into the data catalog.

AI-assisted data exploration makes this worse. Users share corrections, business context, and quality observations during sessions. The AI assistant uses that context for the current conversation, and then it's gone. The next session starts from scratch.

Knowledge capture fixes this. When a user shares domain knowledge during a session, the platform records it, routes it through a governance workflow, and writes approved changes back to DataHub.

How It Works¶

The system has three components: two MCP tools and an Admin REST API.

flowchart LR
    subgraph "During AI Session"
        A[User shares<br/>domain knowledge] --> B[capture_insight<br/>tool]
    end

    subgraph "PostgreSQL"
        B --> C[(knowledge_insights<br/>status: pending)]
    end

    subgraph "Admin Review"
        C --> D[Admin reviews<br/>via apply_knowledge<br/>or REST API]
        D -->|approve| E[status: approved]
        D -->|reject| F[status: rejected]
    end

    subgraph "Catalog Write-Back"
        E --> G[synthesize<br/>change proposals]
        G --> H[apply changes<br/>to DataHub]
        H --> I[(knowledge_changesets<br/>for rollback)]
    end

capture_insight records domain knowledge during sessions. Available to all personas when enabled. Creates insights with status pending.
apply_knowledge is an admin-only tool for reviewing, approving, synthesizing, and applying insights to DataHub.
Admin REST API provides HTTP endpoints for managing insights and changesets outside the MCP protocol.

Insight Categories¶

Insights have six categories:

Category	Description	Example
`correction`	Fixes wrong metadata in the catalog	"The `amount` column is gross margin, not revenue"
`business_context`	Explains what data means in business terms	"MRR counts active subscriptions only, not trials"
`data_quality`	Reports quality issues or known limitations	"Timestamps before March 2024 are UTC; after that, America/Chicago"
`usage_guidance`	Tips for querying or interpreting data correctly	"Always filter `status='active'` to avoid soft-delete duplicates"
`relationship`	Connections between datasets not captured in lineage	"The `customer_id` in orders joins to the legacy CRM export"
`enhancement`	Suggested improvements to documentation or metadata	"Tag `sales_daily` with its 6 AM CT refresh schedule"

Insight Lifecycle¶

Insights have these statuses:

stateDiagram-v2
    [*] --> pending: capture_insight
    pending --> approved: admin approves
    pending --> rejected: admin rejects
    pending --> superseded: newer insight replaces
    approved --> applied: changes written to DataHub
    applied --> rolled_back: changeset reverted

Status	Description
`pending`	Newly captured, awaiting admin review
`approved`	Reviewed and approved, ready for synthesis and application
`rejected`	Reviewed and rejected by admin
`applied`	Changes have been written to DataHub
`superseded`	Replaced by a newer insight for the same entity
`rolled_back`	Applied changes were reverted via changeset rollback

Governance Workflow¶

Capture to catalog update:

sequenceDiagram
    participant Analyst
    participant AI as AI Assistant
    participant MCP as mcp-data-platform
    participant PG as PostgreSQL
    participant Admin
    participant DH as DataHub

    Analyst->>AI: "The amount column is actually gross margin"
    AI->>MCP: capture_insight(category: correction, ...)
    MCP->>PG: INSERT knowledge_insights (status: pending)
    MCP-->>AI: insight_id: a1b2c3...

    Note over Admin: Later, during review
    Admin->>MCP: apply_knowledge(action: bulk_review)
    MCP->>PG: SELECT pending insights
    MCP-->>Admin: 3 pending insights for orders table

    Admin->>MCP: apply_knowledge(action: approve, insight_ids: [...])
    MCP->>PG: UPDATE status = approved

    Admin->>MCP: apply_knowledge(action: synthesize, entity_urn: ...)
    MCP->>DH: Get current metadata
    MCP-->>Admin: Proposed changes with current vs suggested values

    Admin->>MCP: apply_knowledge(action: apply, changes: [...], confirm: true)
    MCP->>DH: Update description, add tags
    MCP->>PG: INSERT knowledge_changesets (previous_value for rollback)
    MCP->>PG: UPDATE insights status = applied
    MCP-->>Admin: Changeset cs_x1y2z3 recorded

This is a human-in-the-loop metadata curation workflow. Insights captured by any user go through admin review before modifying the catalog. Every change is tracked with a changeset that records previous values for rollback.

Configuration¶

knowledge:
  enabled: true
  apply:
    enabled: true
    datahub_connection: primary
    require_confirmation: true

Field	Type	Default	Description
`knowledge.enabled`	bool	`false`	Enable the knowledge capture toolkit and `capture_insight` tool
`knowledge.apply.enabled`	bool	`false`	Enable the `apply_knowledge` tool for admin review and catalog write-back
`knowledge.apply.datahub_connection`	string	-	DataHub instance name for write-back operations
`knowledge.apply.require_confirmation`	bool	`false`	When true, the `apply` action requires `confirm: true` in the request

Prerequisites

Knowledge capture requires database.dsn to be configured for PostgreSQL storage. The apply_knowledge tool requires the admin persona.

Persona Integration¶

Control who can capture and apply knowledge through persona tool filtering:

personas:
  definitions:
    analyst:
      display_name: "Data Analyst"
      roles: ["analyst"]
      tools:
        allow:
          - "trino_*"
          - "datahub_*"
          - "capture_insight"       # Can capture knowledge
        deny:
          - "apply_knowledge"       # Cannot apply changes

    admin:
      display_name: "Administrator"
      roles: ["admin"]
      tools:
        allow: ["*"]               # Full access including apply_knowledge

    etl_service:
      display_name: "ETL Service"
      roles: ["service"]
      tools:
        allow:
          - "trino_*"
        deny:
          - "capture_insight"       # Automated processes should not capture
          - "apply_knowledge"

Insight Sources¶

Insights track where the knowledge came from via the source field:

Source	Description	Example
`user`	Knowledge shared by the user during conversation (default)	User says "The amount column is gross margin, not revenue"
`agent_discovery`	Knowledge the AI agent figured out independently	Agent samples data and discovers a column contains ISO country codes
`enrichment_gap`	Metadata gap flagged for admin attention	Table has no description and the agent cannot determine its purpose from the data

The source field is optional when calling capture_insight. When omitted, it defaults to user.

AI Agent Guidance¶

The toolkit registers an MCP prompt called knowledge_capture_guidance that tells AI assistants when to capture insights. The prompt covers:

When to capture (user-provided):

User corrects a column description, table purpose, or data interpretation
User explains what data means in business terms not captured in metadata
User reports data quality issues or known limitations
User shares tips on how to query or interpret data correctly
User explains connections between datasets not captured in lineage
User suggests improvements to existing documentation or metadata

When to capture (agent-discovered):

Agent discovers what a column means by sampling actual data (set source: "agent_discovery")
Agent finds join relationships not documented in lineage metadata
Agent identifies data quality patterns (nulls, outliers, encoding issues)
Agent resolves ambiguous column names by examining values
Agent encounters metadata that is missing or clearly wrong and cannot resolve it from the data (set source: "enrichment_gap")

When to ask the user instead:

Enrichment is insufficient and the agent cannot resolve it from the data alone
Multiple interpretations are equally plausible
The insight would have high impact (e.g., PII classification, deprecation status)

When not to capture:

Transient questions or debugging ("why is my query slow?")
Personal preferences ("I prefer using CTEs")
Information already present in the catalog metadata
Vague or unverifiable claims without specific context
Trivially obvious gaps without adding what the data actually means
Speculative interpretations without evidence from querying
The same gap repeatedly within a session

The prompt is available via prompts/list and prompts/get in the MCP protocol.

Next Steps¶

Governance Workflow -- review process, synthesis, applying changes, changeset tracking, and rollback
Admin API -- REST endpoints for managing insights and changesets
Audit Logging -- all knowledge tool calls are audit logged
Personas -- control access to knowledge tools via personas