Governance Workflow

Knowledge capture turns catalog metadata into something that evolves with use. Instead of metadata being a static artifact maintained by a central team, it improves continuously as domain experts share what they know during their normal work. This is active metadata management: the catalog gets better every time someone uses it.

This page covers the admin-side workflow for reviewing captured insights and writing approved changes back to DataHub.

The Review Process

All captured insights start with status pending. Admins review them through the apply_knowledge tool or the Admin REST API.

Bulk Review

Get an overview of all pending insights:

{"action": "bulk_review"}

Response:

{
  "total_pending": 7,
  "by_entity": [
    {
      "entity_urn": "urn:li:dataset:(urn:li:dataPlatform:trino,hive.sales.orders,PROD)",
      "count": 3,
      "categories": ["correction", "business_context"],
      "latest_at": "2025-01-15T14:30:00Z"
    },
    {
      "entity_urn": "urn:li:dataset:(urn:li:dataPlatform:trino,hive.finance.revenue,PROD)",
      "count": 4,
      "categories": ["data_quality", "usage_guidance", "enhancement"],
      "latest_at": "2025-01-15T16:45:00Z"
    }
  ],
  "by_category": {
    "correction": 2,
    "business_context": 1,
    "data_quality": 2,
    "usage_guidance": 1,
    "enhancement": 1
  },
  "by_confidence": {
    "high": 3,
    "medium": 4
  }
}
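
One way to use this summary is for triage. The sketch below ranks entities by pending count so the busiest ones are reviewed first. It assumes the response has already been parsed into a Python dict with the fields shown above; `rank_entities` is a hypothetical helper, not part of the tool.

```python
from typing import Any


def rank_entities(summary: dict[str, Any]) -> list[tuple[str, int]]:
    """Return (entity_urn, count) pairs, most pending insights first."""
    rows = [(e["entity_urn"], e["count"]) for e in summary.get("by_entity", [])]
    # Sort by count descending, then by URN so ties come out in a stable order.
    return sorted(rows, key=lambda row: (-row[1], row[0]))


# Trimmed copy of the sample response above.
summary = {
    "total_pending": 7,
    "by_entity": [
        {"entity_urn": "urn:li:dataset:(urn:li:dataPlatform:trino,hive.sales.orders,PROD)", "count": 3},
        {"entity_urn": "urn:li:dataset:(urn:li:dataPlatform:trino,hive.finance.revenue,PROD)", "count": 4},
    ],
}
ranked = rank_entities(summary)
```

Each ranked URN can then be passed as entity_urn to the review action.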

Review by Entity

Drill into a specific entity to see its pending insights alongside current DataHub metadata:

{
  "action": "review",
  "entity_urn": "urn:li:dataset:(urn:li:dataPlatform:trino,hive.sales.orders,PROD)"
}

Response includes:

  • Current DataHub metadata (description, tags, glossary terms, owners)
  • All insights for this entity with their categories, text, confidence, and suggested actions

This side-by-side view helps admins assess whether an insight adds value compared to what's already in the catalog.

Approve or Reject

Transition insight statuses with optional review notes:

{
  "action": "approve",
  "insight_ids": ["a1b2c3d4e5f6", "f6e5d4c3b2a1"],
  "review_notes": "Verified with data engineering team"
}

{
  "action": "reject",
  "insight_ids": ["deadbeef1234"],
  "review_notes": "Already documented in the column description"
}

Response:

{
  "action": "approve",
  "updated": 2,
  "total": 2
}

If an ID is invalid or its status transition is not allowed, it is reported in an errors array. The valid IDs in the same request are still processed.
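
Because a request can partly succeed, a caller may want to check the counts rather than assume every ID went through. The following is a minimal sketch: it assumes that total is the number of IDs submitted and that errors is a list, and it deliberately does not inspect the entries, since their shape is not described here. `summarize_transition` is a hypothetical helper.

```python
def summarize_transition(resp: dict) -> str:
    """Describe an approve/reject response, flagging partial success."""
    errors = resp.get("errors", [])  # absent when everything succeeded (assumption)
    updated = resp.get("updated", 0)
    total = resp.get("total", 0)
    if not errors and updated == total:
        return f"{resp['action']}: all {total} insights updated"
    return f"{resp['action']}: {updated}/{total} updated, {len(errors)} reported in errors"


ok = summarize_transition({"action": "approve", "updated": 2, "total": 2})
partial = summarize_transition(
    {"action": "reject", "updated": 1, "total": 2, "errors": ["placeholder entry"]}
)
```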

Synthesizing Changes

Once insights are approved, the synthesize action gathers them and builds structured change proposals:

{
  "action": "synthesize",
  "entity_urn": "urn:li:dataset:(urn:li:dataPlatform:trino,hive.sales.orders,PROD)"
}

Response:

{
  "entity_urn": "urn:li:dataset:(urn:li:dataPlatform:trino,hive.sales.orders,PROD)",
  "current_metadata": {
    "description": "Order records",
    "tags": ["financial"],
    "glossary_terms": [],
    "owners": ["Data Platform Team"]
  },
  "approved_insights": [
    {
      "id": "a1b2c3d4e5f6",
      "category": "correction",
      "insight_text": "The amount column represents gross margin before returns, not revenue.",
      "suggested_actions": [
        {"action_type": "update_description", "target": "entity", "detail": "Order records with gross margin amounts (before returns)"}
      ]
    }
  ],
  "proposed_changes": [
    {
      "change_type": "update_description",
      "target": "entity",
      "current_value": "Order records",
      "suggested_value": "Order records with gross margin amounts (before returns)",
      "source_insight_ids": ["a1b2c3d4e5f6"]
    }
  ]
}

The synthesis output shows current values alongside proposed changes, so admins can see exactly what will change. The source_insight_ids field traces each change back to the insight that proposed it.
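
The proposals can be turned into an apply request mechanically. The sketch below assumes that each proposal's suggested_value maps onto the detail field of a change, which is inferred from the sample payloads on this page rather than stated by the tool. `build_apply_request` is a hypothetical helper.

```python
def build_apply_request(synthesis: dict, confirm: bool = False) -> dict:
    """Convert a synthesize response into an apply request payload."""
    changes = []
    insight_ids: list[str] = []
    for proposal in synthesis["proposed_changes"]:
        changes.append(
            {
                "change_type": proposal["change_type"],
                "target": proposal["target"],
                "detail": proposal["suggested_value"],  # assumed mapping
            }
        )
        # Collect each source insight once, preserving first-seen order.
        for insight_id in proposal["source_insight_ids"]:
            if insight_id not in insight_ids:
                insight_ids.append(insight_id)
    return {
        "action": "apply",
        "entity_urn": synthesis["entity_urn"],
        "changes": changes,
        "insight_ids": insight_ids,
        "confirm": confirm,
    }


synthesis = {
    "entity_urn": "urn:li:dataset:(urn:li:dataPlatform:trino,hive.sales.orders,PROD)",
    "proposed_changes": [
        {
            "change_type": "update_description",
            "target": "entity",
            "current_value": "Order records",
            "suggested_value": "Order records with gross margin amounts (before returns)",
            "source_insight_ids": ["a1b2c3d4e5f6"],
        }
    ],
}
apply_request = build_apply_request(synthesis)
```

Leaving confirm at False by default keeps a human in the loop: an admin can edit or drop changes before the request is sent with confirm set to true.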

Applying Changes

The apply action writes changes to DataHub and records a changeset. This is the data catalog write-back step.

{
  "action": "apply",
  "entity_urn": "urn:li:dataset:(urn:li:dataPlatform:trino,hive.sales.orders,PROD)",
  "changes": [
    {
      "change_type": "update_description",
      "target": "entity",
      "detail": "Order records with gross margin amounts (before returns)"
    },
    {
      "change_type": "add_tag",
      "target": "entity",
      "detail": "gross-margin"
    }
  ],
  "insight_ids": ["a1b2c3d4e5f6"],
  "confirm": true
}

When require_confirmation is enabled in configuration and confirm is not true, the tool returns a confirmation prompt instead of applying:

{
  "confirmation_required": true,
  "entity_urn": "urn:li:dataset:(urn:li:dataPlatform:trino,hive.sales.orders,PROD)",
  "changes_count": 2,
  "message": "Set confirm: true to apply these changes."
}
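
A client can treat this as a two-step handshake. The sketch below is one possible shape, not the tool's own client: `call_tool` stands for whatever function sends a request to apply_knowledge, and `fake_tool` is a hypothetical stand-in that imitates a server with require_confirmation enabled.

```python
from typing import Any, Callable

Request = dict[str, Any]
Response = dict[str, Any]


def apply_with_confirmation(
    call_tool: Callable[[Request], Response],
    request: Request,
    approve: Callable[[Response], bool],
) -> Response:
    """Send an apply request, confirming only if `approve` accepts the prompt."""
    # First pass without confirm. If require_confirmation is disabled, the
    # server applies immediately and no prompt comes back.
    first = call_tool({**request, "confirm": False})
    if not first.get("confirmation_required"):
        return first
    if not approve(first):
        return first  # declined: return the prompt, nothing was written
    return call_tool({**request, "confirm": True})


def fake_tool(req: Request) -> Response:
    """Hypothetical stand-in for apply_knowledge with confirmation enabled."""
    if req.get("confirm") is not True:
        return {"confirmation_required": True, "changes_count": len(req["changes"])}
    return {"changeset_id": "cs_example", "changes_applied": len(req["changes"])}


request = {
    "action": "apply",
    "changes": [{"change_type": "add_tag", "target": "entity", "detail": "gross-margin"}],
}
result = apply_with_confirmation(fake_tool, request, approve=lambda p: p["changes_count"] <= 5)
declined = apply_with_confirmation(fake_tool, request, approve=lambda p: False)
```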

Supported Change Types

| Change Type | Description | Target Field | Detail Field |
|---|---|---|---|
| update_description | Update entity or column description | column:<fieldPath> for columns, empty for dataset-level | New description text |
| add_tag | Add a tag to the entity | (ignored) | Tag name or URN (e.g., pii or urn:li:tag:pii) |
| remove_tag | Remove a tag from the entity | (ignored) | Tag name or URN to remove |
| add_glossary_term | Associate a glossary term | (ignored) | Glossary term name or URN |
| flag_quality_issue | Add fixed QualityIssue tag to the entity | (ignored) | Issue description (stored as context in the knowledge store) |
| add_documentation | Add a documentation link | URL of the documentation | Link description |
| add_curated_query | Create a reusable Query entity linked to the dataset | (ignored) | Query name. Also requires query_sql (SQL statement) and optionally query_description |

Tag names and glossary term names are automatically normalized to full DataHub URNs (e.g., pii becomes urn:li:tag:pii).
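
The normalization rule can be illustrated as follows. This is a sketch of the idea, not the platform's code. The urn:li:tag: prefix comes from the example above; the urn:li:glossaryTerm: prefix is an assumption based on DataHub's usual URN naming. `to_urn` is a hypothetical helper.

```python
# Prefix per kind. The glossary term prefix is an assumption, see the lead-in.
URN_PREFIXES = {
    "tag": "urn:li:tag:",
    "glossary_term": "urn:li:glossaryTerm:",
}


def to_urn(name_or_urn: str, kind: str) -> str:
    """Return a full URN, leaving values that already are URNs untouched."""
    value = name_or_urn.strip()
    if value.startswith("urn:li:"):
        return value
    return URN_PREFIXES[kind] + value


tag_urn = to_urn("pii", "tag")
unchanged = to_urn("urn:li:tag:pii", "tag")
```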

Apply Response

{
  "changeset_id": "cs_x1y2z3a4b5c6",
  "entity_urn": "urn:li:dataset:(urn:li:dataPlatform:trino,hive.sales.orders,PROD)",
  "changes_applied": 2,
  "insights_marked_applied": 1,
  "resulting_state": {
    "description": "Order records with gross margin amounts (before returns)",
    "tags": ["urn:li:tag:gross-margin"],
    "glossary_terms": [],
    "owners": []
  },
  "message": "Changes applied to DataHub. Roll back with action=rollback changeset_id=cs_x1y2z3a4b5c6. changes_applied counts requested changes; verify against resulting_state below."
}

Source insights move to applied status with a reference to the changeset.

changes_applied counts the changes that were dispatched without error; a duplicate add (for example a tag that was already present) is a no-op upstream and still counts. The resulting_state field is a fresh read-back of the entity's description, tags, glossary terms, and owners after the apply, so callers can confirm what actually persisted without a follow-up call. Writes are not transactional: if a change in the middle of the list fails, earlier changes have already persisted and are reported in the error message rather than silently rolled back.
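
Since changes_applied counts dispatches rather than persisted results, a caller can compare the request against resulting_state directly. The sketch below checks only add_tag changes and assumes tags are read back as urn:li:tag: URNs, as in the sample response. `unpersisted_tags` is a hypothetical helper.

```python
def unpersisted_tags(requested_changes: list[dict], apply_response: dict) -> list[str]:
    """Return tag URNs that were requested via add_tag but are absent after the apply."""
    state_tags = set(apply_response["resulting_state"]["tags"])
    missing = []
    for change in requested_changes:
        if change["change_type"] != "add_tag":
            continue
        tag = change["detail"]
        # Mirror the name-to-URN normalization described above.
        urn = tag if tag.startswith("urn:li:tag:") else f"urn:li:tag:{tag}"
        if urn not in state_tags:
            missing.append(urn)
    return missing


changes = [
    {"change_type": "update_description", "target": "entity", "detail": "Order records"},
    {"change_type": "add_tag", "target": "entity", "detail": "gross-margin"},
]
response = {"changes_applied": 2, "resulting_state": {"tags": ["urn:li:tag:gross-margin"]}}
missing = unpersisted_tags(changes, response)
```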

Changeset Tracking

Every apply action creates a changeset record in the knowledge_changesets table:

| Field | Description |
|---|---|
| id | Unique changeset identifier |
| target_urn | The DataHub entity that was modified |
| change_type | The change type applied, or multiple when the changeset contains more than one type |
| previous_value | Entity metadata before changes (description, tags, glossary terms, owners) |
| new_value | Changes that were applied |
| source_insight_ids | Insights that produced this changeset |
| applied_by | User who applied the changes |
| rolled_back | Whether this changeset has been reverted |
| rolled_back_by | Who reverted the changes |
| rolled_back_at | When the changes were reverted |

The previous_value field captures the entity's metadata (description, tags, glossary terms, owners) at the time of application. This before-image is what bounds rollback: a change can be reverted only when its prior state is recoverable from this snapshot.

Discovering Changesets

Use the list_changesets action to find an entity's changesets without already holding their ids (for example, before a rollback):

{ "action": "list_changesets", "entity_urn": "urn:li:dataset:(urn:li:dataPlatform:trino,hive.sales.orders,PROD)" }

Each entry returns changeset_id, created_at, applied_by, change_type, source_insight_ids, and the current rolled_back status.
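
A common use of this listing is to pick which changeset to roll back. The sketch below selects the newest entry that has not been rolled back. It assumes the entries arrive as a list of dicts and that created_at is an ISO 8601 timestamp like the ones elsewhere on this page. `latest_active_changeset` is a hypothetical helper.

```python
from datetime import datetime
from typing import Optional


def _parse(ts: str) -> datetime:
    # Accept a trailing "Z" on Python versions older than 3.11.
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def latest_active_changeset(entries: list[dict]) -> Optional[str]:
    """Return the id of the newest changeset that is not rolled back, if any."""
    active = [e for e in entries if not e["rolled_back"]]
    if not active:
        return None
    newest = max(active, key=lambda e: _parse(e["created_at"]))
    return newest["changeset_id"]


entries = [
    {"changeset_id": "cs_old", "created_at": "2025-01-15T14:30:00Z", "rolled_back": False},
    {"changeset_id": "cs_new", "created_at": "2025-01-16T09:00:00Z", "rolled_back": False},
    {"changeset_id": "cs_reverted", "created_at": "2025-01-17T08:00:00Z", "rolled_back": True},
]
candidate = latest_active_changeset(entries)
```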

Rollback

A changeset can be rolled back through either the apply_knowledge MCP tool or the Admin REST API. Both paths use the same revert engine.

{ "action": "rollback", "changeset_id": "cs_x1y2z3a4b5c6", "confirm": true }

curl -X POST \
  https://mcp.example.com/api/v1/admin/knowledge/changesets/cs_x1y2z3a4b5c6/rollback \
  -H "Authorization: Bearer $ADMIN_TOKEN"
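
For scripts, the same REST call can be built with Python's standard library. This sketch only constructs the request so that it runs offline; the base URL and token are placeholders, and `build_rollback_request` is a hypothetical helper.

```python
import urllib.request


def build_rollback_request(base_url: str, changeset_id: str, token: str) -> urllib.request.Request:
    """Build the POST request for the changeset rollback endpoint shown above."""
    url = f"{base_url.rstrip('/')}/api/v1/admin/knowledge/changesets/{changeset_id}/rollback"
    return urllib.request.Request(
        url,
        method="POST",
        headers={"Authorization": f"Bearer {token}"},
    )


req = build_rollback_request("https://mcp.example.com", "cs_x1y2z3a4b5c6", "example-token")
# urllib.request.urlopen(req) would send it; left out so the sketch has no side effects.
```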

Rollback reverts each change the apply made, back to its before-image:

  • A tag, glossary term, or documentation link the apply added is removed, unless it was already present in the before-image (in which case the add was a no-op and the pre-existing value is kept). This is what preserves an entity's canonical glossary term when an apply added another term alongside it.
  • A tag the apply removed is restored.
  • A description the apply changed is reset to the prior description.

The rollback also transitions the changeset's source insights from applied to rolled_back and records the rollback as its own auditable tool call referencing the original changeset_id.

Rollback is refused, rather than silently applied, in the following cases:

  • The changeset has already been rolled back.
  • A newer, not-yet-rolled-back changeset has since modified the same aspect on the same entity. Reverting would clobber that newer change, so the rollback is blocked and names the conflicting changeset; roll the newer one back first, or restore the desired state with a fresh apply.
  • The changeset contains change types whose prior state was not captured in the before-image and therefore cannot be reverted automatically: column-level descriptions, structured properties, incidents, curated queries, context documents, and prompts. For these, restore the desired state with a new apply.

When the changeset contains a mix of revertible and unrevertible change types, the rollback is refused as a whole so it never leaves the entity in a partially reverted state.
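
The before-image rules and the all-or-nothing refusal can be sketched as a planning step that either returns every revert operation or raises before returning any. This illustrates the described behavior, not the actual revert engine. It covers only tags and the entity-level description, assumes tags are already normalized to URNs, and treats any other change as unrevertible. `plan_rollback` is a hypothetical helper.

```python
def plan_rollback(before: dict, changes: list[dict]) -> list[tuple[str, str]]:
    """Return revert operations for a changeset, or raise if any change cannot be reverted."""
    ops: list[tuple[str, str]] = []
    for change in changes:
        kind, detail = change["change_type"], change["detail"]
        is_column = change.get("target", "").startswith("column:")
        if kind == "add_tag":
            # Remove only what the apply really added; a tag already in the
            # before-image was a no-op add and is kept.
            if detail not in before["tags"]:
                ops.append(("remove_tag", detail))
        elif kind == "remove_tag":
            # Restore only if the tag existed before the apply (assumption).
            if detail in before["tags"]:
                ops.append(("add_tag", detail))
        elif kind == "update_description" and not is_column:
            ops.append(("update_description", before["description"]))
        else:
            # Raising here discards the partial plan, so nothing is half-reverted.
            raise ValueError(f"cannot revert change type: {kind}")
    return ops


before = {"description": "Order records", "tags": ["urn:li:tag:financial"]}
changes = [
    {"change_type": "add_tag", "target": "entity", "detail": "urn:li:tag:financial"},
    {"change_type": "add_tag", "target": "entity", "detail": "urn:li:tag:gross-margin"},
    {"change_type": "update_description", "target": "entity", "detail": "New text"},
]
plan = plan_rollback(before, changes)
```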

Complete Workflow Example

An analyst discovers that the amount column in the orders table represents gross margin, not revenue. Here's how that knowledge flows through the system:

sequenceDiagram
    participant Analyst
    participant AI as AI Assistant
    participant Platform as mcp-data-platform
    participant DB as PostgreSQL
    participant Admin
    participant DH as DataHub

    Note over Analyst,AI: Discovery phase
    Analyst->>AI: What does the amount column mean?
    AI->>Platform: trino_describe_table(orders)
    Platform-->>AI: amount: DECIMAL (no description)
    Analyst->>AI: That's gross margin before returns,<br/>not revenue like the name suggests

    Note over AI,Platform: Capture
    AI->>Platform: capture_insight(<br/>category: correction,<br/>source: user,<br/>entity_urns: [urn:li:dataset:...orders...],<br/>insight_text: "amount column is gross margin<br/>before returns, not revenue",<br/>confidence: high,<br/>suggested_actions: [{<br/>  action_type: update_description,<br/>  target: amount,<br/>  detail: "Gross margin before returns"<br/>}])
    Platform->>DB: INSERT (status: pending)
    Platform-->>AI: Insight captured: a1b2c3

    Note over Admin,DH: Review phase (later)
    Admin->>Platform: apply_knowledge(action: bulk_review)
    Platform-->>Admin: 1 pending insight for orders

    Admin->>Platform: apply_knowledge(action: review,<br/>entity_urn: ...orders...)
    Platform->>DH: Get current metadata
    Platform-->>Admin: Current: no description<br/>Insight: "amount is gross margin"

    Admin->>Platform: apply_knowledge(action: approve,<br/>insight_ids: [a1b2c3])
    Platform->>DB: UPDATE status = approved

    Note over Admin,DH: Apply phase
    Admin->>Platform: apply_knowledge(action: synthesize,<br/>entity_urn: ...orders...)
    Platform-->>Admin: Proposed: update amount description

    Admin->>Platform: apply_knowledge(action: apply,<br/>changes: [{update_description,<br/>amount, "Gross margin before returns"}],<br/>insight_ids: [a1b2c3],<br/>confirm: true)
    Platform->>DH: Update column description
    Platform->>DB: INSERT changeset (previous_value saved)
    Platform->>DB: UPDATE insight status = applied
    Platform-->>Admin: Changeset cs_x1y2 recorded

The next time anyone queries this table, the enriched response includes the corrected description.

Next Steps