S3 Enrichment

S3 enrichment works bidirectionally: S3 tool results can include DataHub metadata, and DataHub results can show S3 storage availability.

S3 to DataHub Enrichment

When s3_semantic_enrichment is enabled, S3 operations include semantic context from DataHub for matching datasets.

What Gets Enriched

Tool                    Enrichment
s3_list_objects         Matching DataHub datasets for the bucket/prefix
s3_get_object           Dataset metadata if the object is cataloged
s3_get_object_metadata  Dataset metadata if the object is cataloged

Example: List Objects

Request:

List the files in the sales/orders/ folder

Tool call: s3_list_objects with bucket data-lake and prefix sales/orders/

Response with enrichment:

{
  "objects": [
    {
      "key": "sales/orders/2024/01/orders-20240115.parquet",
      "size": 52428800,
      "last_modified": "2024-01-15T10:30:00Z"
    },
    {
      "key": "sales/orders/2024/01/orders-20240116.parquet",
      "size": 48576000,
      "last_modified": "2024-01-16T10:30:00Z"
    }
  ],
  "semantic_context": {
    "matching_datasets": [
      {
        "urn": "urn:li:dataset:(urn:li:dataPlatform:s3,data-lake/sales/orders,PROD)",
        "name": "Sales Orders",
        "description": "Daily order snapshots in Parquet format",
        "owners": ["Sales Data Team"],
        "tags": ["financial", "daily-snapshot"],
        "domain": "Sales",
        "quality_score": 0.89
      }
    ],
    "note": "Semantic metadata from DataHub for S3 location"
  }
}

How Matching Works

The platform searches DataHub for datasets matching the S3 location:

  1. Constructs a search query from the bucket and prefix
  2. Filters for platform: s3 datasets
  3. Returns up to 5 matching datasets

This handles various cataloging patterns:

  • Exact path matches (s3://bucket/path/)
  • Parent directory matches
  • Wildcard patterns in DataHub
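The matching steps above can be sketched in a few lines of Python. This is an illustration only, not the platform's actual implementation: `DatasetSearch`, `build_dataset_search`, and `candidate_paths` are hypothetical names, and the idea that parent directories are tried from most to least specific is an assumption.

```python
from dataclasses import dataclass

MAX_MATCHES = 5  # documented cap on returned datasets


@dataclass(frozen=True)
class DatasetSearch:
    """Hypothetical search request; not a DataHub client type."""
    query: str
    platform: str
    limit: int


def build_dataset_search(bucket: str, prefix: str = "") -> DatasetSearch:
    """Build a search request for an S3 location from bucket + prefix."""
    # Join the non-empty parts so a missing prefix does not leave a trailing slash.
    path = "/".join(p for p in (bucket.strip("/"), prefix.strip("/")) if p)
    return DatasetSearch(query=path, platform="s3", limit=MAX_MATCHES)


def candidate_paths(bucket: str, prefix: str = "") -> list[str]:
    """Return the exact path followed by each parent directory."""
    parts = [bucket.strip("/")] + [p for p in prefix.split("/") if p]
    # Assumption: more specific paths are preferred over their parents.
    return ["/".join(parts[:i]) for i in range(len(parts), 0, -1)]
```

For the earlier example, `build_dataset_search("data-lake", "sales/orders/")` produces the query `data-lake/sales/orders` restricted to the `s3` platform, and `candidate_paths` lists that path, then `data-lake/sales`, then `data-lake`.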

Configuration

injection:
  s3_semantic_enrichment: true

semantic:
  provider: datahub
  instance: primary

DataHub to S3 Enrichment

When datahub_storage_enrichment is enabled, DataHub results for S3 datasets include storage availability information.

What Gets Enriched

Tool                Enrichment
datahub_search      S3 availability for S3-platform datasets
datahub_get_entity  Storage details for S3 datasets

Example: Search Results

Request:

Find raw data assets in the data lake

Tool call: datahub_search with query raw data lake

Response with enrichment:

{
  "results": [
    {
      "urn": "urn:li:dataset:(urn:li:dataPlatform:s3,data-lake/raw/events,PROD)",
      "name": "Raw Events",
      "platform": "s3"
    }
  ],
  "storage_context": {
    "urn:li:dataset:(urn:li:dataPlatform:s3,data-lake/raw/events,PROD)": {
      "available": true,
      "connection": "data_lake",
      "bucket": "data-lake",
      "prefix": "raw/events/",
      "object_count": 1250,
      "total_size_bytes": 5368709120,
      "last_modified": "2024-01-16T15:30:00Z"
    }
  }
}

Storage Availability Details

For each S3 dataset, the enrichment includes:

  • available - Whether the S3 location is accessible
  • connection - Which S3 connection to use
  • bucket - S3 bucket name
  • prefix - Object key prefix
  • object_count - Number of objects (if enumerable)
  • total_size_bytes - Total storage used
  • last_modified - Most recent object modification
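To make the aggregate fields concrete, here is a minimal sketch of how `object_count`, `total_size_bytes`, and `last_modified` could be derived from a listing. It assumes each entry has the `size` and `last_modified` keys shown in the s3_list_objects example; `summarize_objects` is a hypothetical helper, and the platform may compute these values differently.

```python
from datetime import datetime


def summarize_objects(objects: list[dict]) -> dict:
    """Aggregate listing entries into count, total size, and newest timestamp."""
    if not objects:
        return {"object_count": 0, "total_size_bytes": 0, "last_modified": None}

    def parse(ts: str) -> datetime:
        # datetime.fromisoformat() accepts a trailing "Z" only from Python 3.11 on,
        # so rewrite it as an explicit UTC offset first.
        return datetime.fromisoformat(ts.replace("Z", "+00:00"))

    newest = max(objects, key=lambda obj: parse(obj["last_modified"]))
    return {
        "object_count": len(objects),
        "total_size_bytes": sum(obj["size"] for obj in objects),
        "last_modified": newest["last_modified"],
    }
```

Applied to the two Parquet files from the List Objects example, this yields a count of 2, a total of 101004800 bytes, and a `last_modified` of `2024-01-16T10:30:00Z`.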

Configuration

injection:
  datahub_storage_enrichment: true

storage:
  provider: s3
  instance: data_lake

Combined Example

A search for "orders" with all enrichments enabled:

{
  "results": [
    {
      "urn": "urn:li:dataset:(urn:li:dataPlatform:trino,hive.sales.orders,PROD)",
      "name": "Orders (Trino)",
      "platform": "trino"
    },
    {
      "urn": "urn:li:dataset:(urn:li:dataPlatform:s3,data-lake/sales/orders,PROD)",
      "name": "Orders (S3)",
      "platform": "s3"
    }
  ],
  "query_context": {
    "urn:li:dataset:(urn:li:dataPlatform:trino,hive.sales.orders,PROD)": {
      "queryable": true,
      "sample_query": "SELECT * FROM hive.sales.orders LIMIT 10"
    }
  },
  "storage_context": {
    "urn:li:dataset:(urn:li:dataPlatform:s3,data-lake/sales/orders,PROD)": {
      "available": true,
      "bucket": "data-lake",
      "prefix": "sales/orders/",
      "total_size_bytes": 1073741824
    }
  }
}

This shows:

  • The Trino dataset is queryable with SQL
  • The S3 dataset has 1 GiB of data available for direct access
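Because both `query_context` and `storage_context` are keyed by URN, a client can join them back onto the result list. The sketch below assumes the response shape shown in this section; `merge_enrichment` and the output keys `query` and `storage` are hypothetical names chosen for illustration.

```python
def merge_enrichment(response: dict) -> list[dict]:
    """Attach per-URN query and storage context to each search result."""
    query_ctx = response.get("query_context", {})
    storage_ctx = response.get("storage_context", {})
    merged = []
    for result in response.get("results", []):
        urn = result["urn"]
        merged.append({
            **result,
            # None means no enrichment of that kind was returned for this URN.
            "query": query_ctx.get(urn),
            "storage": storage_ctx.get(urn),
        })
    return merged
```

With the combined response, the Trino entry would carry a `query` context and no `storage` context, while the S3 entry would carry the reverse.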


Use Cases

Data Discovery

When browsing S3, learn what datasets exist:

User: What's under s3://data-lake/marketing/?

Claude: [Lists objects, sees semantic context]
This location contains the Marketing Campaign Data dataset:
- Owned by Marketing Analytics Team
- Contains PII (customer email addresses)
- Quality score: 0.76 (some data quality issues noted)

Storage to Query Bridge

Find S3 data, then check whether it is queryable via Trino:

User: I found raw event data in S3. Can I query it?

Claude: [Gets S3 metadata, checks DataHub, sees query context]
The raw events dataset in S3 is also available as an external table in Trino.
You can query it with: SELECT * FROM hive.raw.events LIMIT 10

Data Quality Checks

Before processing S3 data, check its metadata:

User: Is the orders data in S3 reliable?

Claude: [Lists S3, sees semantic context with quality score]
The orders data has a quality score of 0.89 in DataHub.
It's owned by the Sales Data Team and updated daily.
Note: It's tagged as containing financial data.

Troubleshooting

S3 enrichment not appearing:

  1. Check that the bucket/prefix exists in DataHub as an S3 dataset
  2. Verify the semantic provider is configured
  3. Ensure s3_semantic_enrichment: true is set

Storage context not appearing for S3 datasets:

  1. Verify the S3 connection is configured
  2. Check that the bucket is accessible with configured credentials
  3. Ensure datahub_storage_enrichment: true is set

Mismatched bucket names:

DataHub may catalog S3 paths differently from how they appear in S3. Check:

  • URN format in DataHub
  • Bucket name normalization
  • Path separators
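When debugging a mismatch, it can help to split a dataset URN into its bucket and prefix and compare them with the S3 location being listed. The sketch below assumes the dataset name inside the URN is `bucket/path`, as in the examples on this page, optionally with an `s3://` scheme; `parse_s3_urn` and `same_location` are hypothetical helpers, and real catalogs may use other naming conventions.

```python
import re

# Matches the URN layout used in this page's examples:
# urn:li:dataset:(urn:li:dataPlatform:s3,<name>,<env>)
_S3_URN = re.compile(
    r"^urn:li:dataset:\(urn:li:dataPlatform:s3,(?P<name>[^,]+),(?P<env>[^)]+)\)$"
)


def parse_s3_urn(urn: str) -> tuple[str, str, str]:
    """Split an S3 dataset URN into (bucket, prefix, environment)."""
    match = _S3_URN.match(urn)
    if match is None:
        raise ValueError(f"not an S3 dataset URN: {urn!r}")
    name = match.group("name")
    if name.startswith("s3://"):  # assumption: some catalogs keep the scheme
        name = name[len("s3://"):]
    bucket, _, prefix = name.strip("/").partition("/")
    return bucket, prefix, match.group("env")


def same_location(urn: str, bucket: str, prefix: str) -> bool:
    """Compare a URN with a bucket/prefix, ignoring leading and trailing slashes."""
    urn_bucket, urn_prefix, _ = parse_s3_urn(urn)
    return urn_bucket == bucket and urn_prefix == prefix.strip("/")
```

For example, the Sales Orders URN parses to bucket `data-lake` and prefix `sales/orders`, which `same_location` treats as equal to the listed prefix `sales/orders/`.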
