S3 Enrichment

S3 enrichment works bidirectionally: S3 tool results can include DataHub metadata, and DataHub results can show S3 storage availability.

S3 to DataHub Enrichment

When s3_semantic_enrichment is enabled, S3 operations include semantic context from DataHub for matching datasets.

What Gets Enriched

Tool	Enrichment
`s3_list_objects`	Matching DataHub datasets for the bucket/prefix
`s3_get_object`	Dataset metadata if the object is cataloged
`s3_get_object_metadata`	Dataset metadata if the object is cataloged

Example: List Objects

Request:

List the files in the sales/orders/ folder

Tool call: s3_list_objects with bucket data-lake and prefix sales/orders/

Response with enrichment:

{
  "objects": [
    {
      "key": "sales/orders/2024/01/orders-20240115.parquet",
      "size": 52428800,
      "last_modified": "2024-01-15T10:30:00Z"
    },
    {
      "key": "sales/orders/2024/01/orders-20240116.parquet",
      "size": 48576000,
      "last_modified": "2024-01-16T10:30:00Z"
    }
  ],
  "semantic_context": {
    "matching_datasets": [
      {
        "urn": "urn:li:dataset:(urn:li:dataPlatform:s3,data-lake/sales/orders,PROD)",
        "name": "Sales Orders",
        "description": "Daily order snapshots in Parquet format",
        "owners": ["Sales Data Team"],
        "tags": ["financial", "daily-snapshot"],
        "domain": "Sales",
        "quality_score": 0.89
      }
    ],
    "note": "Semantic metadata from DataHub for S3 location"
  }
}
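As a rough illustration of how a client might consume this response, the sketch below condenses the `semantic_context` block into one line per matching dataset. The function name `summarize_matches` is hypothetical, and the code assumes only the field names shown in the example response; it treats every field as optional.

```python
from typing import Any, Dict, List


def summarize_matches(response: Dict[str, Any]) -> List[str]:
    """Return one summary line per dataset in semantic_context (hypothetical helper)."""
    # The enrichment block may be absent when no dataset matches.
    context = response.get("semantic_context") or {}
    lines = []
    for ds in context.get("matching_datasets", []):
        owners = ", ".join(ds.get("owners", [])) or "unknown owner"
        score = ds.get("quality_score")
        score_text = f"{score:.2f}" if score is not None else "n/a"
        label = ds.get("name", ds.get("urn", "?"))
        lines.append(f"{label}: owners={owners}; quality={score_text}")
    return lines
```

A response without `semantic_context` yields an empty list, so callers can use the same code path whether or not enrichment is enabled.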

How Matching Works

The platform searches DataHub for datasets matching the S3 location:

Constructs search query from bucket + prefix
Filters for platform: s3 datasets
Returns up to 5 matching datasets

This handles various cataloging patterns:

Exact path matches (s3://bucket/path/)
Parent directory matches
Wildcard patterns in DataHub
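The exact-path and parent-directory cases above can be pictured with a small sketch. This is not the platform's actual matching code: `candidate_dataset_paths` and `pick_matches` are hypothetical names, the comparison is a plain string match against dataset names of the form `bucket/path`, and wildcard patterns are not covered. Only the limit of 5 comes from the description above.

```python
from typing import List


def candidate_dataset_paths(bucket: str, prefix: str) -> List[str]:
    """Return the exact bucket/prefix path followed by each parent path."""
    # Drop empty segments so "sales/orders/" and "sales/orders" behave alike.
    parts = [p for p in prefix.split("/") if p]
    paths = []
    # Walk from the most specific path up to the bare bucket.
    for depth in range(len(parts), -1, -1):
        paths.append("/".join([bucket] + parts[:depth]))
    return paths


def pick_matches(paths: List[str], cataloged: List[str], limit: int = 5) -> List[str]:
    """Keep candidate paths that are cataloged, most specific first, capped at limit."""
    known = set(cataloged)
    return [p for p in paths if p in known][:limit]
```

For the earlier example, `candidate_dataset_paths("data-lake", "sales/orders/2024/")` starts at `data-lake/sales/orders/2024` and ends at `data-lake`, so a dataset cataloged as `data-lake/sales/orders` would be found as a parent-directory match.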

Configuration

injection:
  s3_semantic_enrichment: true

semantic:
  provider: datahub
  instance: primary

DataHub to S3 Enrichment

When datahub_storage_enrichment is enabled, DataHub results for S3 datasets include storage availability information.

What Gets Enriched

Tool	Enrichment
`datahub_search`	S3 availability for S3-platform datasets
`datahub_get_entity`	Storage details for S3 datasets

Example: Search Results

Request:

Find raw data assets in the data lake

Tool call: datahub_search with query raw data lake

Response with enrichment:

{
  "results": [
    {
      "urn": "urn:li:dataset:(urn:li:dataPlatform:s3,data-lake/raw/events,PROD)",
      "name": "Raw Events",
      "platform": "s3"
    }
  ],
  "storage_context": {
    "urn:li:dataset:(urn:li:dataPlatform:s3,data-lake/raw/events,PROD)": {
      "available": true,
      "connection": "data_lake",
      "bucket": "data-lake",
      "prefix": "raw/events/",
      "object_count": 1250,
      "total_size_bytes": 5368709120,
      "last_modified": "2024-01-16T15:30:00Z"
    }
  }
}

Storage Availability Details

For each S3 dataset, the enrichment includes:

available - Whether the S3 location is accessible
connection - Which S3 connection to use
bucket - S3 bucket name
prefix - Object key prefix
object_count - Number of objects (if enumerable)
total_size_bytes - Total storage used
last_modified - Most recent object modification
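One way to handle these fields on the client side is sketched below. `StorageInfo`, `parse_storage_entry`, and `format_size` are hypothetical names; the sketch assumes only the field names listed above and treats everything except `available` as optional, since `object_count` is reported only when the location is enumerable.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class StorageInfo:
    """Typed view of one storage_context entry (hypothetical structure)."""
    available: bool
    connection: Optional[str] = None
    bucket: Optional[str] = None
    prefix: Optional[str] = None
    object_count: Optional[int] = None
    total_size_bytes: Optional[int] = None
    last_modified: Optional[str] = None


def parse_storage_entry(entry: Dict[str, Any]) -> StorageInfo:
    """Build a StorageInfo, leaving missing fields as None."""
    return StorageInfo(
        available=bool(entry.get("available", False)),
        connection=entry.get("connection"),
        bucket=entry.get("bucket"),
        prefix=entry.get("prefix"),
        object_count=entry.get("object_count"),
        total_size_bytes=entry.get("total_size_bytes"),
        last_modified=entry.get("last_modified"),
    )


def format_size(num_bytes: int) -> str:
    """Render a byte count using binary units (1 KiB = 1024 bytes)."""
    units = ["B", "KiB", "MiB", "GiB", "TiB"]
    size = float(num_bytes)
    for unit in units:
        # Stop at the first unit that fits, or at the largest unit available.
        if size < 1024 or unit == units[-1]:
            return f"{size:.1f} {unit}"
        size /= 1024
```

With the search example above, `format_size(5368709120)` gives `5.0 GiB`.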

Configuration

injection:
  datahub_storage_enrichment: true

storage:
  provider: s3
  instance: data_lake

Combined Example

A search for "orders" with all enrichments enabled:

{
  "results": [
    {
      "urn": "urn:li:dataset:(urn:li:dataPlatform:trino,hive.sales.orders,PROD)",
      "name": "Orders (Trino)",
      "platform": "trino"
    },
    {
      "urn": "urn:li:dataset:(urn:li:dataPlatform:s3,data-lake/sales/orders,PROD)",
      "name": "Orders (S3)",
      "platform": "s3"
    }
  ],
  "query_context": {
    "urn:li:dataset:(urn:li:dataPlatform:trino,hive.sales.orders,PROD)": {
      "queryable": true,
      "sample_query": "SELECT * FROM hive.sales.orders LIMIT 10"
    }
  },
  "storage_context": {
    "urn:li:dataset:(urn:li:dataPlatform:s3,data-lake/sales/orders,PROD)": {
      "available": true,
      "bucket": "data-lake",
      "prefix": "sales/orders/",
      "total_size_bytes": 1073741824
    }
  }
}

This shows:

The Trino dataset is queryable with SQL
The S3 dataset has 1 GiB (1,073,741,824 bytes) of data available for direct access
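Because both `query_context` and `storage_context` are keyed by dataset URN, a client can join them back onto the search results. The sketch below shows one way to do that; `merge_contexts` is a hypothetical helper, and it assumes only the response shape shown in the combined example.

```python
from typing import Any, Dict, List


def merge_contexts(response: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Attach per-URN query and storage context to each search result."""
    query_ctx = response.get("query_context", {})
    storage_ctx = response.get("storage_context", {})
    merged = []
    for result in response.get("results", []):
        urn = result["urn"]
        merged.append({
            **result,
            # None means that enrichment returned nothing for this URN.
            "query": query_ctx.get(urn),
            "storage": storage_ctx.get(urn),
        })
    return merged
```

Applied to the combined example, the Trino result carries a `query` entry and no `storage` entry, and the S3 result carries the reverse.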

Use Cases

Data Discovery

While browsing S3, see which cataloged datasets a location belongs to:

User: What's under s3://data-lake/marketing/?

Claude: [Lists objects, sees semantic context]
This location contains the Marketing Campaign Data dataset:
- Owned by Marketing Analytics Team
- Contains PII (customer email addresses)
- Quality score: 0.76 (some data quality issues noted)

Storage to Query Bridge

After finding data in S3, check whether it is also queryable through Trino:

User: I found raw event data in S3. Can I query it?

Claude: [Gets S3 metadata, checks DataHub, sees query context]
The raw events dataset in S3 is also available as an external table in Trino.
You can query it with: SELECT * FROM hive.raw.events LIMIT 10

Data Quality Checks

Before processing S3 data, check its metadata:

User: Is the orders data in S3 reliable?

Claude: [Lists S3, sees semantic context with quality score]
The orders data has a quality score of 0.89 in DataHub.
It's owned by the Sales Data Team and updated daily.
Note: It's tagged as containing financial data.

Troubleshooting

S3 enrichment not appearing:

Check that the bucket/prefix exists in DataHub as an S3 dataset
Verify the semantic provider is configured
Ensure s3_semantic_enrichment: true is set

Storage context not appearing for S3 datasets:

Verify the S3 connection is configured
Check that the bucket is accessible with configured credentials
Ensure datahub_storage_enrichment: true is set

Mismatched bucket names:

DataHub may catalog S3 paths differently than they appear in S3. Check:

URN format in DataHub
Bucket name normalization
Path separators
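When checking for a mismatch, it can help to pull the path out of the dataset URN and compare it with the S3 location after normalizing both. The sketch below is one way to do that, assuming URNs of the form used in the examples on this page; `parse_dataset_urn` and `normalize_s3_path` are hypothetical helpers, not part of the platform or of DataHub.

```python
from typing import NamedTuple


class DatasetUrn(NamedTuple):
    platform: str
    name: str
    env: str


def parse_dataset_urn(urn: str) -> DatasetUrn:
    """Split urn:li:dataset:(urn:li:dataPlatform:<platform>,<name>,<env>)."""
    prefix = "urn:li:dataset:("
    if not (urn.startswith(prefix) and urn.endswith(")")):
        raise ValueError(f"not a dataset URN: {urn!r}")
    body = urn[len(prefix):-1]
    # The platform URN is the first field and the environment is the last.
    platform_urn, rest = body.split(",", 1)
    name, env = rest.rsplit(",", 1)
    platform = platform_urn.rsplit(":", 1)[-1]
    return DatasetUrn(platform, name, env)


def normalize_s3_path(path: str) -> str:
    """Strip an s3:// scheme and surrounding slashes for comparison."""
    if path.startswith("s3://"):
        path = path[len("s3://"):]
    return path.strip("/")
```

Comparing `normalize_s3_path(parsed.name)` with `normalize_s3_path("s3://data-lake/sales/orders/")` makes differences in scheme or trailing separators irrelevant, so any remaining difference points at the bucket name or path itself.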

Next Steps

Configuration Reference - All injection options
S3 Toolkit Configuration - S3 setup