S3 Enrichment¶
S3 enrichment works bidirectionally: S3 tool results can include DataHub metadata, and DataHub results can show S3 storage availability.
S3 to DataHub Enrichment¶
When s3_semantic_enrichment is enabled, S3 operations include semantic context from DataHub for matching datasets.
What Gets Enriched¶
| Tool | Enrichment |
|---|---|
| s3_list_objects | Matching DataHub datasets for the bucket/prefix |
| s3_get_object | Dataset metadata if the object is cataloged |
| s3_get_object_metadata | Dataset metadata if the object is cataloged |
Example: List Objects¶
Request:
Tool call: s3_list_objects with bucket data-lake and prefix sales/orders/
Response with enrichment:
{
"objects": [
{
"key": "sales/orders/2024/01/orders-20240115.parquet",
"size": 52428800,
"last_modified": "2024-01-15T10:30:00Z"
},
{
"key": "sales/orders/2024/01/orders-20240116.parquet",
"size": 48576000,
"last_modified": "2024-01-16T10:30:00Z"
}
],
"semantic_context": {
"matching_datasets": [
{
"urn": "urn:li:dataset:(urn:li:dataPlatform:s3,data-lake/sales/orders,PROD)",
"name": "Sales Orders",
"description": "Daily order snapshots in Parquet format",
"owners": ["Sales Data Team"],
"tags": ["financial", "daily-snapshot"],
"domain": "Sales",
"quality_score": 0.89
}
],
"note": "Semantic metadata from DataHub for S3 location"
}
}
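As a rough illustration of how a client might consume this response, the sketch below totals the listed objects and pulls dataset names and tags out of semantic_context. The helper name summarize_listing is hypothetical, and the code assumes semantic_context may be missing from a response, which the source does not state explicitly.

```python
import json

# Trimmed copy of the example response shown above.
response = json.loads("""
{
  "objects": [
    {"key": "sales/orders/2024/01/orders-20240115.parquet", "size": 52428800},
    {"key": "sales/orders/2024/01/orders-20240116.parquet", "size": 48576000}
  ],
  "semantic_context": {
    "matching_datasets": [
      {
        "urn": "urn:li:dataset:(urn:li:dataPlatform:s3,data-lake/sales/orders,PROD)",
        "name": "Sales Orders",
        "owners": ["Sales Data Team"],
        "tags": ["financial", "daily-snapshot"],
        "quality_score": 0.89
      }
    ]
  }
}
""")


def summarize_listing(resp: dict) -> dict:
    """Hypothetical helper: combine object totals with any semantic context."""
    objects = resp.get("objects", [])
    # Assumption: semantic_context can be absent, so default to an empty match list.
    datasets = resp.get("semantic_context", {}).get("matching_datasets", [])
    return {
        "object_count": len(objects),
        "total_bytes": sum(obj.get("size", 0) for obj in objects),
        "dataset_names": [d["name"] for d in datasets],
        "tags": sorted({tag for d in datasets for tag in d.get("tags", [])}),
    }


summary = summarize_listing(response)
print(summary)
```

Defaulting to empty collections keeps the same code path usable whether or not enrichment produced a match.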
How Matching Works¶
The platform searches DataHub for datasets matching the S3 location:
- Constructs search query from bucket + prefix
- Filters for platform: s3 datasets
- Returns up to 5 matching datasets
This handles various cataloging patterns:
- Exact path matches (s3://bucket/path/)
- Parent directory matches
- Wildcard patterns in DataHub
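The matching steps can be approximated with a small in-memory sketch. This is not the platform's implementation: the real lookup is a DataHub search, whereas here a plain list of dictionaries stands in for the catalog. The function names, the catalog shape, and the "most specific path first" ordering are all assumptions made for illustration, and wildcard patterns are not handled.

```python
from typing import Dict, List


def candidate_paths(bucket: str, prefix: str) -> List[str]:
    """Return the exact path, then each parent directory, most specific first."""
    parts = [p for p in prefix.split("/") if p]
    return ["/".join([bucket] + parts[:depth]) for depth in range(len(parts), -1, -1)]


def match_datasets(
    bucket: str, prefix: str, catalog: List[Dict[str, str]], limit: int = 5
) -> List[Dict[str, str]]:
    """Hypothetical stand-in for the DataHub search described above."""
    # Rank each candidate path so that more specific matches sort first.
    rank = {path: i for i, path in enumerate(candidate_paths(bucket, prefix))}
    # Keep only S3-platform datasets whose path is the location or one of its parents.
    hits = [d for d in catalog if d["platform"] == "s3" and d["path"] in rank]
    hits.sort(key=lambda d: rank[d["path"]])
    # Cap the result count, mirroring the "up to 5" limit in the text.
    return hits[:limit]


# Hypothetical catalog entries; "path" mirrors the name portion of a dataset URN.
catalog = [
    {"name": "Sales (all)", "platform": "s3", "path": "data-lake/sales"},
    {"name": "Sales Orders", "platform": "s3", "path": "data-lake/sales/orders"},
    {"name": "Orders (Trino)", "platform": "trino", "path": "hive.sales.orders"},
    {"name": "Raw Events", "platform": "s3", "path": "data-lake/raw/events"},
]

matches = match_datasets("data-lake", "sales/orders/2024/01/", catalog)
print([d["name"] for d in matches])
```

Here a listing under sales/orders/2024/01/ matches the dataset cataloged at data-lake/sales/orders through a parent directory match, which is the case that lets date-partitioned prefixes resolve to a single cataloged dataset.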
Configuration¶
Enable this direction by setting s3_semantic_enrichment: true. The Configuration Reference lists this option alongside the other injection options.
DataHub to S3 Enrichment¶
When datahub_storage_enrichment is enabled, DataHub results for S3 datasets include storage availability information.
What Gets Enriched¶
| Tool | Enrichment |
|---|---|
| datahub_search | S3 availability for S3-platform datasets |
| datahub_get_entity | Storage details for S3 datasets |
Example: Search Results¶
Request:
Tool call: datahub_search with query raw data lake
Response with enrichment:
{
"results": [
{
"urn": "urn:li:dataset:(urn:li:dataPlatform:s3,data-lake/raw/events,PROD)",
"name": "Raw Events",
"platform": "s3"
}
],
"storage_context": {
"urn:li:dataset:(urn:li:dataPlatform:s3,data-lake/raw/events,PROD)": {
"available": true,
"connection": "data_lake",
"bucket": "data-lake",
"prefix": "raw/events/",
"object_count": 1250,
"total_size_bytes": 5368709120,
"last_modified": "2024-01-16T15:30:00Z"
}
}
}
Storage Availability Details¶
For each S3 dataset, the enrichment includes:
- available - Whether the S3 location is accessible
- connection - Which S3 connection to use
- bucket - S3 bucket name
- prefix - Object key prefix
- object_count - Number of objects (if enumerable)
- total_size_bytes - Total storage used
- last_modified - Most recent object modification
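One way a client could model these fields is a small dataclass. The class and method names below are hypothetical. Treating connection, object_count, total_size_bytes, and last_modified as optional is an assumption, based on object_count being reported only "if enumerable" and on the combined example omitting some fields.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class StorageContext:
    """Hypothetical client-side view of one storage_context entry."""

    available: bool
    bucket: str
    prefix: str
    connection: Optional[str] = None
    object_count: Optional[int] = None  # only present if the location is enumerable
    total_size_bytes: Optional[int] = None
    last_modified: Optional[str] = None

    @classmethod
    def from_dict(cls, raw: dict) -> "StorageContext":
        # Ignore unknown keys so extra fields in a response do not break parsing.
        known = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in raw.items() if k in known})

    def s3_uri(self) -> str:
        return f"s3://{self.bucket}/{self.prefix}"

    def size_gib(self) -> Optional[float]:
        if self.total_size_bytes is None:
            return None
        return self.total_size_bytes / 2**30


ctx = StorageContext.from_dict(
    {
        "available": True,
        "connection": "data_lake",
        "bucket": "data-lake",
        "prefix": "raw/events/",
        "object_count": 1250,
        "total_size_bytes": 5368709120,
        "last_modified": "2024-01-16T15:30:00Z",
    }
)
print(ctx.s3_uri(), ctx.size_gib())
```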
Configuration¶
Enable this direction by setting datahub_storage_enrichment: true. The Configuration Reference lists this option alongside the other injection options.
Combined Example¶
A search for "orders" with all enrichments enabled:
{
"results": [
{
"urn": "urn:li:dataset:(urn:li:dataPlatform:trino,hive.sales.orders,PROD)",
"name": "Orders (Trino)",
"platform": "trino"
},
{
"urn": "urn:li:dataset:(urn:li:dataPlatform:s3,data-lake/sales/orders,PROD)",
"name": "Orders (S3)",
"platform": "s3"
}
],
"query_context": {
"urn:li:dataset:(urn:li:dataPlatform:trino,hive.sales.orders,PROD)": {
"queryable": true,
"sample_query": "SELECT * FROM hive.sales.orders LIMIT 10"
}
},
"storage_context": {
"urn:li:dataset:(urn:li:dataPlatform:s3,data-lake/sales/orders,PROD)": {
"available": true,
"bucket": "data-lake",
"prefix": "sales/orders/",
"total_size_bytes": 1073741824
}
}
}
This shows:
- The Trino dataset is queryable with SQL
- The S3 dataset has 1 GB of data available for direct access
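Because query_context and storage_context are both keyed by URN, a client can join them back onto the result list. The sketch below shows one way to do that. The helper names attach_contexts and describe are hypothetical, and the response is a trimmed copy of the example.

```python
TRINO_URN = "urn:li:dataset:(urn:li:dataPlatform:trino,hive.sales.orders,PROD)"
S3_URN = "urn:li:dataset:(urn:li:dataPlatform:s3,data-lake/sales/orders,PROD)"

# Trimmed copy of the combined example response.
response = {
    "results": [
        {"urn": TRINO_URN, "name": "Orders (Trino)", "platform": "trino"},
        {"urn": S3_URN, "name": "Orders (S3)", "platform": "s3"},
    ],
    "query_context": {
        TRINO_URN: {
            "queryable": True,
            "sample_query": "SELECT * FROM hive.sales.orders LIMIT 10",
        }
    },
    "storage_context": {
        S3_URN: {
            "available": True,
            "bucket": "data-lake",
            "prefix": "sales/orders/",
            "total_size_bytes": 1073741824,
        }
    },
}


def attach_contexts(resp: dict) -> list:
    """Join each result with its query and storage context, looked up by URN."""
    query_ctx = resp.get("query_context", {})
    storage_ctx = resp.get("storage_context", {})
    return [
        {**r, "query": query_ctx.get(r["urn"]), "storage": storage_ctx.get(r["urn"])}
        for r in resp.get("results", [])
    ]


def describe(entry: dict) -> str:
    """Pick the access path the enrichment reports for one merged result."""
    if entry["query"] and entry["query"].get("queryable"):
        return f'{entry["name"]}: query with SQL'
    if entry["storage"] and entry["storage"].get("available"):
        s = entry["storage"]
        return f'{entry["name"]}: read from s3://{s["bucket"]}/{s["prefix"]}'
    return f'{entry["name"]}: no access path reported'


merged = attach_contexts(response)
for entry in merged:
    print(describe(entry))
```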
Use Cases¶
Data Discovery¶
When browsing S3, learn what datasets exist:
User: What's under s3://data-lake/marketing/?
Claude: [Lists objects, sees semantic context]
This location contains the Marketing Campaign Data dataset:
- Owned by Marketing Analytics Team
- Contains PII (customer email addresses)
- Quality score: 0.76 (some data quality issues noted)
Storage to Query Bridge¶
Find S3 data, then check whether it is queryable via Trino:
User: I found raw event data in S3. Can I query it?
Claude: [Gets S3 metadata, checks DataHub, sees query context]
The raw events dataset in S3 is also available as an external table in Trino.
You can query it with: SELECT * FROM hive.raw.events LIMIT 10
Data Quality Checks¶
Before processing S3 data, check its metadata:
User: Is the orders data in S3 reliable?
Claude: [Lists S3, sees semantic context with quality score]
The orders data has a quality score of 0.89 in DataHub.
It's owned by the Sales Data Team and updated daily.
Note: It's tagged as containing financial data.
Troubleshooting¶
S3 enrichment not appearing:
- Check that the bucket/prefix exists in DataHub as an S3 dataset
- Verify the semantic provider is configured
- Ensure s3_semantic_enrichment: true is set
Storage context not appearing for S3 datasets:
- Verify the S3 connection is configured
- Check that the bucket is accessible with configured credentials
- Ensure datahub_storage_enrichment: true is set
Mismatched bucket names:
DataHub may catalog S3 paths differently than they appear in S3. Check:
- URN format in DataHub
- Bucket name normalization
- Path separators
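When debugging a mismatch, it can help to compare the path embedded in a dataset URN with the bucket and prefix you are listing. The sketch below parses URNs of the shape used in the examples on this page and normalizes both sides before comparing. The function names and the normalization rules (dropping the s3:// scheme, collapsing repeated slashes, trimming leading and trailing slashes) are assumptions for illustration, not the platform's matching logic.

```python
import re
from typing import Tuple

# Matches URNs shaped like: urn:li:dataset:(urn:li:dataPlatform:<platform>,<name>,<env>)
_URN_RE = re.compile(r"^urn:li:dataset:\(urn:li:dataPlatform:([^,]+),(.+),([^,)]+)\)$")


def parse_dataset_urn(urn: str) -> Tuple[str, str, str]:
    """Split a dataset URN into (platform, name, environment)."""
    match = _URN_RE.match(urn)
    if match is None:
        raise ValueError(f"not a dataset URN: {urn!r}")
    platform, name, env = match.groups()
    return platform, name, env


def normalize_s3_path(path: str) -> str:
    """Drop any s3:// scheme, collapse repeated slashes, trim edge slashes."""
    path = re.sub(r"^s3[an]?://", "", path.strip())
    path = re.sub(r"/+", "/", path)
    return path.strip("/")


def same_location(urn: str, bucket: str, prefix: str) -> bool:
    """True if the URN names exactly the given bucket/prefix after normalization."""
    platform, name, _env = parse_dataset_urn(urn)
    return platform == "s3" and normalize_s3_path(name) == normalize_s3_path(
        f"{bucket}/{prefix}"
    )


urn = "urn:li:dataset:(urn:li:dataPlatform:s3,data-lake/sales/orders,PROD)"
print(parse_dataset_urn(urn))
print(same_location(urn, "data-lake", "sales/orders/"))
```

If same_location returns False for a location you expect to match, printing both normalized strings usually shows whether the difference is in the bucket name, a separator, or an extra path segment.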
Next Steps¶
- Configuration Reference - All injection options
- S3 Toolkit Configuration - S3 setup