Activating Your Apache Iceberg Data Lake for AI Personalization with Shaped

Apache Iceberg is revolutionizing data lake management with features like ACID transactions, schema evolution, and time travel, making it ideal for reliable analytics. But turning that structured data into real-time, AI-powered personalization still poses challenges. This article shows how Shaped’s native Iceberg connector bridges that gap: it connects directly to Iceberg tables (via Glue or Hive catalogs), ingests consistent snapshots, and trains advanced ML models for personalized recommendations and search, all without building complex pipelines. Learn how to activate your Iceberg data for intelligent, real-time experiences with minimal setup.

Bringing Intelligence to Your Data Lake Tables

Apache Iceberg has rapidly become a leading open table format, bringing database-like reliability, performance, and features (like ACID transactions, time travel, and schema evolution) to massive datasets stored in data lake object storage such as Amazon S3. Organizations are increasingly standardizing on Iceberg to manage their analytical data effectively. Iceberg is excellent for BI and analytics; the next frontier is activating this well-managed, reliable data for operational, AI-driven use cases like real-time personalization.

How do you leverage the trustworthy, versioned data in your Iceberg tables to power sophisticated recommendation models or personalize search results without complex ETL processes? How do you train state-of-the-art machine learning models directly on your data lake assets? This is where Shaped's native Apache Iceberg connector provides a seamless and powerful integration.

Shaped is an AI-native relevance platform designed to connect directly to your Iceberg tables (via catalogs like AWS Glue or Hive Metastore), ingest data efficiently, train advanced ML models, and serve personalized search rankings and recommendations through simple APIs. This post outlines the benefits of connecting your Iceberg data lake to Shaped and provides a guide to setting up the integration.

Why Connect Apache Iceberg to Shaped? Leverage Your Reliable Data Lake

Connecting your Iceberg tables directly to Shaped allows you to bridge your analytical data foundation with cutting-edge AI personalization, unlocking significant advantages:

  • Activate Your Data Lake Investment: Directly utilize the curated, governed, and reliable data stored in your Iceberg tables to fuel personalization models, maximizing the ROI on your data lake infrastructure.
  • Leverage Iceberg's Reliability: Train models on consistent data snapshots provided by Iceberg, avoiding the partial reads and inconsistent data states common with raw object storage access, and benefit from Iceberg's schema evolution support as your tables change.
  • Power Data-Rich Recommendations: Use comprehensive data from Iceberg tables for superior personalization:
    • Train on Verified Historical Data: Build models on large-scale, transactionally consistent interaction logs managed as Iceberg tables.
    • Utilize Curated Catalogs: Sync detailed product or content metadata directly from governed Iceberg catalog tables.
    • Incorporate Analytical Features: Leverage user segments or features computed by analytical jobs and stored in Iceberg tables to inform personalization.
  • Enhance Search with Reliable Data: Improve search relevance using trusted data from your lake:
    • Attribute-Based Filtering: Use accurate, up-to-date item attributes from Iceberg catalog tables for reliable filtering via Shaped's APIs.
    • Train on Consistent Engagement Data: Optimize search ranking models using historical user interaction data stored reliably in Iceberg format.
  • Simplified Data Pipelines: Eliminate the need for complex ETL jobs to export data out of your Iceberg data lake into a separate system for ML. Shaped reads directly from the tables defined in your Iceberg catalog.
  • Efficient Data Syncing: Shaped leverages Iceberg's metadata and snapshot capabilities to efficiently identify and sync only new or changed data after the initial load.

How it Works: The Iceberg Dataset Connector

Shaped connects to your Apache Iceberg tables by interacting with your chosen Iceberg catalog (AWS Glue Data Catalog or Hive Metastore). Shaped uses the catalog to discover the table's schema, metadata, and the location of the underlying data files (typically stored in object storage like S3).

Shaped then needs read access to:

  1. The Iceberg Catalog: To read table metadata.
  2. The Underlying Data Storage (e.g., S3): To read the actual data files (Parquet, ORC, Avro) referenced by the Iceberg manifest files.

Access is typically granted by giving Shaped's AWS service account appropriate IAM permissions, potentially via an assumed role (aws_role_arn) for enhanced security or cross-account access.

Connecting Apache Iceberg to Shaped

The setup involves granting Shaped necessary read permissions and then configuring the dataset connection within Shaped.

Step 1: Prepare Access Permissions

Shaped requires read-only access to interact with your Iceberg catalog and read the underlying data files.

  1. Contact Shaped: Reach out to the Shaped team (via support or your sales contact) to obtain the ARN of Shaped's AWS service account (<OUR_SERVICE_ACCOUNT_ARN>).
  2. Grant Permissions: The exact permissions depend on your setup (Catalog type, storage location):
    • Catalog Access (e.g., AWS Glue): Grant Shaped's service account permissions to read from the AWS Glue Data Catalog (e.g., glue:GetTable, glue:GetPartitions).
    • Storage Access (e.g., S3): Grant Shaped's service account permissions to read the data files from the S3 bucket(s) where your Iceberg table data resides. This typically includes s3:GetObject and potentially s3:ListBucket on the relevant paths.
    • Using aws_role_arn (Recommended for Secure Setups): Instead of granting direct access, you can create an IAM Role in your AWS account that has the necessary Glue and S3 read permissions, then grant Shaped's service account permission to assume this role (sts:AssumeRole). You will provide this aws_role_arn to Shaped during configuration. This is generally the most secure approach, especially for cross-account access. Consult AWS documentation for setting up cross-account role assumption.

Ensure the permissions allow Shaped to read both the Iceberg metadata (via the catalog) and the data files (in S3 or other storage).
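
As a rough illustration of the role-based option, the sketch below builds the two JSON documents such a role would typically need: a read-only permissions policy and a trust policy that lets Shaped's service account assume the role. It is a minimal sketch, not an official Shaped template. The region, account ID, database, table, bucket, and prefix are hypothetical placeholders, the Glue actions are only the examples named above (your setup may need more), and <OUR_SERVICE_ACCOUNT_ARN> stays a placeholder until the Shaped team provides the real value.

```python
import json

# Hypothetical placeholders: replace every value with your own.
REGION = "us-west-2"
ACCOUNT_ID = "123456789012"
DATABASE = "my_database"
TABLE = "my_table"
BUCKET = "my-iceberg-bucket"
PREFIX = "warehouse/my_database/my_table"
# Placeholder from Step 1; the real ARN comes from the Shaped team.
SHAPED_PRINCIPAL_ARN = "<OUR_SERVICE_ACCOUNT_ARN>"


def read_only_policy() -> dict:
    """Permissions policy for the role: read the catalog entry and the data files."""
    glue_prefix = f"arn:aws:glue:{REGION}:{ACCOUNT_ID}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ReadIcebergCatalog",
                "Effect": "Allow",
                # Example actions from the text; extend as your setup requires.
                "Action": ["glue:GetTable", "glue:GetPartitions"],
                "Resource": [
                    f"{glue_prefix}:catalog",
                    f"{glue_prefix}:database/{DATABASE}",
                    f"{glue_prefix}:table/{DATABASE}/{TABLE}",
                ],
            },
            {
                "Sid": "ListIcebergBucket",
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{BUCKET}"],
            },
            {
                "Sid": "ReadIcebergFiles",
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"arn:aws:s3:::{BUCKET}/{PREFIX}/*"],
            },
        ],
    }


def trust_policy() -> dict:
    """Trust policy for the role: allow Shaped's principal to assume it."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": SHAPED_PRINCIPAL_ARN},
                "Action": "sts:AssumeRole",
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(read_only_policy(), indent=2))
    print(json.dumps(trust_policy(), indent=2))
```

Scoping s3:GetObject to the table's prefix rather than the whole bucket keeps the grant read-only and narrow. Metadata and data files must both sit under that prefix for the sync to work.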

Step 2: Configure the Shaped Dataset (YAML)

Define the Iceberg table details and connection parameters in a Shaped dataset configuration file.

iceberg_dataset.yaml

name: your_iceberg_dataset_name # Choose a descriptive name

# --- Required Fields ---
schema_type: ICEBERG # Specifies the connector type

# Type of Iceberg catalog used (e.g., AWS Glue, Hive Metastore)
# Supported: glue, hive
catalog_type: glue

# Name of the catalog as configured in your environment
# (e.g., your Glue Data Catalog name if using Glue, often the AWS account ID)
catalog_name: your_glue_catalog_name # Or your_hive_catalog_name

# Name of the specific Iceberg table within the catalog. Include the
# database/namespace if applicable (e.g., my_database.my_table).
table_name: your_iceberg_table_name

# --- Optional Fields ---

# If using cross-account access or specific permissions, provide the ARN
# of the IAM Role that Shaped should assume. This role needs read access
# to the catalog and the underlying storage (e.g., S3).
aws_role_arn: arn:aws:iam::YOUR_ACCOUNT_ID:role/YourShapedAccessRole

# AWS Region where the Iceberg catalog and potentially data reside.
# Required if different from Shaped's default region or if needed for role assumption.
aws_region: us-west-2

# Columns uniquely identifying a row within the dataset (for deduplication).
# Shaped uses the latest record based on Iceberg's transaction history if duplicates exist.
unique_keys: ["user_id", "event_id"]

# Number of records fetched per batch during sync (Default: 10000).
batch_size: 50000

# NOTE: You do NOT typically define 'columns' or a 'replication_key' here.
# Shaped infers the schema and handles incremental updates using Iceberg's
# snapshot metadata directly.
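
To make the unique_keys comment in the config above more concrete, here is a small sketch of keep-latest deduplication. It is an illustration of the idea only, not Shaped's implementation: the commit_seq column is a hypothetical stand-in for the ordering that Iceberg's transaction history provides, and the function name is invented for this example.

```python
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]


def keep_latest(rows: List[Row], unique_keys: List[str], order_by: str) -> List[Row]:
    """Keep one row per unique-key combination, preferring the highest order_by value."""
    latest: Dict[Tuple[Any, ...], Row] = {}
    for row in rows:
        key = tuple(row[k] for k in unique_keys)
        current = latest.get(key)
        # Ties go to the row seen later in the input.
        if current is None or row[order_by] >= current[order_by]:
            latest[key] = row
    return list(latest.values())


if __name__ == "__main__":
    rows = [
        {"user_id": "u1", "event_id": "e1", "rating": 3, "commit_seq": 1},
        {"user_id": "u2", "event_id": "e7", "rating": 5, "commit_seq": 1},
        # Same (user_id, event_id) as the first row, written in a later commit.
        {"user_id": "u1", "event_id": "e1", "rating": 4, "commit_seq": 2},
    ]
    for row in keep_latest(rows, ["user_id", "event_id"], "commit_seq"):
        print(row)
```

With unique_keys set to user_id and event_id, the two u1/e1 rows collapse into the one from the later commit, so the rating of 4 survives.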

Key Configuration Points:

  • schema_type: Must be ICEBERG.
  • catalog_type, catalog_name, table_name: Provide the precise details identifying your Iceberg table within its catalog. Ensure the table name includes the database/namespace if applicable (e.g., my_database.my_table).
  • aws_role_arn: Strongly recommended for secure, cross-account access. Ensure the role has sufficient permissions.
  • Schema & Incremental Sync: Unlike other connectors, you usually don't specify columns or a replication_key. Shaped reads the schema from the Iceberg metadata and uses Iceberg's snapshot mechanism to efficiently process only new data since the last sync.
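
The points above can be turned into a quick local pre-flight check before submitting the file. The sketch below is not part of the Shaped CLI; it assumes the YAML has already been loaded into a dict (for example with PyYAML's yaml.safe_load), treats name as required alongside the documented required fields, and uses a deliberately loose pattern for role ARNs. The function name and message wording are made up for illustration.

```python
from typing import Any, Dict, List

# 'name' is assumed to be required in addition to the documented required fields.
REQUIRED_FIELDS = ("name", "schema_type", "catalog_type", "catalog_name", "table_name")
SUPPORTED_CATALOGS = {"glue", "hive"}


def check_iceberg_config(config: Dict[str, Any]) -> List[str]:
    """Return a list of 'error: ...' and 'warning: ...' messages; empty means no findings."""
    problems: List[str] = []

    for field in REQUIRED_FIELDS:
        if not config.get(field):
            problems.append(f"error: missing required field '{field}'")

    schema_type = config.get("schema_type")
    if schema_type and schema_type != "ICEBERG":
        problems.append("error: schema_type must be ICEBERG")

    catalog_type = config.get("catalog_type")
    if catalog_type and catalog_type not in SUPPORTED_CATALOGS:
        problems.append("error: catalog_type must be one of glue, hive")

    table_name = config.get("table_name")
    if table_name and "." not in table_name:
        # Only a hint: a database/namespace prefix is needed "if applicable".
        problems.append("warning: table_name has no database/namespace prefix")

    role_arn = config.get("aws_role_arn")
    if role_arn and not (role_arn.startswith("arn:aws:iam::") and ":role/" in role_arn):
        problems.append("warning: aws_role_arn does not look like an IAM role ARN")

    return problems


if __name__ == "__main__":
    config = {
        "name": "your_iceberg_dataset_name",
        "schema_type": "ICEBERG",
        "catalog_type": "glue",
        "catalog_name": "your_glue_catalog_name",
        "table_name": "my_database.my_table",
        "aws_role_arn": "arn:aws:iam::123456789012:role/YourShapedAccessRole",
    }
    for message in check_iceberg_config(config) or ["no findings"]:
        print(message)
```

A check like this only catches shape errors. Permission problems and missing tables still surface when Shaped validates the dataset in Step 3.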

Step 3: Create the Dataset in Shaped

Use the Shaped CLI to create the dataset using your configured YAML file:

Terminal

shaped create-dataset --file iceberg_dataset.yaml

Shaped will validate the configuration, attempt to assume the role (if specified), connect to the catalog, find the table, and begin syncing data based on the latest Iceberg snapshot. Monitor the status via the Shaped Dashboard or CLI (shaped view-dataset --dataset-name your_iceberg_dataset_name).

What Happens Next? Syncing, Training, Serving from Your Data Lake

Once the Iceberg connection is live:

  1. Initial Sync: Shaped reads the data files corresponding to the latest snapshot of your Iceberg table.
  2. Incremental Syncs: On a schedule (hourly by default, but configurable), Shaped checks the Iceberg table's metadata for new snapshots. It then reads only the data files associated with changes since the last sync.
  3. Model Training: Shaped uses the synced data to train its advanced AI models for search ranking and recommendations.
  4. API Serving: After models are trained, Shaped's APIs are ready to provide personalized results derived directly from the reliable data in your Iceberg data lake.
  5. Continuous Updates: Scheduled syncs and model retraining keep personalization fresh based on the latest committed data in your Iceberg table.
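
The difference between the initial and incremental syncs can be pictured with a toy model. The sketch below is a conceptual simplification, not Shaped's code or the Iceberg API: it assumes each snapshot simply records the set of data files that are live at that point, whereas real Iceberg tracks this through manifest lists and manifests. The Snapshot class, plan_sync function, and file names are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import FrozenSet, List, Optional, Set, Tuple


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    data_files: FrozenSet[str]  # data files that are live as of this snapshot


def plan_sync(
    snapshots: List[Snapshot], last_synced_id: Optional[int]
) -> Tuple[Set[str], Set[str], int]:
    """Return (files to read, files no longer live, snapshot id synced to).

    snapshots must be ordered oldest to newest and must not be empty.
    """
    latest = snapshots[-1]
    by_id = {s.snapshot_id: s for s in snapshots}
    previous = by_id.get(last_synced_id) if last_synced_id is not None else None

    if previous is None:
        # Initial sync, or the last synced snapshot is gone: read everything.
        return set(latest.data_files), set(), latest.snapshot_id

    to_read = set(latest.data_files - previous.data_files)
    no_longer_live = set(previous.data_files - latest.data_files)
    return to_read, no_longer_live, latest.snapshot_id


if __name__ == "__main__":
    history = [
        Snapshot(1, frozenset({"a.parquet", "b.parquet"})),
        Snapshot(2, frozenset({"a.parquet", "b.parquet", "c.parquet"})),
        # b.parquet was rewritten into d.parquet by a later commit.
        Snapshot(3, frozenset({"a.parquet", "c.parquet", "d.parquet"})),
    ]
    print(plan_sync(history, None))  # initial sync reads all live files
    print(plan_sync(history, 1))     # incremental sync reads only c and d
```

Because the comparison runs against committed snapshots, a sync never observes a half-written commit, which is the consistency property the steps above rely on.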

Conclusion: Bridge Your Data Lake and AI Personalization with Iceberg & Shaped

Apache Iceberg brings structure and reliability to your data lake. Shaped's native Iceberg connector allows you to directly leverage this investment, transforming your analytical data foundation into a powerful engine for AI-driven personalization without complex ETL. By securely connecting Shaped to your Iceberg tables, you can activate your most valuable, governed data assets to build state-of-the-art recommendation and search experiences efficiently and effectively.

Ready to activate your Iceberg data lake for intelligent relevance?

Request a demo of Shaped today to see it in action with your specific use case. Or, start exploring immediately with our free trial sandbox.

