Activate Your S3 Data Lake for AI Personalization with Shaped

Amazon S3 is great for storing batch data, but turning that static data into dynamic, personalized experiences usually requires heavy ML infrastructure. Shaped's direct S3 connector removes much of that complexity. Point Shaped at your S3 bucket (Parquet, CSV, TSV, or JSONL files), and it ingests the data, trains relevance models, and serves real-time personalized recommendations and search via API. There are no pipelines or custom loaders to build: plug in your data lake and activate it for intelligent discovery.

June 18, 2025

Tullie Murrell

Turning Data Files into Dynamic User Experiences

Amazon S3 is the backbone of countless data strategies, serving as a scalable, durable object storage service for everything from raw event logs and application backups to carefully curated item catalogs and batch-processed user features. While S3 provides an excellent foundation for storing data, the challenge often lies in transforming these static files into the fuel needed for dynamic, real-time AI personalization.

How do you use product catalog files updated nightly in S3 to power relevant recommendations today? How do you train sophisticated machine learning models on large batches of historical interaction data stored as Parquet files without building complex data loading and ML infrastructure? This is where Shaped's direct S3 connector provides a powerful and streamlined solution.

Shaped is an AI-native relevance platform designed to connect directly to your S3 buckets, ingest data from various file formats (Parquet, CSV, TSV, JSONL), train state-of-the-art models, and serve personalized search rankings and recommendations via simple APIs. This post outlines the benefits of connecting your S3 data to Shaped and provides a clear guide to setting up the integration.

Why Connect S3 to Shaped? Unleashing the Potential of Your Stored Data

Connecting your S3 buckets directly to Shaped allows you to activate valuable data assets that might otherwise remain siloed or require significant effort to utilize for real-time applications:

Leverage Batch Data for Recommendations: Use historical or processed data stored in S3 to power personalization:
- Train on Large Interaction Logs: Build models on comprehensive user behavior histories processed and stored as files (e.g., daily Parquet exports).
- Utilize Curated Item Catalogs: Easily sync detailed product or content metadata from files in S3 to enrich recommendations and enable attribute-based filtering.
- Incorporate Pre-computed User Features: Use user segments or features generated by offline batch jobs and stored in S3 to tailor recommendations.
- Improve Cold-Start Performance: Provide better initial recommendations by ensuring models have access to rich item attributes from catalog files in S3.
Enhance Search with Static & Historical Data: Improve search relevance using S3 data sources:
- Power Attribute-Based Search: Use detailed item attributes synced directly from catalog files in S3 for powerful filtering and faceting via Shaped's APIs.
- Train Ranking Models on Historical Data: Optimize search ranking by training models on past user engagement or conversion data stored in S3 files.
Flexible Data Pipelines: Integrate Shaped seamlessly with existing batch processing workflows that output data to S3.
Simplified ML Workflow: Avoid building custom data loaders, distributed training infrastructure, or complex MLOps pipelines for data residing in S3. Shaped handles the ingestion, training, and serving.
Scheduled Updates: Keep models fresh by regularly syncing new data files uploaded to S3, allowing Shaped to retrain automatically based on the latest information.

How it Works: The S3 Dataset Connector

Shaped's S3 connector periodically scans a specified path within your S3 bucket for new data files. It securely accesses your bucket using IAM permissions you grant to a Shaped-specific role. Shaped can read data from Parquet, CSV, TSV, or JSON Lines files.

Crucially for incremental updates: To ensure Shaped processes only new data after the initial load without reprocessing old files, your files must be named so that they sort lexicographically by time. The simplest and recommended way to achieve this is to include a fixed-width, zero-padded timestamp (e.g., YYYY-MM-DD-HH-MM-SS or a Unix timestamp) as a prefix or suffix in your filenames, so that newer files always appear "later" alphabetically.

Examples of valid naming conventions (use one convention per dataset path):

s3://your-bucket/path/to/data/events-2024-01-15-10-00-00.parquet.gz
s3://your-bucket/path/to/data/events-2024-01-15-11-00-00.parquet.gz
s3://your-bucket/path/to/data/catalog_1705338000.jsonl
s3://your-bucket/path/to/data/catalog_1705341600.jsonl
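As a minimal sketch of the naming rule, the Python snippet below builds keys with a zero-padded UTC timestamp and checks that string order matches time order. The helper names `timestamped_key` and `is_time_sorted` are hypothetical and not part of Shaped; they only illustrate the convention.

```python
from datetime import datetime, timezone


def timestamped_key(prefix: str, name: str, when: datetime, ext: str) -> str:
    """Build an S3 key whose timestamp sorts lexicographically in time order."""
    # Fixed-width, zero-padded UTC fields keep string order equal to time order.
    stamp = when.astimezone(timezone.utc).strftime("%Y-%m-%d-%H-%M-%S")
    return f"{prefix}{name}-{stamp}.{ext}"


def is_time_sorted(keys: list[str]) -> bool:
    """Return True if keys, given oldest first, are strictly increasing as strings."""
    return all(a < b for a, b in zip(keys, keys[1:]))


older = timestamped_key(
    "path/to/data/", "events", datetime(2024, 1, 15, 10, tzinfo=timezone.utc), "parquet"
)
newer = timestamped_key(
    "path/to/data/", "events", datetime(2024, 1, 15, 11, tzinfo=timezone.utc), "parquet"
)
print(older)
print(newer)
print(is_time_sorted([older, newer]))

# Unpadded fields break the ordering: "...-1-9" sorts after "...-1-10" as a string.
print(is_time_sorted(["events-2024-1-9", "events-2024-1-10"]))
```

Converting to UTC before formatting avoids ordering glitches around daylight-saving changes when files are written from machines in different time zones.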

Connecting S3 to Shaped

Setting up the connection involves preparing your data and filenames in S3, granting Shaped secure read access, and configuring the dataset in Shaped.

Step 1: Prepare Your Data and Filenames in S3

File Formats: Ensure your data is stored in one of the supported formats: Parquet (.parquet, potentially compressed, e.g. .parquet.gz), CSV (.csv), TSV (.tsv), or JSON Lines (.jsonl).
Filename Convention (Critical for Incremental Syncs): Name your files within the target S3 path such that new files are always lexicographically greater than older files. Using a timestamp prefix/suffix is the most reliable method (e.g., data_YYYYMMDD_HHMMSS.format). Without this, Shaped may reprocess files or miss new data during incremental syncs.
Consistent Schema: Ensure all files within the path intended for a single Shaped dataset share the same schema (column names and types).
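Before uploading, a quick local check can catch schema drift between files. The sketch below compares CSV headers only (column names, not types), and the function names and sample data are made up for illustration; Parquet or JSONL files would need a format-specific reader.

```python
import csv
import io


def header_of(csv_text: str) -> tuple[str, ...]:
    """Return the column names from the first row of a CSV document."""
    return tuple(next(csv.reader(io.StringIO(csv_text)), []))


def mismatched_files(files: dict[str, str]) -> list[str]:
    """List files whose header differs from the first file's header (in name order)."""
    names = sorted(files)
    if not names:
        return []
    expected = header_of(files[names[0]])
    return [name for name in names[1:] if header_of(files[name]) != expected]


# Hypothetical file contents keyed by filename; the third file drops a column.
sample = {
    "events-2024-01-15-10-00-00.csv": "user_id,item_id,event_type,timestamp\nu1,i1,click,1705312800\n",
    "events-2024-01-15-11-00-00.csv": "user_id,item_id,event_type,timestamp\nu2,i2,view,1705316400\n",
    "events-2024-01-15-12-00-00.csv": "user_id,item_id,timestamp\nu3,i3,1705320000\n",
}
print(mismatched_files(sample))
```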

Step 2: Grant Shaped Read-Only Access to Your S3 Bucket

Shaped requires secure, read-only access to list your bucket and get the file objects. This is done by adding a statement to your S3 bucket policy that grants access to Shaped's specific IAM Role ARN.

Obtain Shaped's IAM Role ARN: Contact the Shaped team (via your support channel or sales contact) to get the precise Principal ARN for Shaped's CustomerS3DataAccessRole. This will look something like arn:aws:iam::<shaped_account_id>:role/CustomerS3DataAccessRole.

Edit Your S3 Bucket Policy: Navigate to your S3 bucket in the AWS console, go to the "Permissions" tab, and edit the "Bucket policy".

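Add Policy Statement: Add the following JSON statement to your existing policy (or create a new policy if one doesn't exist). Replace {your_bucket} with your actual bucket name and <shaped_account_id> with the ID provided by Shaped. The // comments are explanatory only; remove them before saving, because a bucket policy must be valid JSON and JSON does not allow comments.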

    s3_bucket_policy.json

{
  "Version": "2012-10-17",
  "Statement": [
    // ... [Your existing policy statements, if any] ...
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::<shaped_account_id>:role/CustomerS3DataAccessRole"
      },
      "Action": [
        "s3:GetObject",        // Allows Shaped to read file contents
        "s3:ListBucket"        // Allows Shaped to list files in the specified path
      ],
      "Resource": [
        "arn:aws:s3:::{your_bucket}",       // Required for ListBucket on the bucket itself
        "arn:aws:s3:::{your_bucket}/*"      // Required for GetObject on files within the bucket
      ]
      // Optional: to restrict access to a path prefix, narrow the object Resource to
      // "arn:aws:s3:::{your_bucket}/path/to/your/data/*" and move s3:ListBucket into its own
      // statement with the Condition below (s3:prefix is only present on list requests):
      // "Condition": {
      //   "StringLike": {
      //     "s3:prefix": ["path/to/your/data/*"]
      //   }
      // }
    }
  ]
}

  

Save Changes: Save the updated bucket policy.
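One way to avoid hand-editing JSON is to generate the policy with a short script. The sketch below is an illustration, not Shaped tooling: `shaped_read_statements` is a hypothetical helper, the account ID 123456789012 is a placeholder for the one Shaped provides, and the optional prefix restriction assumes Shaped lists objects under the same prefix you configure as the dataset path.

```python
import json


def shaped_read_statements(bucket: str, role_arn: str, prefix: str = "") -> list[dict]:
    """Build read-only statements; prefix (e.g. "path/to/data/") optionally narrows access."""
    list_stmt = {
        "Effect": "Allow",
        "Principal": {"AWS": role_arn},
        "Action": "s3:ListBucket",
        "Resource": f"arn:aws:s3:::{bucket}",
    }
    if prefix:
        # s3:prefix is only present on list requests, so it lives on this statement.
        list_stmt["Condition"] = {"StringLike": {"s3:prefix": [f"{prefix}*"]}}
    get_stmt = {
        "Effect": "Allow",
        "Principal": {"AWS": role_arn},
        "Action": "s3:GetObject",
        # Object-level access is narrowed through the resource ARN instead.
        "Resource": f"arn:aws:s3:::{bucket}/{prefix}*",
    }
    return [list_stmt, get_stmt]


policy = {
    "Version": "2012-10-17",
    "Statement": shaped_read_statements(
        bucket="your-bucket-name",
        role_arn="arn:aws:iam::123456789012:role/CustomerS3DataAccessRole",  # placeholder ID
        prefix="path/to/your/data/",
    ),
}
print(json.dumps(policy, indent=2))
```

If the bucket already has a policy, append the generated statements to its existing Statement list rather than replacing the whole document.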

Step 3: Configure the Shaped Dataset (YAML)

Define the S3 location, file format, data schema, and unique keys in a Shaped dataset configuration file.

Create a YAML file (e.g., s3_dataset.yaml):

    s3_dataset.yaml

# s3_dataset.yaml
name: your_s3_dataset_name # Choose a descriptive name

# --- Required Fields ---

# Use CUSTOM schema_type for S3 connector
schema_type: CUSTOM

# Define the schema of the data within your S3 files.
# Supported types: STRING, INT, FLOAT, BOOLEAN, TIMESTAMP, ARRAY<STRING>, ARRAY<INT>, ARRAY<FLOAT>
column_schema:
  user_id: STRING
  item_id: STRING
  event_type: STRING
  timestamp: TIMESTAMP # Must be ISO 8601 or epoch
  price: FLOAT
  category: STRING

# Path to your data within the S3 bucket.
s3_path: "s3://your-bucket-name/path/to/your/data/"

# Format of the files in the specified path.
s3_format: PARQUET

# Used for deduplication
unique_keys: ["user_id", "item_id", "timestamp"]

# --- Optional Fields ---
# description: "Batch processed user events from S3"
# schedule_interval: "@daily"
    
  

Key Configuration Points:

schema_type: Must be CUSTOM for the S3 connector.
column_schema: You must define the schema accurately, matching the columns and types in your S3 files. Ensure timestamp columns are marked correctly.
s3_path: Specify the correct bucket and path. Use a trailing / for a directory or a glob pattern (*) for specific file matching.
s3_format: Must match the actual format of your files.
unique_keys: Important for ensuring data integrity, especially if data might be uploaded with slight overlaps or corrections.
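The points above can be turned into a small pre-flight check. The following is a sketch under stated assumptions: the configuration is represented as a Python dict mirroring the YAML fields (loading the YAML itself is left out), the type list is copied from the comment in the example file, and `config_problems` is a hypothetical helper rather than anything Shaped ships.

```python
# Column types listed in the example configuration's comment.
SUPPORTED_TYPES = {
    "STRING", "INT", "FLOAT", "BOOLEAN", "TIMESTAMP",
    "ARRAY<STRING>", "ARRAY<INT>", "ARRAY<FLOAT>",
}


def config_problems(config: dict) -> list[str]:
    """Return human-readable problems found in a dataset configuration."""
    problems = []
    if config.get("schema_type") != "CUSTOM":
        problems.append("schema_type must be CUSTOM")
    columns = config.get("column_schema", {})
    for column, col_type in columns.items():
        if col_type not in SUPPORTED_TYPES:
            problems.append(f"unsupported type for {column}: {col_type}")
    for key in config.get("unique_keys", []):
        if key not in columns:
            problems.append(f"unique key not in column_schema: {key}")
    if not str(config.get("s3_path", "")).startswith("s3://"):
        problems.append("s3_path must start with s3://")
    return problems


config = {
    "name": "your_s3_dataset_name",
    "schema_type": "CUSTOM",
    "column_schema": {
        "user_id": "STRING",
        "item_id": "STRING",
        "event_type": "STRING",
        "timestamp": "TIMESTAMP",
        "price": "FLOAT",
        "category": "STRING",
    },
    "s3_path": "s3://your-bucket-name/path/to/your/data/",
    "s3_format": "PARQUET",
    "unique_keys": ["user_id", "item_id", "timestamp"],
}
print(config_problems(config))
```

This only checks internal consistency of the configuration; Shaped performs its own validation when the dataset is created.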

Step 4: Create the Dataset in Shaped

Use the Shaped CLI to create the dataset using your configured YAML file:

    create_s3_dataset.sh
    
shaped create-dataset --file s3_dataset.yaml

Shaped will validate the configuration, check permissions, and begin the initial data sync by listing and reading files from your S3 path. Monitor progress and status via the Shaped Dashboard or CLI (shaped view-dataset --dataset-name your_s3_dataset_name).

What Happens Next? Syncing, Modeling, Serving

Once the S3 connection is established:

Initial Sync: Shaped reads all existing files matching the s3_path and s3_format based on their lexicographical order.
Incremental Syncs: On the defined schedule_interval (default: hourly), Shaped lists the s3_path and checks for any new files (whose names are lexicographically greater than the last file processed). It then ingests data only from these new files.
Model Training: Shaped uses the synced data from S3 to train its powerful AI models for search and recommendations.
API Serving: After models are trained, Shaped's APIs are ready to provide personalized results derived from your S3 data.
Ongoing Updates: As you upload new data files (following the naming convention!) to S3, Shaped automatically picks them up on its next scheduled sync and incorporates them into subsequent model training runs.
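The sync behavior described above can be modeled in a few lines. This is a simplified illustration of the "lexicographically greater than the last file processed" rule, not Shaped's actual implementation; `new_files` and the sample keys are hypothetical.

```python
from typing import Optional


def new_files(listed_keys: list[str], last_processed: Optional[str]) -> list[str]:
    """Return keys that sort strictly after last_processed, oldest first."""
    ordered = sorted(listed_keys)
    if last_processed is None:
        return ordered  # initial sync: every file, in lexicographic order
    return [key for key in ordered if key > last_processed]


keys = [
    "events-2024-01-15-10-00-00.parquet",
    "events-2024-01-15-11-00-00.parquet",
]
first_sync = new_files(keys, None)
last = first_sync[-1]

# Later, one new file arrives on time and one backfill arrives with an older name.
keys += [
    "events-2024-01-15-12-00-00.parquet",
    "events-2024-01-14-09-00-00.parquet",
]
second_sync = new_files(keys, last)
print(second_sync)
```

In this model the backfilled file named for 2024-01-14 is never picked up, because its name sorts before the last processed key. That is why late or corrected data should be uploaded under a new, later timestamp.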

Conclusion: Activate Your S3 Data Lake for Intelligent Personalization

Your data lake in S3 holds immense potential for driving personalization. Shaped's S3 connector provides a direct, efficient way to bridge this data store with state-of-the-art AI models, bypassing the need for complex custom pipelines and ML infrastructure. By correctly formatting your data files, setting up secure access, and configuring the connection in Shaped, you can easily activate your S3 data to power dynamic, personalized recommendations and search experiences that engage users and drive results.

Ready to unlock the AI potential of your data stored in S3?

Request a demo of Shaped today to see it in action with your specific use case. Or, start exploring immediately with our free trial sandbox.
