Turning Data Files into Dynamic User Experiences
Amazon S3 is the backbone of countless data strategies, serving as a scalable, durable object storage service for everything from raw event logs and application backups to carefully curated item catalogs and batch-processed user features. While S3 provides an excellent foundation for storing data, the challenge often lies in transforming these static files into the fuel needed for dynamic, real-time AI personalization.
How do you use product catalog files updated nightly in S3 to power relevant recommendations today? How do you train sophisticated machine learning models on large batches of historical interaction data stored as Parquet files without building complex data loading and ML infrastructure? This is where Shaped's direct S3 connector provides a powerful and streamlined solution.
Shaped is an AI-native relevance platform designed to connect directly to your S3 buckets, ingest data from various file formats (Parquet, CSV, TSV, JSONL), train state-of-the-art models, and serve personalized search rankings and recommendations via simple APIs. This post outlines the benefits of connecting your S3 data to Shaped and provides a clear guide to setting up the integration.
Why Connect S3 to Shaped? Unleashing the Potential of Your Stored Data
Connecting your S3 buckets directly to Shaped allows you to activate valuable data assets that might otherwise remain siloed or require significant effort to utilize for real-time applications:
- Leverage Batch Data for Recommendations: Use historical or processed data stored in S3 to power personalization:
  - Train on Large Interaction Logs: Build models on comprehensive user behavior histories processed and stored as files (e.g., daily Parquet exports).
  - Utilize Curated Item Catalogs: Easily sync detailed product or content metadata from files in S3 to enrich recommendations and enable attribute-based filtering.
  - Incorporate Pre-computed User Features: Use user segments or features generated by offline batch jobs and stored in S3 to tailor recommendations.
  - Improve Cold-Start Performance: Provide better initial recommendations by ensuring models have access to rich item attributes from catalog files in S3.
- Enhance Search with Static & Historical Data: Improve search relevance using S3 data sources:
  - Power Attribute-Based Search: Use detailed item attributes synced directly from catalog files in S3 for powerful filtering and faceting via Shaped's APIs.
  - Train Ranking Models on Historical Data: Optimize search ranking by training models on past user engagement or conversion data stored in S3 files.
- Flexible Data Pipelines: Integrate Shaped seamlessly with existing batch processing workflows that output data to S3.
- Simplified ML Workflow: Avoid building custom data loaders, distributed training infrastructure, or complex MLOps pipelines for data residing in S3. Shaped handles the ingestion, training, and serving.
- Scheduled Updates: Keep models fresh by regularly syncing new data files uploaded to S3, allowing Shaped to retrain automatically based on the latest information.
How it Works: The S3 Dataset Connector
Shaped's S3 connector periodically scans a specified path within your S3 bucket for new data files. It securely accesses your bucket using IAM permissions you grant to a Shaped-specific role. Shaped can read data from Parquet, CSV, TSV, or JSON Lines files.
Crucially for incremental updates: To ensure Shaped processes only new data after the initial load without reprocessing old files, your files must be named so that they sort lexicographically in chronological order. The simplest and recommended way to achieve this is by including a timestamp (e.g., a zero-padded YYYY-MM-DD-HH-MM-SS or a fixed-width Unix timestamp) as a prefix or suffix in your filenames, ensuring newer files always appear "later" alphabetically.
Example valid naming convention:
s3://your-bucket/path/to/data/events-2024-01-15-10-00-00.parquet.gz
s3://your-bucket/path/to/data/events-2024-01-15-11-00-00.parquet.gz
s3://your-bucket/path/to/data/catalog_1705338000.jsonl
s3://your-bucket/path/to/data/catalog_1705341600.jsonl
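As a rough illustration of the naming rule, the sketch below builds object names with a zero-padded UTC timestamp and confirms that plain string sorting matches chronological order. The helper name `timestamped_key` and the `events` prefix are hypothetical, chosen only to mirror the example names above.

```python
from datetime import datetime, timedelta, timezone


def timestamped_key(prefix: str, ts: datetime, ext: str) -> str:
    """Build an object name with a zero-padded UTC timestamp after the prefix."""
    return f"{prefix}-{ts.astimezone(timezone.utc):%Y-%m-%d-%H-%M-%S}.{ext}"


# Hypothetical hourly exports starting at 10:00 UTC.
start = datetime(2024, 1, 15, 10, 0, 0, tzinfo=timezone.utc)
keys = [
    timestamped_key("events", start + timedelta(hours=h), "parquet.gz")
    for h in range(3)
]

# Every timestamp field has a fixed width, so string order equals time order.
assert keys == sorted(keys)
print(keys[0])  # events-2024-01-15-10-00-00.parquet.gz

# Counterexample: unpadded counters do not sort chronologically.
print(sorted(["export-9.csv", "export-10.csv"]))  # ['export-10.csv', 'export-9.csv']
```

The counterexample at the end shows why fixed-width fields matter: without padding, a later file can sort before an earlier one.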
Connecting S3 to Shaped

Setting up the connection involves preparing your data and filenames in S3, granting Shaped secure read access, and configuring the dataset in Shaped.
Step 1: Prepare Your Data and Filenames in S3
- File Formats:
  - Ensure your data is stored in one of the supported formats: Parquet (.parquet, potentially compressed, e.g. .parquet.gz), CSV (.csv), TSV (.tsv), or JSON Lines (.jsonl).
- Filename Convention (Critical for Incremental Syncs):
  - Name your files within the target S3 path such that new files are always lexicographically greater than older files. Using a timestamp prefix/suffix is the most reliable method (e.g., data_YYYYMMDD_HHMMSS.format). Without this, Shaped may reprocess files or miss new data during incremental syncs.
- Consistent Schema: Ensure all files within the path intended for a single Shaped dataset share the same schema (column names and types).
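One way to catch schema drift before uploading is to compare headers across the files destined for a single path. The sketch below is a minimal pre-upload check for CSV files only, and it compares column names but not types. The function names and the in-memory sample files are hypothetical stand-ins for real exports.

```python
import csv
import io


def header_of(csv_text: str) -> list[str]:
    """Return the column names from the first row of a CSV document."""
    return next(csv.reader(io.StringIO(csv_text)))


def mismatched_files(files: dict[str, str]) -> list[str]:
    """Return names of files whose header differs from the earliest file's header."""
    names = sorted(files)
    expected = header_of(files[names[0]])
    return [n for n in names[1:] if header_of(files[n]) != expected]


# Hypothetical file contents standing in for objects under one S3 path.
files = {
    "data_20240115_100000.csv": "user_id,item_id,timestamp\nu1,i1,2024-01-15T10:00:00Z\n",
    "data_20240115_110000.csv": "user_id,item_id,timestamp\nu2,i9,2024-01-15T11:00:00Z\n",
    "data_20240115_120000.csv": "user_id,item,timestamp\nu3,i4,2024-01-15T12:00:00Z\n",
}
print(mismatched_files(files))  # ['data_20240115_120000.csv']
```

For Parquet or JSON Lines data, the same idea applies, but the header extraction would need a format-specific reader.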
Step 2: Grant Shaped Read-Only Access to Your S3 Bucket
Shaped requires secure, read-only access to list your bucket and get the file objects. This is done by adding a statement to your S3 bucket policy that grants access to Shaped's specific IAM Role ARN.
- Obtain Shaped's IAM Role ARN: Contact the Shaped team (via your support channel or sales contact) to get the precise Principal ARN for Shaped's CustomerS3DataAccessRole. This will look something like arn:aws:iam::<shaped_account_id>:role/CustomerS3DataAccessRole.
- Edit Your S3 Bucket Policy: Navigate to your S3 bucket in the AWS console, go to the "Permissions" tab, and edit the "Bucket policy".
- Add Policy Statement: Add the following JSON statement to your existing policy (or create a new policy if one doesn't exist). Replace {your_bucket} with your actual bucket name and <shaped_account_id> with the ID provided by Shaped.
- Save Changes: Save the updated bucket policy.
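A read-only policy of the kind described in these steps might be generated as in the sketch below. It assumes that "list your bucket and get the file objects" maps to the standard `s3:ListBucket` and `s3:GetObject` actions. The `Sid` value and the function name are hypothetical, and the exact statement Shaped recommends may differ, so confirm it with the Shaped team before applying.

```python
import json


def shaped_read_policy(bucket: str, role_arn: str) -> dict:
    """Build a bucket policy granting list and read access to a single IAM role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ShapedReadOnlyAccess",  # hypothetical label
                "Effect": "Allow",
                "Principal": {"AWS": role_arn},
                "Action": ["s3:ListBucket", "s3:GetObject"],
                "Resource": [
                    f"arn:aws:s3:::{bucket}",    # the bucket, for s3:ListBucket
                    f"arn:aws:s3:::{bucket}/*",  # its objects, for s3:GetObject
                ],
            }
        ],
    }


# Placeholders: substitute your bucket name and the ARN provided by Shaped.
policy = shaped_read_policy(
    "your-bucket",
    "arn:aws:iam::<shaped_account_id>:role/CustomerS3DataAccessRole",
)
print(json.dumps(policy, indent=2))
```

The printed JSON can be pasted into the bucket policy editor. If a policy already exists, add only the statement object to its existing `Statement` list rather than replacing the whole document.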
Step 3: Configure the Shaped Dataset (YAML)
Define the S3 location, file format, data schema, and unique keys in a Shaped dataset configuration file.
Create a YAML file (e.g., s3_dataset.yaml):
Key Configuration Points:
- schema_type: Must be CUSTOM for the S3 connector.
- column_schema: You must define the schema accurately, matching the columns and types in your S3 files. Ensure timestamp columns are marked correctly.
- s3_path: Specify the correct bucket and path. Use a trailing / for a directory or a glob pattern (*) for specific file matching.
- s3_format: Must match the actual format of your files.
- unique_keys: Important for ensuring data integrity, especially if data might be uploaded with slight overlaps or corrections.
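The configuration points above can be turned into a simple local check before submitting the file. The sketch below operates on an already-parsed configuration (a plain dictionary) and is not part of Shaped's tooling. It assumes `column_schema` maps column names to type names and `unique_keys` is a list of column names. The type names and the `PARQUET` format value in the example are placeholders, not confirmed Shaped values.

```python
REQUIRED_KEYS = ("schema_type", "column_schema", "s3_path", "s3_format", "unique_keys")


def config_problems(cfg: dict) -> list[str]:
    """Return human-readable problems found in a parsed dataset configuration."""
    problems = [f"missing key: {k}" for k in REQUIRED_KEYS if k not in cfg]
    if cfg.get("schema_type") != "CUSTOM":
        problems.append("schema_type must be CUSTOM")
    path = str(cfg.get("s3_path", ""))
    if not (path.endswith("/") or "*" in path):
        problems.append("s3_path should end with / or contain a * glob")
    columns = cfg.get("column_schema") or {}
    for key in cfg.get("unique_keys") or []:
        if key not in columns:
            problems.append(f"unique key not in column_schema: {key}")
    return problems


# Hypothetical configuration; type names and format value are placeholders.
cfg = {
    "schema_type": "CUSTOM",
    "column_schema": {"user_id": "String", "item_id": "String", "timestamp": "DateTime"},
    "s3_path": "s3://your-bucket/path/to/data/",
    "s3_format": "PARQUET",
    "unique_keys": ["user_id", "item_id", "timestamp"],
}
print(config_problems(cfg))  # []

broken = dict(cfg, schema_type="EVENTS", unique_keys=["event_id"])
print(config_problems(broken))
```

A check like this catches typos locally, but Shaped's own validation at dataset creation remains the authority on what is accepted.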
Step 4: Create the Dataset in Shaped
Use the Shaped CLI to create the dataset using your configured YAML file:
Shaped will validate the configuration, check permissions, and begin the initial data sync by listing and reading files from your S3 path. Monitor progress and status via the Shaped Dashboard or CLI (shaped view-dataset --dataset-name your_s3_dataset_name).
What Happens Next? Syncing, Modeling, Serving

Once the S3 connection is established:
- Initial Sync: Shaped reads all existing files matching the s3_path and s3_format based on their lexicographical order.
- Incremental Syncs: On the defined schedule_interval (default: hourly), Shaped lists the s3_path and checks for any new files (whose names are lexicographically greater than the last file processed). It then ingests data only from these new files.
- Model Training: Shaped uses the synced data from S3 to train its powerful AI models for search and recommendations.
- API Serving: After models are trained, Shaped's APIs are ready to provide personalized results derived from your S3 data.
- Ongoing Updates: As you upload new data files (following the naming convention!) to S3, Shaped automatically picks them up on its next scheduled sync and incorporates them into subsequent model training runs.
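The incremental selection rule described above can be sketched in a few lines. This illustrates the stated behavior (pick files whose names sort after the last one processed) and is not Shaped's actual implementation. The function name and sample object names are hypothetical.

```python
from typing import Optional


def new_files(listed_keys: list[str], last_processed: Optional[str]) -> list[str]:
    """Return keys to ingest, in order: all on first sync, else only later names."""
    ordered = sorted(listed_keys)
    if last_processed is None:
        return ordered  # initial sync reads everything
    return [k for k in ordered if k > last_processed]


listing = [
    "events-2024-01-15-10-00-00.parquet",
    "events-2024-01-15-11-00-00.parquet",
    "events-2024-01-15-12-00-00.parquet",
]
print(new_files(listing, None))  # all three files, oldest first
print(new_files(listing, "events-2024-01-15-11-00-00.parquet"))
# ['events-2024-01-15-12-00-00.parquet']

# A late upload with an older timestamp sorts before the last processed
# name, so a rule like this would never pick it up.
late = listing + ["events-2024-01-15-09-00-00.parquet"]
print(new_files(late, "events-2024-01-15-12-00-00.parquet"))  # []
```

The last case shows why the naming convention matters: backfilled files named with older timestamps fall below the cutoff and would be missed under this rule.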
Conclusion: Activate Your S3 Data Lake for Intelligent Personalization
Your data lake in S3 holds immense potential for driving personalization. Shaped's S3 connector provides a direct, efficient way to bridge this data store with state-of-the-art AI models, bypassing the need for complex custom pipelines and ML infrastructure. By correctly formatting your data files, setting up secure access, and configuring the connection in Shaped, you can easily activate your S3 data to power dynamic, personalized recommendations and search experiences that engage users and drive results.
Ready to unlock the AI potential of your data stored in S3?
Request a demo of Shaped today to see it in action with your specific use case. Or, start exploring immediately with our free trial sandbox.