H&M Dataset: Powering Personalized Fashion Recommendations at Scale

The H&M Personalized Fashion Recommendations dataset is a favorite in the ML community for testing large-scale, real-world recommendation systems. With millions of transactions and rich metadata, it offers a challenging benchmark for building personalized fashion experiences. In this post, we show how to connect the H&M dataset to Shaped, an AI-native relevance platform, to go beyond basic co-purchase signals. From implicit feedback and cold-start handling to hybrid ranking with item and user features, Shaped helps teams build smarter fashion recommenders, faster.

June 4, 2025

Tullie Murrell

In the dynamic world of fashion retail, providing relevant personalized recommendations is key to customer engagement and sales. The H&M Personalized Fashion Recommendations dataset, notably released as part of a major Kaggle competition, offers an invaluable large-scale resource for researchers and practitioners aiming to tackle this challenge head-on.

This dataset provides a unique window into real-world customer purchase behavior within the fashion domain, presenting distinct opportunities and complexities compared to datasets focused on explicit ratings (like movies or books). Understanding this dataset is essential for anyone developing recommender systems for fashion e-commerce or retail environments.

What is the H&M Personalized Fashion Recommendations Dataset?

This dataset originates from a competition hosted by H&M on Kaggle in 2022. The primary goal was to predict which articles (products) a customer would purchase in the week following a given historical period, based on their previous interactions and metadata. It essentially captures transactional data and associated customer/item information.
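The competition scored submissions with mean average precision at 12 (MAP@12) over each customer's predicted articles. The following is a minimal sketch of that metric for local validation, not the official scorer; the function names are illustrative, and it assumes customers with no purchases in the target week are skipped.

```python
from typing import Sequence


def average_precision_at_k(actual: Sequence[str], predicted: Sequence[str], k: int = 12) -> float:
    """AP@k for one customer: average of precision at each rank where a hit occurs."""
    actual_set = set(actual)
    if not actual_set:
        return 0.0
    hits = 0
    score = 0.0
    seen = set()
    for rank, article_id in enumerate(predicted[:k], start=1):
        # Count each relevant article once, even if it is predicted twice
        if article_id in actual_set and article_id not in seen:
            hits += 1
            score += hits / rank
        seen.add(article_id)
    return score / min(len(actual_set), k)


def mean_average_precision_at_k(actuals, predictions, k: int = 12) -> float:
    """Mean AP@k over customers that have at least one actual purchase."""
    scores = [
        average_precision_at_k(actual, predicted, k)
        for actual, predicted in zip(actuals, predictions)
        if actual
    ]
    return sum(scores) / len(scores) if scores else 0.0
```

Because hits at earlier ranks contribute more, the order of the predicted articles matters as much as which articles are included.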

The core components include:

Transaction History: Records of customer purchases over a period of time.
Customer Metadata: Basic anonymized information about the customers.
Article Metadata: Detailed information about the clothing items available for purchase.

Key Characteristics & Data Structure

The H&M dataset stands out due to several key characteristics:

Domain: Fast Fashion Retail.
Data Type: Primarily Implicit Feedback (purchase history). Unlike datasets with explicit star ratings, recommendations must be inferred from buying behavior.
Scale: Very large, encompassing roughly 1.4 million customers, over 100,000 unique articles, and more than 30 million transactions. This reflects real-world retail scenarios.
Temporal Nature: Each transaction carries a purchase date (t_dat, at day granularity), making it well suited to sequential recommendation models that capture evolving trends and customer tastes.
Rich Metadata: Includes detailed attributes for both articles and customers.
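Because the data is ordered in time, a common way to validate models offline is to hold out the most recent week, mirroring the competition task. The sketch below assumes a DataFrame with a t_dat column; the function name and the toy rows are illustrative only.

```python
import pandas as pd


def split_last_week(transactions: pd.DataFrame, date_col: str = "t_dat"):
    """Hold out the final 7 days of transactions as a validation set."""
    dates = pd.to_datetime(transactions[date_col])
    cutoff = dates.max() - pd.Timedelta(days=7)
    train = transactions[dates <= cutoff]
    valid = transactions[dates > cutoff]
    return train, valid


# Toy rows standing in for transactions_train.csv
toy_transactions = pd.DataFrame(
    {
        "t_dat": ["2020-09-01", "2020-09-15", "2020-09-20", "2020-09-22"],
        "customer_id": ["a", "a", "b", "c"],
        "article_id": ["0108775015", "0108775044", "0110065001", "0111565001"],
    }
)
train, valid = split_last_week(toy_transactions)
```

Splitting by time rather than at random avoids leaking future purchases into training.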

Core Data Files:

transactions_train.csv: The main interaction file, linking customer_id, article_id, t_dat (purchase date), price, and sales_channel_id. This is the source of implicit feedback signals.
customers.csv: Contains customer_id and associated features like age, postal_code, and club membership status. Useful for customer segmentation and cold-start scenarios.
articles.csv: Contains article_id and detailed product features like product_code, product_type_name, graphical_appearance_name, colour_group_name, department_name, etc. Essential for content-based filtering and understanding item relationships.
sample_submission.csv: Defines the prediction task format (predicting multiple relevant article_ids for each customer_id).
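A short sketch of loading the interaction file with pandas follows. It reads from an in-memory sample so it runs without the download; the sample rows and the load_transactions helper are made up for illustration. Reading article_id as a string matters because the IDs carry leading zeros that an integer dtype would drop.

```python
import io

import pandas as pd

# In-memory stand-in for transactions_train.csv (illustrative rows only)
sample_transactions_csv = io.StringIO(
    "t_dat,customer_id,article_id,price,sales_channel_id\n"
    "2018-09-20,cust_a,0663713001,0.0508,2\n"
    "2018-09-20,cust_b,0541518023,0.0305,2\n"
)


def load_transactions(source) -> pd.DataFrame:
    """Read transactions, keeping IDs as strings and parsing t_dat as a date."""
    return pd.read_csv(
        source,
        dtype={"article_id": str, "customer_id": str},
        parse_dates=["t_dat"],
    )


transactions = load_transactions(sample_transactions_csv)
```

With the real file, pass its path instead of the StringIO object; the same dtype for article_id applies when reading articles.csv so the two files join cleanly.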

Why is the H&M Dataset Important for the Recommender Systems community?

This dataset holds significant value within the recommender systems community:

Real-World Scale & Complexity: Offers a challenging, large-scale benchmark reflecting the complexities of real retail environments (sparsity, huge item/user space).
Implicit Feedback Focus: Provides a rich playground for developing and evaluating algorithms designed for implicit signals (purchases), which are more common in e-commerce than explicit ratings.
Sequential Purchase Patterns: The timestamped data is crucial for building models that understand fashion trends, seasonality, and how customer preferences evolve over time.
Rich Feature Engineering: The detailed customer and article metadata encourages sophisticated feature engineering to improve recommendation quality, especially for cold-start users or new items.
Fashion-Specific Challenges: Allows researchers to tackle problems unique to fashion, such as managing vast assortments, capturing style preferences, and dealing with rapid trend cycles.
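Sequential models typically consume one time-ordered list of purchases per customer. The sketch below shows one way to build those lists with pandas; the function name, the max_len cap, and the toy rows are assumptions for illustration.

```python
import pandas as pd


def build_purchase_sequences(transactions: pd.DataFrame, max_len: int = 50) -> dict:
    """Return {customer_id: [article_id, ...]} ordered by purchase date."""
    ordered = transactions.sort_values(["customer_id", "t_dat"], kind="stable")
    sequences = ordered.groupby("customer_id")["article_id"].agg(list)
    # Keep only the most recent max_len purchases per customer
    return {customer_id: seq[-max_len:] for customer_id, seq in sequences.items()}


toy_transactions = pd.DataFrame(
    {
        "t_dat": ["2020-01-02", "2020-01-03", "2020-01-01"],
        "customer_id": ["b", "a", "a"],
        "article_id": ["3", "2", "1"],
    }
)
sequences = build_purchase_sequences(toy_transactions)
```

Since t_dat only has day granularity, purchases made on the same day have no defined order; the stable sort simply keeps them in file order.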

Strengths of the H&M Dataset

Massive Scale: Reflects real-world retail transaction volumes.
Real-World Implicit Data: Focuses on purchase behavior, common in e-commerce.
Sequential Nature: Timestamps enable modeling temporal dynamics and trends.
Rich Metadata: Detailed customer and article features support hybrid and content-based approaches.
Relevant Business Problem: Directly addresses the practical challenge of personalized fashion recommendations.
Public Benchmark: Provides a common ground for comparing different recommendation strategies via the Kaggle competition results.

Weaknesses & Challenges

Implicit Feedback Ambiguity: Purchases indicate preference, but non-purchase doesn't necessarily mean dislike (could be unawareness, stock issues, price sensitivity). Requires careful handling of negative sampling.
Cold-Start Problem: Recommending items to new users or predicting purchases of new articles remains challenging.
Seasonality & Trends: Fashion is highly dynamic; models need to adapt to changing styles and seasonal demand.
Computational Cost: The sheer scale requires significant computational resources for processing, feature engineering, and model training.
Static Snapshot: Represents a specific historical period; doesn't capture real-time inventory changes or ongoing trends beyond the dataset's timeframe.
Data Sparsity: Despite the volume, individual users purchase only a tiny fraction of the available articles.
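One simple way to handle the missing negatives is to sample, for each purchase, an article the customer has not bought and label it 0. This is a minimal sketch using uniform sampling on a small sample; the function name and toy data are illustrative, and popularity-weighted or in-batch sampling are common alternatives at full scale.

```python
import numpy as np
import pandas as pd


def sample_negatives(transactions, all_article_ids, n_per_positive=1, seed=0):
    """Draw unpurchased articles uniformly at random for each purchase row."""
    rng = np.random.default_rng(seed)
    catalog = list(all_article_ids)
    purchased = transactions.groupby("customer_id")["article_id"].agg(set).to_dict()
    rows = []
    for customer_id in transactions["customer_id"]:
        seen = purchased[customer_id]
        if len(seen) >= len(set(catalog)):
            continue  # Customer bought the whole catalog; nothing to sample
        drawn = 0
        while drawn < n_per_positive:
            candidate = catalog[rng.integers(len(catalog))]
            if candidate not in seen:  # Rejection sampling
                rows.append((customer_id, candidate, 0))
                drawn += 1
    return pd.DataFrame(rows, columns=["customer_id", "article_id", "label"])


toy_transactions = pd.DataFrame(
    {"customer_id": ["a", "a", "b"], "article_id": ["1", "2", "3"]}
)
negatives = sample_negatives(toy_transactions, ["1", "2", "3", "4", "5"])
```

Sampled negatives are only assumed negatives: the customer may simply never have seen the article, which is the ambiguity described above.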

Common Use Cases & Applications

Developing and evaluating implicit feedback recommendation algorithms (e.g., ALS, BPR, LightGCN).
Building sequential recommendation models (e.g., GRU4Rec, SASRec, BERT4Rec) to predict next purchases.
Tackling the cold-start problem using content features or hybrid approaches.
Extensive feature engineering combining customer, article, and transaction data.
Analyzing customer purchase behavior and segmentation in fashion.
Modeling fashion trends and seasonality.
Developing hybrid recommender systems combining collaborative, content-based, and sequential signals.
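For cold-start customers with no history, a recency-weighted popularity list is a common fallback and a useful baseline for any of the approaches above. The sketch below ranks articles by purchase count over the most recent days; the function name, the 7-day window, and the toy rows are assumptions.

```python
import pandas as pd


def recent_top_k(transactions: pd.DataFrame, k: int = 12, window_days: int = 7) -> list:
    """Most purchased articles within the last window_days of the data."""
    dates = pd.to_datetime(transactions["t_dat"])
    cutoff = dates.max() - pd.Timedelta(days=window_days)
    recent = transactions[dates > cutoff]
    return recent["article_id"].value_counts().head(k).index.tolist()


toy_transactions = pd.DataFrame(
    {
        "t_dat": ["2020-08-01"] * 5 + ["2020-09-20"] * 3 + ["2020-09-21"] * 2,
        "article_id": ["C"] * 5 + ["A"] * 3 + ["B"] * 2,
    }
)
fallback = recent_top_k(toy_transactions, k=2)
```

Restricting the count to a recent window keeps the fallback aligned with current trends instead of all-time bestsellers.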

How to Access the H&M Dataset

The dataset is publicly available through the original Kaggle competition page:

Kaggle Competition Page: https://www.kaggle.com/competitions/h-and-m-personalized-fashion-recommendations/data

Users typically need a Kaggle account to download the data files and must agree to the competition's rules/terms of use.

Connecting the H&M Dataset to Shaped

Building personalized fashion recommendations with the H&M dataset is a task well-suited for Shaped, which excels at handling large-scale implicit feedback and rich metadata. Connecting this dataset involves preparing the transaction data and potentially the metadata files, then defining how Shaped should use them.

1. Setup: Install the necessary tools and initialize the Shaped CLI with your API key.

    install_and_init.sh
    
pip install shaped pyyaml pandas

# Read the API key from the environment, falling back to a placeholder
SHAPED_API_KEY="${TEST_SHAPED_API_KEY:-<YOUR_API_KEY>}"

shaped init --api-key "$SHAPED_API_KEY"

2. Dataset Preparation (Conceptual): Download the core data files (transactions_train.csv, articles.csv, customers.csv) from the Kaggle competition page. The primary focus for interactions is transactions_train.csv.

Map the transaction fields to Shaped's core requirements:

customer_id -> user_id
article_id -> item_id
t_dat (transaction date) -> created_at (needs conversion to Unix epoch seconds/milliseconds)
Keep price as an optional event feature.

You'll need to process transactions_train.csv, perform the timestamp conversion, and save it in a format like JSONL or CSV for upload. For a comprehensive model, you'd likely prepare articles.csv and customers.csv separately for upload as item and user feature datasets.

    prepare_transactions.py
    
# Example using pandas (assuming transactions_train.csv is downloaded)

import pandas as pd

data_dir = "path/to/hm/data"
transactions_file = f"{data_dir}/transactions_train.csv"

# Load data (the file is large; consider chunksize= if memory is tight).
# Keep article_id as a string so leading zeros are preserved.
transactions_df = pd.read_csv(transactions_file, dtype={'article_id': str})

# Convert the date string to Unix epoch seconds (midnight UTC), vectorized
t_dat = pd.to_datetime(transactions_df['t_dat'], format='%Y-%m-%d')
transactions_df['created_at'] = (t_dat - pd.Timestamp('1970-01-01')) // pd.Timedelta(seconds=1)

# Rename columns to Shaped's standard names (optional, can also be mapped in fetch)
transactions_df = transactions_df.rename(columns={'customer_id': 'user_id', 'article_id': 'item_id'})

# Select relevant columns
shaped_transactions_df = transactions_df[['user_id', 'item_id', 'created_at', 'price']]

# Write the prepared file as JSONL
prepared_file_path = f'{data_dir}/shaped_ready_transactions.jsonl'
shaped_transactions_df.to_json(prepared_file_path, orient='records', lines=True)
print(f"H&M transaction data prepared at: {prepared_file_path}")

# --- Similar preparation would be done for articles.csv and customers.csv ---
# articles_df = pd.read_csv(f"{data_dir}/articles.csv", dtype={'article_id': str})
# ... process and save articles_df to articles.jsonl ...
# customers_df = pd.read_csv(f"{data_dir}/customers.csv")
# ... process and save customers_df to customers.jsonl ...

  
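The metadata preparation mentioned in the comments above could look like the following sketch. It keeps only the columns used later in the model schema and writes JSONL. It runs on in-memory sample rows and a temporary directory so it works without the download; the csv_to_jsonl helper and the sample values are illustrative assumptions.

```python
import io
import json
import os
import tempfile

import pandas as pd

ARTICLE_COLUMNS = ["article_id", "product_type_name", "graphical_appearance_name",
                   "colour_group_name", "department_name"]
CUSTOMER_COLUMNS = ["customer_id", "age", "club_member_status"]


def csv_to_jsonl(source, columns, out_path, dtype=None) -> pd.DataFrame:
    """Read selected CSV columns and write them as one JSON object per line."""
    df = pd.read_csv(source, usecols=columns, dtype=dtype)[columns]
    df.to_json(out_path, orient="records", lines=True)
    return df


# Illustrative stand-ins for articles.csv and customers.csv
articles_csv = io.StringIO(
    "article_id,product_type_name,graphical_appearance_name,colour_group_name,department_name\n"
    "0108775015,Vest top,Solid,Black,Jersey Basic\n"
)
customers_csv = io.StringIO(
    "customer_id,age,club_member_status\n"
    "cust_a,49,ACTIVE\n"
    "cust_b,,ACTIVE\n"
)

out_dir = tempfile.mkdtemp()
articles_path = os.path.join(out_dir, "articles.jsonl")
customers_path = os.path.join(out_dir, "customers.jsonl")

csv_to_jsonl(articles_csv, ARTICLE_COLUMNS, articles_path, dtype={"article_id": str})
csv_to_jsonl(customers_csv, CUSTOMER_COLUMNS, customers_path)

with open(articles_path) as f:
    first_article = json.loads(f.readline())
```

Missing values such as an empty age are written as null rather than dropped, so every customer remains available for feature joins.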

3. Create Shaped Datasets using URI: Use the create-dataset-from-uri command to upload the prepared transaction data. Repeat this process for the prepared article and customer metadata files if you created them.

    upload-hm-datasets.sh
    
  

    
# Upload transactions data
shaped create-dataset-from-uri --name hm_transactions \
                             --path path/to/hm/data/shaped_ready_transactions.jsonl \
                             --type jsonl

# Upload articles metadata (if prepared)
shaped create-dataset-from-uri --name hm_articles \
                             --path path/to/hm/data/articles.jsonl \
                             --type jsonl

# Upload customers metadata (if prepared)
shaped create-dataset-from-uri --name hm_customers \
                             --path path/to/hm/data/customers.jsonl \
                             --type jsonl

  

Monitor dataset creation using shaped list-datasets.

4. Create Shaped Model: Define the model schema (.yaml), connecting the datasets and specifying how to fetch features. This example shows connecting transactions, articles, and customers.

    hm_model_schema.py
    
import os

import yaml

dir_path = "hm_assets"  # Created below if it does not exist
os.makedirs(dir_path, exist_ok=True)

hm_fashion_model_schema = {
    "model": {
        "name": "hm_fashion_recommendations",
        # Objective implicitly becomes ranking/recommendation based on interactions
    },
    "connectors": [
        {
            "type": "Dataset",
            "id": "hm_transactions",
            "name": "transactions"
        },
        {
            "type": "Dataset",
            "id": "hm_articles",
            "name": "articles"
        },
        {
            "type": "Dataset",
            "id": "hm_customers",
            "name": "customers"
        }
    ],
    "fetch": {
        # The prepared transactions file already uses user_id and item_id
        "events": """
SELECT
    user_id,
    item_id,
    created_at,
    1 AS label
FROM transactions
""",
        "items": """
SELECT
    article_id AS item_id,
    product_type_name,
    graphical_appearance_name,
    colour_group_name,
    department_name
FROM articles
""",
        "users": """
SELECT
    customer_id AS user_id,
    age,
    club_member_status
FROM customers
"""
    }
}

# Write the schema to the YAML file that the CLI reads
schema_path = os.path.join(dir_path, "hm_fashion_model_schema.yaml")
with open(schema_path, "w") as f:
    yaml.safe_dump(hm_fashion_model_schema, f, sort_keys=False)

Create the model using the CLI:

    create-model.sh
    
shaped create-model --file hm_assets/hm_fashion_model_schema.yaml

Shaped will ingest the transaction events and enrich them with the user and item features, automatically handling categorical embeddings and numerical scaling. This lets you build hybrid recommendations that draw on the rich metadata available in the H&M dataset.

Conclusion: A Benchmark for Modern Fashion Recommendations

The H&M Personalized Fashion Recommendations dataset serves as a critical and challenging benchmark for developing and evaluating modern recommender systems, particularly within the fashion domain. Its massive scale, reliance on implicit feedback from purchase history, rich metadata, and inherent sequential nature accurately reflect many real-world retail scenarios. While it presents significant computational and modeling challenges (like cold start and seasonality), working with this dataset provides invaluable experience in building practical, large-scale personalized recommendation solutions for the dynamic world of fashion e-commerce.

Request a demo of Shaped today to see it in action with your specific use case. Or, start exploring immediately with our free trial sandbox.
