Criteo Dataset: Tackling Large-Scale Click-Through Rate Prediction

Click-through rate (CTR) prediction is central to modern advertising and recommendation systems, and the Criteo dataset has become the de facto benchmark for advancing this task at industrial scale. With tens of millions to billions of rows and a blend of dense numerical and sparse categorical features, it poses unique modeling and computational challenges. This article unpacks the dataset’s structure, scale, and role in driving innovations like embedding techniques and hybrid model architectures—offering a clear lens into why Criteo remains a crucial resource for anyone building large-scale machine learning systems.

In the world of computational advertising and online recommendations, accurately predicting the likelihood of a user clicking on an ad (Click-Through Rate or CTR) is paramount. The Criteo datasets, released by Criteo AI Lab, have become cornerstone benchmarks for developing and evaluating machine learning models designed for this critical task.

These datasets are renowned for their massive scale and challenging mix of features, reflecting the complexities of real-world display advertising data. Understanding the Criteo dataset is essential for anyone working on CTR prediction, large-scale machine learning systems, or handling high-dimensional sparse data.

What is the Criteo Dataset?

"Criteo dataset" typically refers to several large-scale datasets released by Criteo, derived from anonymized traffic logs of their display advertising platform. The primary goal associated with these datasets is binary classification: predicting whether a displayed ad was clicked (label = 1) or not (label = 0).

Key components include:

  1. Click Label: The target variable indicating if a click occurred.
  2. Numerical Features (Dense): A set of anonymized features representing counts or other numerical measurements (e.g., related to user browsing behavior, ad properties).
  3. Categorical Features (Sparse): A set of anonymized features representing categorical information (e.g., user ID, ad ID, publisher ID, device type). These features are often high-cardinality, meaning they have many unique possible values, leading to high-dimensional sparse representations.
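
To make this layout concrete, here's a minimal sketch of how one tab-separated row splits into these three components. The field values below are invented (real Criteo rows contain anonymized hashes), and a missing value appears as an empty field between tabs:

```python
# Build a hypothetical Criteo-style row: 1 label, 13 integer features (one
# missing, shown as an empty field), and 26 hashed categorical features.
dense_vals = ["5", ""] + ["3"] * 11
cat_vals = [f"{i:08x}" for i in range(26)]
line = "\t".join(["1"] + dense_vals + cat_vals)

fields = line.split("\t")
label = int(fields[0])                                 # 0 = no click, 1 = click
dense = [int(x) if x else None for x in fields[1:14]]  # I1..I13; None = missing
categorical = fields[14:40]                            # C1..C26 (hashed strings)

print(label, len(dense), len(categorical))  # 1 13 26
```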

Key Characteristics & Versions

The Criteo datasets are defined by:

  • Domain: Computational Advertising / Display Advertising.
  • Primary Task: CTR Prediction (Binary Classification).
  • Scale: Extremely large, often ranging from tens of millions of samples (Kaggle versions) to billions of samples (Terabyte Click Logs).
  • Feature Mix: A characteristic blend of dense (numerical) and high-cardinality sparse (categorical) features. This mix presents unique modeling challenges.
  • Data Format: Typically provided in tab-separated value (TSV) format, with columns for the label, dense features, and categorical features. Features are anonymized.
  • Sparsity: The categorical features lead to extremely high-dimensional and sparse input data when one-hot encoded or embedded.
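
As a rough illustration of that sparsity (the vocabulary sizes below are invented, not Criteo's actual statistics), summing the vocabularies of 26 categorical features yields a one-hot input with millions of dimensions, of which each row activates exactly 26:

```python
# Hypothetical vocabulary sizes for 26 categorical features: a few huge, many small
vocab_sizes = [1_400_000, 560_000, 290_000] + [40_000] * 23

one_hot_dim = sum(vocab_sizes)        # total one-hot input dimension
nonzeros_per_row = len(vocab_sizes)   # one active value per categorical feature

print(one_hot_dim, nonzeros_per_row)  # 3170000 26
print(f"density: {nonzeros_per_row / one_hot_dim:.2e}")
```

This extreme imbalance between total dimension and active entries is why embedding tables, rather than explicit one-hot vectors, dominate in practice.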

Popular Versions:

  • Criteo Kaggle Display Advertising Challenge Dataset (2014): A widely used version with ~45 million samples, 13 numerical features, and 26 categorical features. A standard benchmark.
  • Criteo Terabyte Click Logs: A massive dataset (over 1 TB of click logs spanning 24 days) containing billions of events, offering a challenge at an even larger scale.

Why is the Criteo Dataset Important?

Its significance stems from several factors:

  1. Industry Standard CTR Benchmark: It's one of the most widely recognized public benchmarks for evaluating CTR prediction models, allowing for direct comparison of different approaches.
  2. Challenge for Large-Scale ML: Its sheer size tests the scalability and efficiency of machine learning algorithms and systems.
  3. Handling High-Dimensional Sparse Data: The numerous high-cardinality categorical features make it ideal for developing and testing techniques specifically designed for sparse data (e.g., embedding layers, factorization machines).
  4. Real-World Relevance: While anonymized, the data structure and task closely mirror real challenges faced in the online advertising industry.
  5. Driving Model Innovation: It has spurred research into specialized model architectures that efficiently combine dense and sparse features (e.g., Factorization Machines (FM), Field-aware FM (FFM), DeepFM, DCN, Wide & Deep).
  6. Relevance to Recommendations: Predicting clicks is a form of interaction prediction, a core task in many recommender systems, especially in scenarios like sponsored product recommendations or ad targeting.
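
For intuition on why FM-style models scale to this feature mix, here's a toy sketch (pure Python, random values, small sizes) of the FM pairwise-interaction term. The linear-time identity 0.5 · Σ_f ((Σ_i v_if x_i)² − Σ_i (v_if x_i)²) reduces the naive O(n²) sum of pairwise dot products to O(n·k):

```python
import random

random.seed(0)
n_features, k = 8, 4  # toy sizes; real models use thousands of features
x = [random.random() for _ in range(n_features)]                  # feature values
V = [[random.gauss(0, 1) for _ in range(k)] for _ in range(n_features)]  # latent vectors

# Linear-time form: 0.5 * sum_f ((sum_i v_if x_i)^2 - sum_i (v_if x_i)^2)
fm_fast = 0.0
for f in range(k):
    s = sum(V[i][f] * x[i] for i in range(n_features))
    s_sq = sum((V[i][f] * x[i]) ** 2 for i in range(n_features))
    fm_fast += 0.5 * (s * s - s_sq)

# Naive O(n^2) pairwise form: sum_{i<j} <v_i, v_j> x_i x_j
fm_naive = sum(
    sum(V[i][f] * V[j][f] for f in range(k)) * x[i] * x[j]
    for i in range(n_features) for j in range(i + 1, n_features)
)

print(abs(fm_fast - fm_naive) < 1e-9)  # True
```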

Strengths of the Criteo Dataset

  • Massive Scale: Provides data volumes representative of real-world industrial applications.
  • Realistic Feature Mix: Contains both dense numerical and sparse categorical features, common in web-scale data.
  • Standardized Benchmark: Facilitates fair comparison of different CTR prediction models.
  • Direct Industry Relevance: Addresses a core problem in computational advertising.
  • Publicly Available: Accessible for academic research and industry practitioners.

Weaknesses & Challenges

  • Computational Cost: Processing and training models on these datasets require significant computational resources (memory, CPU/GPU time).
  • Feature Anonymization: Features lack semantic meaning, making feature interpretation difficult and limiting some types of feature engineering.
  • Extreme Sparsity: High-cardinality categorical features lead to very high dimensions, posing challenges for many standard algorithms.
  • Static Snapshot: Represents data from a specific period; doesn't capture evolving user behavior or ad inventory dynamically.
  • Focus Solely on CTR: Doesn't include other potential objectives like conversions or downstream user value.

Common Use Cases & Applications

  • Benchmarking CTR prediction models (Logistic Regression, FM, FFM, Deep Learning models like Wide & Deep, DeepFM, DCN, xDeepFM, AutoInt, etc.).
  • Developing and evaluating feature engineering techniques for sparse data (e.g., hashing tricks, embeddings).
  • Testing the scalability and performance of distributed machine learning systems.
  • Research into embedding methods for high-cardinality categorical features.
  • Evaluating techniques for handling the dense/sparse feature interaction challenge.
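
The hashing trick mentioned above can be sketched in a few lines: each (feature, value) pair is hashed into a fixed number of buckets, bounding the parameter count at the cost of occasional collisions. The bucket count and sample values here are arbitrary choices for illustration, not a Criteo convention:

```python
import hashlib

NUM_BUCKETS = 2 ** 20  # fixed embedding-table size; an assumed choice

def hash_bucket(feature_name: str, value: str, buckets: int = NUM_BUCKETS) -> int:
    """Deterministically map a (feature, value) pair to a bucket index."""
    digest = hashlib.md5(f"{feature_name}={value}".encode()).hexdigest()
    return int(digest, 16) % buckets

# Two hypothetical categorical values from one row
row = {"C1": "68fd1e64", "C2": "80e26c9b"}
indices = [hash_bucket(name, value) for name, value in row.items()]

print(all(0 <= i < NUM_BUCKETS for i in indices))  # True
```

Including the feature name in the hashed string keeps identical values in different columns from always colliding into the same bucket.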

How to Access the Criteo Datasets

The primary sources for accessing the Criteo datasets are:

  • Criteo AI Lab: Hosts the Terabyte Click Logs and other research datasets released by Criteo.
  • Kaggle: Hosts the 2014 Display Advertising Challenge dataset.

Access typically requires agreeing to specific terms of use or competition rules.

Connecting the Criteo Dataset to Shaped

Shaped is well-suited to handle the mix of dense and sparse features common in datasets like Criteo, making it easy to build powerful CTR or recommendation models without extensive manual feature engineering for embeddings. Here's how you might conceptually connect the Criteo Kaggle dataset:

1. Setup: Install prerequisites and initialize the Shaped client.

init_shaped.sh

pip install shaped

export SHAPED_API_KEY=${SHAPED_API_KEY:-<YOUR_API_KEY>}

shaped init --api-key $SHAPED_API_KEY

2. Dataset Preparation (Conceptual):

Download the Criteo Kaggle TSV file (train.txt). You'll likely need to handle missing values (often filled with 0 for numerical, empty string for categorical) and assign proper column names (label, I1...I13, C1...C26).

For Shaped's standard recommendation models, you need user_id, item_id, and created_at. Since Criteo lacks these explicitly:

  • Proxy IDs: You might designate certain high-cardinality categorical features as proxies (e.g., use C1 as user_id and C2 as item_id).
  • Timestamp: The Kaggle dataset lacks timestamps. You might need to add a synthetic timestamp (e.g., based on row order if assuming a sequential nature, though this is an approximation) or use a version that includes time if available. For simplicity here, we'll generate a synthetic timestamp.
prepare_criteo.py

# Example using pandas (assuming train.txt is downloaded)
import pandas as pd

data_dir = "path/to/criteo/data"

# Define column names: 1 label, 13 numerical (I*), 26 categorical (C*)
cols = ['label'] + [f'I{i}' for i in range(1, 14)] + [f'C{i}' for i in range(1, 27)]

# Fields are tab-separated; missing values appear as empty fields
criteo_df = pd.read_csv(f'{data_dir}/train.txt', sep='\t', names=cols)

# Cleaning: fill NaNs with 0 for numerical and '' for categorical features
fill_values = {f'I{i}': 0 for i in range(1, 14)}
fill_values.update({f'C{i}': '' for i in range(1, 27)})
criteo_df.fillna(fill_values, inplace=True)

# Assign proxy user/item IDs and create a synthetic timestamp
criteo_df['user_id'] = criteo_df['C1']  # Example proxy
criteo_df['item_id'] = criteo_df['C2']  # Example proxy
criteo_df['created_at'] = range(len(criteo_df))  # Synthetic ordering, not real epoch time

# Save relevant columns (including other features) to JSONL for Shaped
criteo_df.to_json(f'{data_dir}/shaped_ready_criteo.jsonl', orient='records', lines=True)

print("Criteo data conceptually prepared.")

Note: For pure CTR prediction where user/item identity isn't the focus, you might structure the input differently, perhaps using a session ID or impression ID if available.

3. Create Shaped Dataset: Define the dataset schema. Again, assuming local upload via CLI for simplicity.

generate_yaml.py

import os
import yaml

dir_path = "criteo_assets"  # Local directory for generated schema files
os.makedirs(dir_path, exist_ok=True)

criteo_dataset_schema = {
    "name": "criteo_events",
    "schedule_interval": "@daily"
    # Add cloud storage connector details (S3, BQ) for production
}

with open(f'{dir_path}/criteo_dataset_schema.yaml', 'w') as file:
    yaml.dump(criteo_dataset_schema, file)

Create the dataset:

upload_dataset.sh

dir_path=criteo_assets
shaped create-dataset --file $dir_path/criteo_dataset_schema.yaml

# Ensure your cleaned JSONL file exists
shaped dataset-insert --dataset-name criteo_events \
                      --file path/to/criteo/data/shaped_ready_criteo.jsonl \
                      --type 'jsonl'

4. Create Shaped Model: Define the model schema. This is where you tell Shaped how to use the various features. Shaped automatically handles embedding the categorical features (C*) and utilizing the numerical ones (I*).

generate_model_schema.py

import yaml

dir_path = "criteo_assets"  # Same directory as the dataset schema

# Construct feature selection strings dynamically
numerical_features = ", ".join([f"I{i}" for i in range(1, 14)])
# C1/C2 are excluded here because they were aliased to user_id/item_id earlier
categorical_features = ", ".join([f"C{i}" for i in range(3, 27)])

criteo_model_schema = {
    "model": {
        "name": "criteo_ctr_model"
        # Optionally specify model type or objectives if needed
    },
    "connectors": [
        {
            "type": "Dataset",
            "id": "criteo_events",
            "name": "criteo_events"
        }
    ],
    "fetch": {
        # Map Criteo fields and include all other numerical/categorical features
        "events": f"""
SELECT
    user_id,    -- Proxy ID (aliased from C1 earlier)
    item_id,    -- Proxy ID (aliased from C2 earlier)
    label,      -- The click label (0 or 1)
    created_at, -- The generated timestamp
    {numerical_features},
    {categorical_features} -- Shaped automatically embeds these
FROM criteo_events
"""
        # You can define separate user/item features if available in other tables/datasets
    }
}

with open(f'{dir_path}/criteo_model_schema.yaml', 'w') as file:
    yaml.dump(criteo_model_schema, file)

Create the model:

create_model.sh

shaped create-model --file criteo_assets/criteo_model_schema.yaml

Shaped will then train a model leveraging all the provided dense and sparse features to predict the label (click probability), automatically handling the complexities of high-dimensional sparse feature embedding.

Conclusion: A Crucial Benchmark for CTR Prediction and Large-Scale ML

The Criteo datasets represent indispensable benchmarks in the field of computational advertising and large-scale machine learning. Their massive scale and characteristic mix of dense and high-cardinality sparse features provide a realistic and challenging testbed for CTR prediction models. While demanding significant computational resources and presenting challenges due to feature anonymization, the Criteo datasets have driven substantial innovation in model architectures and techniques for handling sparse data effectively. They remain essential resources for researchers and practitioners aiming to develop state-of-the-art solutions for predicting user interactions in online environments, a task fundamental to both advertising and aspects of modern recommender systems.

Request a demo of Shaped today to see it in action with your specific use case. Or, start exploring immediately with our free trial sandbox.
