Beyond Static Preferences: Understanding Sequential Models in Recommendation Systems (N-Gram, SASRec, BERT4Rec & Beyond)

In a world where user behavior changes by the minute, traditional recommendation systems fall short. Sequential recommendation models offer a powerful upgrade, capturing evolving intent by analyzing the order of interactions. This article breaks down the evolution of these models, from simple N-Grams to advanced Transformers and Generative Recommenders like HSTU. It also explores the real-world challenges of deploying them and how platforms like Shaped make cutting-edge sequential modeling accessible, scalable, and production-ready.

Traditional recommendation models often treat user preferences as relatively static, learned from a long history of unordered interactions. However, user behavior is rarely static. User interests evolve, sometimes rapidly within a single session. What you click on now heavily influences what you might click on next. Buying a new phone might lead to searching for cases; watching one episode of a series makes the next episode highly relevant.

This is where Sequential Recommendation Models come in. They explicitly leverage the order of user interactions to predict future behavior. By understanding the sequence, these models can capture short-term intent, evolving interests, and contextual dependencies that static models often miss.

From simple sequence counting to powerful Transformer architectures and now even Generative Recommenders inspired by Large Language Models (LLMs), sequential modeling has become a cornerstone of modern RecSys, particularly for session-based recommendations, next-item prediction, and understanding user journeys.

This post explores the world of sequential recommendation models:

  • Why sequence matters in recommendations.
  • The evolution from simple N-Grams to sophisticated Transformers and Generative models.
  • Key architectures explained: N-Gram, Item2Vec, SASRec, BERT4Rec, GSASRec, and Generative approaches (HSTU).
  • The challenges in building and deploying these systems.
  • How platforms like Shaped harness sequential models.
  • Future directions in sequential recommendation research.

Why Sequence Matters: Capturing User Dynamics

Imagine a user's interaction history: [Item A (Sci-Fi Movie), Item B (Action Movie), Item C (Sci-Fi Book)].

  • Non-Sequential View: A model might infer a general interest in Sci-Fi and Action.
  • Sequential View: A model might notice the user shifted from movies to books within the Sci-Fi genre. The next recommendation might be another Sci-Fi book, leveraging the immediate context.

Sequential models aim to answer: "Given the user's recent sequence of actions, what are they likely to interact with next?" This is crucial for:

  • Session-Based Recommendations: Recommending relevant items during an active browsing session.
  • Next-Item Prediction: Powering features like "next video to watch" or "next song in playlist."
  • Understanding Evolving Intent: Capturing shifts in user interest over time or within a session.

The Evolution of Sequential Modeling in RecSys

Capturing sequential patterns has evolved significantly:

  1. Markov Chains (MCs) & N-Grams: Early approaches based on short-term transitions. Limited memory.
  2. Factorizing Personalized Markov Chains (FPMC): Combined MF with MCs.
  3. Embedding-based Methods (Item2Vec): Learned item embeddings based on co-occurrence within sequences, often ignoring strict order.
  4. Recurrent Neural Networks (RNNs - GRU/LSTM): Processed sequences step-by-step, theoretically capturing longer history but facing training/parallelization issues.
  5. Attention Mechanisms & Transformers (SASRec, BERT4Rec): Revolutionary models that allow direct attention to all relevant past items, enabling parallel training and better historical understanding. 
  6. Generative Recommenders (GRs) & Advanced Transformers (HSTU): Inspired by LLM success, these models treat user actions as a language, aiming to generate future interactions and to overcome the scaling limitations of previous DLRMs.

Key Sequential Architectures Explained

Let's look at the key models:

1. N-Gram Models

  • Concept: Frequency-based approach using fixed-length subsequences (n-grams).
  • How it works: Predicts the next item from the conditional probability estimated from counts of the preceding n-1 items (see the sketch after this list).
  • Pros: Simple, cheap, interpretable for short patterns.
  • Cons: Limited memory, data sparsity for large n, doesn't generalize or learn item context.
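
To make the counting concrete, here's a minimal sketch of a next-item n-gram scorer with Laplace smoothing, mirroring the n and laplace_smoothing knobs in the Shaped config shown later. The function and variable names are illustrative, not part of any particular library.

```python
from collections import defaultdict

def build_ngram_counts(sequences, n=3):
    """Count how often each (n-1)-item context is followed by each next item."""
    context_counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for i in range(len(seq) - n + 1):
            context = tuple(seq[i:i + n - 1])
            next_item = seq[i + n - 1]
            context_counts[context][next_item] += 1
    return context_counts

def score_next_items(context_counts, recent_items, vocab, n=3, alpha=0.01):
    """Laplace-smoothed P(next item | last n-1 items) for every candidate."""
    context = tuple(recent_items[-(n - 1):])
    counts = context_counts.get(context, {})
    total = sum(counts.values())
    return {
        item: (counts.get(item, 0) + alpha) / (total + alpha * len(vocab))
        for item in vocab
    }

# Toy sessions of item IDs
sessions = [["a", "b", "c"], ["a", "b", "d"], ["b", "c", "a"]]
counts = build_ngram_counts(sessions, n=3)
print(score_next_items(counts, ["a", "b"], vocab={"a", "b", "c", "d"}))
```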

2. Item2Vec (Word2Vec Adaptation)

  • Concept: Applies Word2Vec to item sequences, learning item embeddings from co-occurrence within a context window (see the sketch after this list).
  • Pros: Learns meaningful item similarity embeddings efficiently.
  • Cons: Ignores strict sequential order, limited context window.
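
A minimal sketch using gensim's Word2Vec, treating each user's (or session's) ordered item IDs as a "sentence". The embedding_size and window_size values echo the config shown later; the data is purely illustrative.

```python
from gensim.models import Word2Vec

# Each "sentence" is one user's (or session's) ordered list of item IDs.
sequences = [
    ["item_12", "item_98", "item_34"],
    ["item_98", "item_34", "item_57"],
    ["item_12", "item_57", "item_98"],
]

# Skip-gram (sg=1) tends to work well for sparse item co-occurrence data.
model = Word2Vec(
    sentences=sequences,
    vector_size=128,  # embedding_size
    window=5,         # window_size
    sg=1,
    min_count=1,
    epochs=20,
)

# Nearest neighbors in embedding space act as "similar item" recommendations.
print(model.wv.most_similar("item_98", topn=3))
```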

3. SASRec (Self-Attentive Sequential Recommendation)

  • Concept: Applies a Transformer encoder with causal (left-to-right) self-attention over the interaction history to predict the next item (a minimal sketch follows this list).
  • Pros: Effectively captures long-range dependencies, parallelizable, strong performance.
  • Cons: Computationally intensive, hyperparameter sensitive.
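
Below is a stripped-down PyTorch sketch of the SASRec idea: item plus position embeddings, a causally masked Transformer encoder, and next-item scores from a dot product with the item embedding table. Dimensions and class names are illustrative; a real implementation adds dropout, padding masks, and careful negative sampling.

```python
import torch
import torch.nn as nn

class TinySASRec(nn.Module):
    def __init__(self, num_items, hidden=128, heads=4, layers=2, max_len=50):
        super().__init__()
        self.item_emb = nn.Embedding(num_items + 1, hidden, padding_idx=0)  # 0 = padding
        self.pos_emb = nn.Embedding(max_len, hidden)
        layer = nn.TransformerEncoderLayer(
            d_model=hidden, nhead=heads, dim_feedforward=hidden * 4, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, item_ids):  # item_ids: (batch, seq_len)
        seq_len = item_ids.size(1)
        positions = torch.arange(seq_len, device=item_ids.device)
        x = self.item_emb(item_ids) + self.pos_emb(positions)
        # Causal mask: position t may only attend to positions <= t.
        causal = torch.triu(
            torch.full((seq_len, seq_len), float("-inf"), device=item_ids.device),
            diagonal=1,
        )
        h = self.encoder(x, mask=causal)
        # Score every catalog item as the potential next item at each position.
        return h @ self.item_emb.weight.T  # (batch, seq_len, num_items + 1)

model = TinySASRec(num_items=1000)
seqs = torch.randint(1, 1001, (8, 50))            # toy batch of item ID sequences
logits = model(seqs)
loss = nn.CrossEntropyLoss(ignore_index=0)(
    logits[:, :-1].reshape(-1, logits.size(-1)),  # prediction from prefix ending at t
    seqs[:, 1:].reshape(-1),                      # target: the item at t + 1
)
```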

4. BERT4Rec (Bidirectional Encoder Representations from Transformers for Recommendation)

  • Concept: Adapts BERT, using bidirectional self-attention and a masked-item prediction objective: randomly masked items are predicted from their surrounding context (see the sketch after this list).
  • Pros: Learns rich item representations using bidirectional context.
  • Cons: Less natural for strict next-item prediction inference; training objective differs from a typical recommender use-case.
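
The training objective is easiest to see in the data preparation. Here's a small sketch of the Cloze-style masking BERT4Rec uses: a fraction of items in each sequence is replaced with a special [MASK] token and the model learns to recover them; at inference time a single mask is appended to the end of the history to get a next-item prediction. The masking ratio and token IDs are illustrative.

```python
import random

MASK_ID = 0        # reserved token ID for [MASK] (illustrative choice)
MASK_PROB = 0.2    # fraction of positions to mask; a common starting point

def mask_sequence(item_ids, mask_prob=MASK_PROB, mask_id=MASK_ID, seed=None):
    """Return (inputs, labels): inputs have some items replaced by [MASK];
    labels keep the original item at masked positions and -100 elsewhere,
    so unmasked positions are ignored by the loss."""
    rng = random.Random(seed)
    inputs, labels = [], []
    for item in item_ids:
        if rng.random() < mask_prob:
            inputs.append(mask_id)
            labels.append(item)
        else:
            inputs.append(item)
            labels.append(-100)
    return inputs, labels

print(mask_sequence([12, 98, 34, 57, 7], seed=42))

# Inference: append one [MASK] and predict the item behind it.
history = [12, 98, 34, 57]
inference_input = history + [MASK_ID]
```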

5. GSASRec (Generalized Self-Attentive Sequential Recommendation)

  • Concept: An improvement over SASRec that reduces the overconfidence introduced by training with sampled negatives (see the loss sketch after this list).
  • Pros: Better calibrated and potentially more reliable predictions.
  • Cons: Slightly more complex training setup than SASRec.
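
For intuition, here's a hedged PyTorch sketch of the kind of calibrated loss gSASRec proposes (gBCE): ordinary binary cross-entropy over sampled negatives, but with the positive probability raised to a power beta derived from the negative sampling rate, so predicted scores stay better calibrated. The parameter names and the default t are illustrative; see the paper for the exact formulation.

```python
import torch
import torch.nn.functional as F

def gbce_loss(pos_scores, neg_scores, num_items, num_negatives, t=0.75):
    """Sketch of a gBCE-style loss.
    pos_scores: (batch,) logits for the true next item.
    neg_scores: (batch, num_negatives) logits for sampled negatives.
    t in [0, 1] controls how strongly to correct for negative sampling."""
    alpha = num_negatives / (num_items - 1)           # negative sampling rate
    beta = alpha * (t * (1 - 1 / alpha) + 1 / alpha)  # t=0 -> plain BCE, t=1 -> full correction
    # log(sigmoid(s) ** beta) = beta * log(sigmoid(s)), computed stably.
    pos_term = beta * F.logsigmoid(pos_scores)
    neg_term = F.logsigmoid(-neg_scores).sum(dim=-1)  # log(1 - sigmoid(s))
    return -(pos_term + neg_term).mean()

# Toy usage
pos = torch.randn(32)
neg = torch.randn(32, 16)
print(gbce_loss(pos, neg, num_items=10_000, num_negatives=16))
```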

6. Generative Recommenders (GRs) & HSTU

  • Concept (Paradigm Shift): Treats sequences of user actions (clicks, views, purchases across different features) as a language. Instead of just predicting the next item, it aims to generate plausible future sequences, tackling ranking and retrieval tasks within this generative framework. (Inspired by Zhai et al., ICML'24)
  • Addressing Challenges: Acknowledges key difficulties distinct from NLP:
    • Feature Complexity: Handles different data types (categorical, numerical, sparse, dense) through feature sequentialization, merging fast/slow timelines and potentially dropping inferred dense features (a toy illustration follows this list).
    • Vocabulary Explosion: Deals with billions of dynamic user/item IDs.
    • Computational Scale: Requires architectures optimized for potentially trillions of interaction tokens daily.
  • HSTU (Hierarchical Sequential Transduction Unit): A novel Transformer encoder architecture proposed for GRs. Each HSTU layer condenses typical DLRM stages (feature extraction, interaction, transformation) into repeatable sub-layers (Pointwise Projection, Spatial Aggregation, Pointwise Transformation) for efficiency at scale.
  • M-FALCON: An efficient inference algorithm that uses microbatching and caching to amortize computation across ranking candidates, making large GR models practical to serve.
  • Key Findings: GRs/HSTU demonstrated superior performance to previous models (like SASRec) on benchmarks and achieved significant engagement gains (+12.4%) in large-scale production A/B tests at Meta. Crucially, unlike traditional DLRMs which often plateau, GRs showed improved performance scaling with increased data and compute.
  • Pros: Represents the state-of-the-art, potential for deeper understanding of user trajectories, overcomes scaling limitations of older DLRMs, powerful generative capabilities.
  • Cons: Extremely computationally expensive, complex architecture, evaluation metrics still evolving for generative tasks.
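
To give a flavour of feature sequentialization, here's a toy sketch that merges a fast-moving interaction timeline with a slow-moving profile-feature timeline into one chronologically ordered token stream, the kind of unified sequence a Generative Recommender is trained on autoregressively. The token scheme and field names are invented for illustration and don't reflect any production implementation.

```python
from dataclasses import dataclass
import heapq

@dataclass
class Event:
    timestamp: float
    kind: str    # e.g. "click", "purchase", "profile_update" (illustrative)
    value: str   # item ID or feature value

def sequentialize(interaction_events, profile_events):
    """Merge a fast timeline (interactions) and a slow timeline (profile /
    feature changes) into a single ordered token sequence."""
    merged = heapq.merge(interaction_events, profile_events, key=lambda e: e.timestamp)
    # Each event becomes one token; a real system maps these strings into a
    # very large, dynamic vocabulary of IDs.
    return [f"{e.kind}:{e.value}" for e in merged]

interactions = [
    Event(1.0, "click", "item_12"),
    Event(3.0, "purchase", "item_12"),
    Event(4.0, "click", "item_98"),
]
profile = [Event(2.0, "profile_update", "country=US")]

print(sequentialize(interactions, profile))
# ['click:item_12', 'profile_update:country=US', 'purchase:item_12', 'click:item_98']
# A GR then learns to generate the next tokens in this stream rather than
# just scoring a fixed candidate item.
```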

Building From Scratch: The Challenges

Implementing sequential models, especially advanced ones, involves hurdles:

  • Data Preprocessing: Handling sequences, padding/truncation, masking, and feature sequentialization (for GRs); see the small sketch after this list.
  • Computational Cost: Significant GPU resources needed, especially for large Transformers and GRs.
  • Hyperparameter Tuning: Complex models have many parameters requiring optimization.
  • Evaluation Metrics: Defining appropriate metrics beyond simple next-item accuracy, especially for generative models.
  • Cold Start: Still challenging for new users with no history.
  • Scalability (GRs): While GRs scale better with compute, the absolute compute required is enormous.
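
As a small example of the preprocessing step mentioned above, here's a sketch of left-truncating and left-padding variable-length histories to a fixed max_seq_length, which is what most Transformer-based recommenders expect as input. The padding ID and length are illustrative.

```python
PAD_ID = 0            # reserved padding token (illustrative)
MAX_SEQ_LENGTH = 50   # matches the max_seq_length used in the configs below

def pad_or_truncate(item_ids, max_len=MAX_SEQ_LENGTH, pad_id=PAD_ID):
    """Keep the most recent max_len items; left-pad shorter histories so the
    latest interaction always sits at the final position."""
    recent = item_ids[-max_len:]
    return [pad_id] * (max_len - len(recent)) + recent

print(pad_or_truncate([7, 12, 98, 34]))        # 46 pads followed by the history
print(len(pad_or_truncate(list(range(200)))))  # 50
```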

Sequential Models in Practice: The Shaped Approach

Shaped provides managed implementations for many sequential models, simplifying deployment:

  • N-Gram:

```yaml
model:
    name: ngram-session-recs
    policy_configs:
        scoring_policy: { policy_type: ngram, n: 3, laplace_smoothing: 0.01 }
```
  • Item2Vec:

```yaml
model:
    name: item2vec-embeddings
    policy_configs:
        embedding_policy: { policy_type: item2vec, embedding_size: 128, window_size: 5 }
```
  • SASRec / BERT4Rec / GSASRec:

```yaml
model:
    name: transformer-sequential-recs
    policy_configs:
        scoring_policy:
            policy_type: sasrec
            hidden_size: 128
            n_heads: 4
            n_layers: 2
            max_seq_length: 50
            # ... other training params
```
  • Generative / HSTU-like Models:

```yaml
model:
    name: generative-future-predictor
    policy_configs:
        scoring_policy:
            policy_type: hstu-generative  # Hypothetical policy type for a GR
            # Core Transformer params
            hidden_size: 512              # Likely larger dimensions
            n_heads: 8
            n_layers: 12                  # Likely deeper models
            max_seq_length: 1024          # Handling longer sequences
            # Generative specific params (example)
            generation_mode: ranking      # Or 'retrieval', 'simulation'
            beam_size: 5                  # For beam search generation
            # Feature sequentialization / architecture params (example)
            feature_aggregation: spatial  # Reflecting HSTU structure
            # Training params
            batch_size: 1024              # Large batch sizes typical
            learning_rate: 0.0001
```

Shaped manages the underlying complexities, making these powerful sequential approaches accessible.

Future Research Directions

Sequential recommendation is white-hot:

  • Generative Models: Refining generation control, efficiency, and evaluation. Exploring different sequence-to-sequence architectures.
  • Scaling Laws: Further understanding and exploiting the scaling properties of different architectures (like HSTU).
  • LLMs for RecSys: Effectively leveraging giant pre-trained language (and multimodal) models for sequence understanding.
  • Context & Causality: Incorporating richer real-time context and moving towards causal understanding of sequential choices.
  • Efficiency & Real-time: Developing lighter, faster models suitable for on-device or extremely low-latency scenarios.
  • Multi-Task & Reinforcement Learning: Training sequence models for multiple objectives or using RL to optimize long-term user engagement.

Conclusion: Predicting the Now and the Next

Sequential recommendation models are essential for capturing user dynamics. The journey from simple N-Grams to attention-based Transformers (SASRec, BERT4Rec) represented a huge leap. Now, the emergence of Generative Recommenders (like those powered by HSTU) signifies another potential paradigm shift, treating user actions as a language and breaking previous scaling barriers.

While complexity and computational costs increase, the ability to model sequences with ever-greater fidelity unlocks more timely, relevant, and engaging personalized experiences. Platforms like Shaped are crucial in democratizing access to these cutting-edge techniques, paving the way for the next generation of recommendation systems.

Further Reading / References

  • Rendle, S., et al. (2010). Factorizing personalized Markov chains for next-basket recommendation. WWW. (FPMC)
  • Barkan, O., & Koenigstein, N. (2016). Item2vec: neural item embedding for collaborative filtering. arXiv. (Item2Vec)
  • Kang, W. C., & McAuley, J. (2018). Self-attentive sequential recommendation. ICDM. (SASRec)
  • Sun, F., et al. (2019). BERT4Rec: Sequential recommendation with bidirectional encoder representations from transformer. CIKM. (BERT4Rec)
  • Petrov, A., & Macdonald, C. (2023). gSASRec: Reducing overconfidence in sequential recommendation trained with negative sampling. RecSys. (gSASRec)
  • Zhai, J., et al. (2024). Actions Speak Louder than Words: Trillion-Parameter Sequential Transducers for Generative Recommendations. ICML. (GR / HSTU paper)

Ready to build dynamic recommendations that understand user sequences?

Request a demo of Shaped today to see how our sequential models can capture user intent in real-time. Or, start exploring immediately with our free trial sandbox.
