Bringing Collaborative Filtering to LLMs with AdaptRec

TL;DR: LLMs are powerful, but making them use collaborative filtering (CF) signals effectively for sequential recommendations is tricky. AdaptRec proposes a self-adaptive framework where the LLM actively selects which similar users to learn from (via a "User-based Similarity Retrieval Prompt") and then uses their histories as demonstrations in a "User-Contextualized Recommendation Prompt." Results show significant HR@1 improvements (7-18%) over traditional and other LLM-based methods, especially in few-shot scenarios.

The integration of Large Language Models (LLMs) into sequential recommendation systems (LLM4SRec) is a hotbed of research right now. We're all seeing the potential of LLMs' world knowledge and in-context learning abilities. However, a core tenet of traditional RecSys is leveraging collaborative signals – learning from the behavior of similar users. Translating these rich, often numerical or ID-based collaborative signals into a format that LLMs can understand and truly reason with remains a significant hurdle.

The authors of AdaptRec: A Self-Adaptive Framework for Sequential Recommendations with Large Language Models by Zhang, Wang, Li, et al. pinpoint a key issue: while some recent approaches try to feed similar user sequences into LLM prompts, the selection of these "demonstrations" is often based on static, numerical similarity (like cosine similarity on embeddings) and doesn't truly align with the LLM's own reasoning process. It's like giving a brilliant student a pile of notes without teaching them how to pick the most relevant ones for a new problem.

AdaptRec proposes a "self-adaptive prompting framework" to bridge this gap. The core idea is to make the LLM an active participant in selecting and utilizing collaborative information. It's a multi-stage process designed to be more dynamic and aligned with how LLMs learn.

The Challenge: Making LLMs Understand Collaborative Filtering Intuitively

Traditional sequential recommenders (GRU4Rec, SASRec, etc.) excel at capturing patterns from user-item interaction sequences. LLM-based approaches try to do this by converting interaction sequences into textual prompts. The paper categorizes current LLM4SRec prompt strategies (Figure 1 in their paper is a great visual for this):

  1. User-Agnostic Prompts: General instructions to the LLM on how to recommend, without specific user history. (e.g., "Recommend a romantic movie from this list...")
  2. Single User-Specific Prompts: Focus on the target user's own history, either "isolated" (just their sequence) or "implicit" (using pre-learned embeddings of their history within the prompt).
  3. Multi-User Collaborative Prompts: The emerging area where sequences from other, similar users are incorporated as demonstrations. This is where AdaptRec operates.
Figure 1 from the paper: overview of the three prompt design strategies for sequential recommendation (User-Agnostic, Single User-Specific, and Multi-User Collaborative prompts).
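To make the taxonomy tangible, here are toy template strings for each of the three families; the wording is illustrative and not taken from the paper.

```python
# Toy templates for the three prompt families (illustrative wording, not the paper's exact prompts).
user_agnostic = "Recommend a romantic movie from this candidate list: {candidates}."

single_user_specific = ("The user has watched: {target_history}. "
                        "Recommend the next movie from: {candidates}.")

multi_user_collaborative = ("{similar_user} has watched: {similar_history} and then chose {next_item}. "
                            "The target user has watched: {target_history}. "
                            "Recommend the next movie from: {candidates}.")
```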

The problem with current multi-user prompts, as AdaptRec identifies, includes:

  • Assessing Demonstration Quality: How do you know if a similar user's sequence is actually a good example for the LLM to learn from for this specific target user and this specific recommendation task?
  • Handling Large Search Spaces: Finding the "best" few similar users from millions is computationally intensive.
  • Adapting Demonstration Selection: Similarity isn't static. As the LLM learns or as user preferences evolve, the "best" similar users might change. Existing methods often use a fixed set of similar users.

AdaptRec's Self-Adaptive Framework: A Three-Stage Approach

AdaptRec tackles these challenges with an iterative, three-stage framework (beautifully illustrated in their Figure 2):

Figure 2 from the paper: the self-adaptive user-contextualized sequential recommendation framework. (1) User Similarity Retrieval extracts relevant sequences; (2) Self-Adaptive User Selection refines the similar-user pool; (3) Contextual Prompt-based Recommendation generates personalized suggestions. An iterative feedback mechanism continuously improves user selection and recommendation accuracy.
  1. Stage 1: User Similarity Retrieval (Coarse Filtering)
  • Problem: LLMs have token limits. You can't feed in thousands of user histories.
  • Solution: Start with a traditional, efficient CF step. They compute cosine similarity over embeddings of each user's item-title sequence to find an initial Top-N set of potentially similar users, pruning the vast search space. (A compact sketch of all three stages follows this list.)
  • Equation (2): sim(v, u) = (e_v ⋅ e_u) / (||e_v|| ||e_u||), where e_v and e_u are the sequence embeddings of target user v and candidate user u.
  2. Stage 2: Self-Adaptive User Selection (LLM-Powered Refinement)
  • Problem: The Top-N users from Stage 1 are merely numerically similar. Are they actually useful as demonstrations for the LLM?
  • Solution: This is the clever bit. They introduce a User-based Similarity Retrieval Prompt (Figure 3 in the paper).
Figure 3 from the paper: the User-based Similarity Retrieval Prompt.


This prompt essentially asks the LLM: "Given the target user's watch history and the watch history of these N candidates, rank the candidates by similarity (from the LLM's perspective of how well they'd serve as an example)."

  • The LLM (using its understanding of sequence patterns, item semantics, etc.) re-evaluates these N users and selects a smaller Top-M subset (U₂).
  • Key Idea: The LLM actively participates in choosing its own demonstration examples. This selection process isn't static; as the main recommendation model is fine-tuned (using LoRA for parameter efficiency), its understanding of what constitutes a "good similar user" can evolve, leading to better demonstration selection over time (hence "self-adaptive").
  3. Stage 3: Contextual Prompt-based Recommendation (LLM Does Its Magic)
  • Problem: How to actually use these M refined similar user histories for the target user's recommendation?
  • Solution: They construct a User-Contextualized Recommendation Prompt (Figure 4 in the paper).
Figure 4 from the paper: the User-Contextualized Recommendation Prompt.


This prompt includes:

  • Demonstrations from the M similar users: "{Similar user M} has watched... Based on this, she/he chose {next item}." These are translated into natural language.
  • The target user's history: "#Task: The {Target user} has watched... Please recommend the next movie for this user..."
  • The LLM is then fine-tuned (using LoRA) to predict the next item i_{n+1} based on this rich, collaboratively-informed prompt, minimizing a standard negative log-likelihood loss.
  • Equation (6): P(i_{n+1,t} | [i_1, ..., i_n], {H_u}_{u∈U₂}, i_{n+1,<t}) – the probability of the t-th token of the next item, given the target user's history, the similar users' histories, and the previously generated tokens of the next item.
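To make the three stages concrete, here's a minimal Python sketch of the pipeline. The function names, data structures, and prompt wording are my own simplifications (the paper's exact templates are in its Figures 3 and 4), and the LLM call itself is left abstract.

```python
import numpy as np

# --- Stage 1: coarse retrieval via cosine similarity (Eq. 2) ---
def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve_top_n(target_emb: np.ndarray, candidate_embs: dict, n: int = 20) -> list:
    """Return the IDs of the N users whose sequence embeddings are closest to the target's."""
    scores = {uid: cosine_sim(target_emb, emb) for uid, emb in candidate_embs.items()}
    return sorted(scores, key=scores.get, reverse=True)[:n]

# --- Stage 2: self-adaptive user selection (the LLM re-ranks the retrieved pool) ---
def similarity_retrieval_prompt(target_history: list, candidate_histories: dict) -> str:
    """Ask the LLM to rank candidate users by how useful their histories are as demonstrations."""
    lines = [f"The target user has watched: {', '.join(target_history)}."]
    for uid, hist in candidate_histories.items():
        lines.append(f"Candidate {uid} has watched: {', '.join(hist)}.")
    lines.append("Rank these candidates by how similar their preferences are to the target user's.")
    return "\n".join(lines)

# --- Stage 3: user-contextualized recommendation prompt ---
def recommendation_prompt(target_history: list, demonstrations: dict, candidates: list) -> str:
    """Build the prompt: M similar-user demonstrations, then the target user's task."""
    parts = []
    for uid, (hist, next_item) in demonstrations.items():
        parts.append(f"{uid} has watched: {', '.join(hist)}. Based on this, they chose {next_item}.")
    parts.append(f"#Task: The target user has watched: {', '.join(target_history)}.")
    parts.append(f"Please recommend the next movie for this user from: {', '.join(candidates)}.")
    return "\n".join(parts)
```

Stage 1 is cheap vector math over the whole user base; Stages 2 and 3 only ever see the handful of users that survive it, which is what keeps the prompts within the LLM's context window.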

The "Self-Adaptive" Loop: The crucial part is that the LLM used for Self-Adaptive User Selection (Stage 2) and the LLM used for Contextual Prompt-based Recommendation (Stage 3) are essentially the same underlying model being fine-tuned. As the model gets better at recommendations (Stage 3), its ability to judge which users are truly "similar" in a way that's helpful for its own reasoning process (Stage 2) also improves. This creates an iterative refinement loop.

Experimental Setup and Results (Section 5)

AdaptRec was evaluated on MovieLens100K, LastFM, and GoodReads datasets against:

  • Traditional Sequential Recommenders: GRU4Rec, Caser, SASRec.
  • LLM-based Models:
    • Vanilla LLMs: Llama2-7B, GPT-4 (zero-shot).
    • Specialized LLM Recommenders: MoRec (encodes item modality), LLaRA (hybrid prompting with item textual/behavioral signals).
    • AdaptRec also uses Llama2-7B as its base.

Key Findings (RQ1 - Overall Performance, Table 2):

Table 2 from the paper: performance comparison of different methods; note AdaptRec's HR@1 scores and the "Improv." row.
  • AdaptRec consistently outperforms all baselines across HR@1, NDCG@5, and NDCG@20 on all three datasets.
  • HR@1 improvements over the next best method (LLaRA; a quick arithmetic check follows this list):
    • MovieLens: +7.13% (AdaptRec 0.4736 vs LLaRA 0.4421)
    • LastFM: +18.16% (AdaptRec 0.5327 vs LLaRA 0.4508)
    • GoodReads: +10.41% (AdaptRec 0.4432 vs LLaRA 0.4014)
  • Traditional models lag significantly, highlighting the benefit of LLMs' semantic understanding.
  • Vanilla LLMs (Llama2, GPT-4) show limitations, especially Llama2 in controlled generation (low ValidRatio). Specialized LLM recommenders (MoRec, LLaRA) improve stability but are still beaten by AdaptRec. This suggests that simply applying LLMs or doing basic LLM enhancement isn't enough; how collaborative signals are integrated matters.
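As a quick sanity check, the reported gains are just relative HR@1 differences against LLaRA; reproducing the arithmetic from the numbers quoted above (expect small rounding differences versus the paper's table):

```python
# HR@1 of AdaptRec vs. the strongest baseline (LLaRA), as quoted above from Table 2.
hr1 = {"MovieLens": (0.4736, 0.4421), "LastFM": (0.5327, 0.4508), "GoodReads": (0.4432, 0.4014)}
for dataset, (adaptrec, llara) in hr1.items():
    print(f"{dataset}: +{100 * (adaptrec - llara) / llara:.2f}% HR@1 over LLaRA")
```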

Ablation Studies & Deeper Insights:

  • Effectiveness of User Similarity Retrieval (RQ2, Table 3): Replacing AdaptRec's coarse-grained retrieval (Stage 1) with random user sampling causes a massive performance drop (e.g., HR@1 on MovieLens falls from 0.4736 to 0.3224). Simply increasing the candidate pool size without smart filtering is detrimental.
Table 3 from the paper: ablation study on AdaptRec, with rows for "w/o retrieval", "w/o self-adaptive", and "w/o demo".
  • Impact of Self-Adaptive User Selection (RQ3, Figure 7): Comparing Self-Adaptive selection (Stage 2) against just randomly sampling from the retrieved pool ("Static Selection") shows consistent improvements for Self-Adaptive across different numbers of demonstrations (M=1,3,5,7,9). The LLM choosing its own examples is better than random choice from a pre-filtered set.
Figure 7 from the paper: performance comparison of static and adaptive selection strategies on the MovieLens dataset.
  • Impact of User-Based Contextual Prompts (RQ4, Table 4 & Figure 6):
  • Introducing demonstrations (even just M=1) significantly boosts performance over a baseline without any demonstrations.
  • Performance exhibits an inverted U-shape with the number of demonstrations (M). For these datasets, M=5 demonstrations was optimal. Too few, and there's not enough collaborative signal. Too many, and the LLM might get confused by excessive or noisy contextual information, hindering its ability to focus on the most relevant patterns. HR@1 drops sharply when increasing demos from 5 to 7 on LastFM and MovieLens.
Table 4 and Figure 6 from the paper (arXiv:2504.08786v1): impact of varying demonstration numbers on HR@1, and performance with different numbers of demonstrations, across the three datasets.
  • Few-Shot Performance (GPT-4 without fine-tuning, Table 5): AdaptRec's User-Contextualized Prompt (UCP), even without fine-tuning (using GPT-4 zero-shot with UCP), significantly outperforms Basic Recommendation Prompts (BRP) and Chain-of-Thought (CoT) prompts.
  • MovieLens: UCP (0.2460 HR@1) vs CoT (0.2240) vs BRP (0.2000) -> +23% over BRP, +9.82% over CoT.
  • This shows the strength of providing explicit, relevant collaborative examples even for powerful, un-tuned LLMs.
Table 5 from the paper: performance comparison of UCP, CoT, and BRP across datasets.
Figure from the paper: illustration of the User-Contextualized Prompt, Chain-of-Thought Prompt, and conventional prompt.
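The few-shot result is easy to try yourself: the UCP idea carries over directly to an off-the-shelf model. Below is a minimal zero-shot sketch with the OpenAI Python SDK; the histories, candidate list, and prompt wording are made up for illustration and only loosely follow the paper's template.

```python
from openai import OpenAI  # assumes the openai Python SDK; any chat-LLM client would do

client = OpenAI()

# Illustrative UCP: a few similar-user demonstrations, then the target user's task.
ucp = (
    "User A has watched: Titanic, The Notebook. Based on this, they chose Pride and Prejudice.\n"
    "User B has watched: The Notebook, Pride and Prejudice. Based on this, they chose The Lord of the Rings.\n"
    "#Task: The target user has watched: Titanic, The Notebook, Pride and Prejudice.\n"
    "Please recommend the next movie for this user from the candidate list: "
    "The Lord of the Rings, Love Actually, Gladiator."
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": ucp}],
)
print(response.choices[0].message.content)
```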

The qualitative analysis in Figure 8 is particularly insightful, showing how UCP can lead to non-obvious, cross-genre recommendations (e.g., recommending Lord of the Rings after Titanic, Notebook, Pride and Prejudice) because similar users exhibited that pattern, whereas BRP sticks to genre matching and CoT might over-rationalize.

Key Strengths & Implications of AdaptRec:

  1. LLM-in-the-Loop for Demonstration Selection: This is the core novelty. Instead of just feeding pre-selected similar users, the LLM actively refines this selection. This makes the collaborative signal more "digestible" and aligned with the LLM's reasoning.
  2. Dynamic and Adaptive: The framework allows the model's understanding of "similarity" to evolve during training, potentially leading to more nuanced and effective collaborative filtering over time.
  3. Strong Empirical Results: The significant improvements over both traditional methods and other LLM-based approaches, especially in HR@1 (which is critical for many RecSys applications), are compelling.
  4. Effectiveness in Few-Shot Scenarios: The UCP prompting strategy shows value even when full fine-tuning isn't feasible, making it relevant for quickly leveraging powerful foundation models.
  5. Addressing a Core LLM4SRec Challenge: It provides a structured and effective way to inject explicit collaborative signals into LLMs, which often struggle with the implicit, ID-based nature of traditional CF.

Some Limitations Acknowledged by Authors:

  • Reduced effectiveness with multilingual content (seen on MovieLens with its diverse titles).
  • Computational overhead of the iterative self-adaptive framework (though LoRA helps).
  • ValidRatio (ensuring LLM generates items from a candidate set) is high (0.96) but not perfect like traditional models.

Final Thoughts

AdaptRec presents a thoughtful and effective step forward in making LLMs truly collaborative recommenders. By empowering the LLM to guide the selection of its own demonstration examples from similar users, the framework fosters a tighter alignment between the rich signals of collaborative filtering and the powerful reasoning capabilities of LLMs. The results, particularly the HR@1 gains and the strong few-shot performance, suggest this self-adaptive, context-aware prompting is a promising direction for the next generation of sequential recommendation systems. It moves beyond simply using LLMs as sequence encoders or text processors and starts to leverage their potential as active reasoning agents within the recommendation loop.

For those working on LLM-powered recommendations, the idea of making the LLM itself a part of the data selection/curation process for its own few-shot learning or fine-tuning is a powerful concept to explore.
