Recommendation systems are fundamental to modern digital platforms. They curate the vast content users encounter daily on social media, streaming services, and e-commerce sites.
These systems engage users by providing personalized experiences that align with their interests and behaviors. However, managing and optimizing these systems for platforms with over 1 billion members globally presents monumental challenges. The sheer scale necessitates sophisticated techniques not only in model architecture but also in efficient training, compression, and deployment to production.
The delicate balance between relevance and novelty lies at the core of large-scale recommendation systems. Traditional methods often rely heavily on past user behavior, which, while ensuring relevance, can reinforce existing preferences and create feedback loops that limit exploration.
Introducing novel content is vital for long-term engagement but risks frustrating users with irrelevant suggestions. Advanced ranking systems are designed to navigate this trade-off, offering personalized content while introducing fresh, engaging experiences.
Recent advancements have seen the widespread adoption of deep learning (DL) and neural network models, which enable more complex and nuanced recommendations by processing vast amounts of data and capturing intricate patterns. Simultaneously, the emergence of large language models (LLMs) has opened new avenues for exploration, leveraging their world knowledge to suggest novel content beyond users' established preferences.
Let’s explore how these advancements are reshaping feed ranking: the key techniques, practical deployment lessons, and the opportunities and challenges they bring, particularly as demonstrated by LinkedIn's large-scale LiRank framework and recent work on LLM-powered exploration.
The Evolution of Ranking Models
Feed ranking systems have evolved significantly from their simpler origins. Early methods like collaborative filtering (recommending based on similar users' preferences) and content-based recommendations (recommending based on item attributes matched to user preferences) were foundational. Still, they struggled with scale, cold-start problems, and limited diversity.
The shift to deep learning allowed for models that could handle more data, automatically extract features, and capture complex interactions. A pivotal moment was the success of the Wide&Deep model in 2016, which combined a generalized linear model to capture explicit feature interactions with an MLP network for implicit interactions. This spurred subsequent research focused on enhancing feature interaction capabilities.
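The Wide&Deep idea fits in a few lines. Below is a schematic numpy sketch under our own assumptions (toy dimensions and random weights, not the original TensorFlow implementation): a linear "wide" part over explicit cross-product features, plus a small "deep" MLP over dense inputs, summed into one logit.

```python
import numpy as np

def wide_and_deep(wide_x, deep_x, w_wide, W1, b1, w_out, b_out):
    """Schematic Wide&Deep forward pass (illustrative only)."""
    wide_logit = wide_x @ w_wide                 # explicit, memorized interactions
    hidden = np.maximum(deep_x @ W1 + b1, 0.0)   # ReLU MLP: implicit interactions
    deep_logit = hidden @ w_out + b_out
    return 1.0 / (1.0 + np.exp(-(wide_logit + deep_logit)))

rng = np.random.default_rng(0)
p = wide_and_deep(
    wide_x=rng.normal(size=20), deep_x=rng.normal(size=16),
    w_wide=rng.normal(size=20) * 0.1,
    W1=rng.normal(size=(16, 8)) * 0.1, b1=np.zeros(8),
    w_out=rng.normal(size=8), b_out=0.0,
)
```

The key design choice is that the two parts share one output: the wide part memorizes known feature crosses while the deep part generalizes to unseen combinations.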
LinkedIn's experience with large-scale ranking models, captured in the LiRank framework, involved experimenting with and customizing a range of advanced architectures across tasks including Feed ranking, Job Recommendations, and Ads CTR prediction.

This process involved substantial effort and painstaking trial-and-error to integrate state-of-the-art (SOTA) architectures into production effectively. Models like DeepFM, DCN, DCNv2, xDeepFM, AutoInt, AFN, and FinalMLP were explored. Among these, DCNv2 proved to be the most versatile for LinkedIn's recommender tasks, leading to the development of Residual DCN as an enhancement.
Key Techniques in Modern Feed Ranking (LiRank)
The LiRank framework integrates several deep learning-based techniques to personalize content at massive scale.
Residual DCN and Attention Layers for Feature Interactions
LinkedIn utilized DCNv2 to automatically capture feature interactions. However, DCNv2 added many parameters, especially with large feature input dimensions.
LinkedIn adopted two strategies to enhance efficiency: replacing the weight matrix with "skinny matrices" resembling a low-rank approximation, and reducing the input feature dimension by replacing sparse one-hot features with embedding-table look-ups, yielding a nearly 30% reduction.
To further improve DCNv2's power, Residual DCN introduced an attention schema in the low-rank cross net. The original low-rank mapping is duplicated for query, key, and value matrices, and an attention score matrix is computed and inserted.
Adding a skip connection was beneficial for learning more complicated feature correlations while maintaining stable training. Paralleling a low-rank cross net with an attention low-rank cross net produced a statistically significant improvement on the feed ranking task.
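The "skinny matrix" idea can be sketched as follows. This is a minimal numpy illustration of one low-rank DCNv2-style cross layer under our own assumptions (the attention variant additionally duplicates the low-rank mapping into query, key, and value projections, which is omitted here):

```python
import numpy as np

def low_rank_cross_layer(x0, x, U, V, b):
    """x_next = x0 * (U @ (V.T @ x) + b) + x, with W = U @ V.T of rank r.
    The elementwise product with x0 creates explicit feature crosses;
    the trailing + x is the skip connection that stabilizes training."""
    return x0 * (U @ (V.T @ x) + b) + x

d, r = 64, 8
rng = np.random.default_rng(4)
U = rng.normal(size=(d, r)) * 0.1
V = rng.normal(size=(d, r)) * 0.1
b = np.zeros(d)
x0 = rng.normal(size=d)
out = low_rank_cross_layer(x0, x0, U, V, b)
print(f"full-rank params: {d * d}, low-rank params: {2 * d * r}")  # 4096 vs 1024
```

Replacing the d x d matrix with two d x r factors cuts the layer's parameters from d^2 to 2dr, which is where the efficiency gain comes from.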

Isotonic Calibration Layer in DNN Model
Accurate model calibration, ensuring estimated probabilities align with real-world occurrences, is crucial for business success and directly affects downstream decisions such as Ads pricing.
Traditionally, calibration (using methods like Histogram binning, Platt Scaling, and Isotonic Regression) is performed post-processing after model training. These methods have limitations for deep neural networks, including parameter space constraints and scalability issues when incorporating multiple features.
LinkedIn addressed this by developing a customized isotonic calibration layer that can be used as a native neural network layer and is co-trained jointly within the deep learning model.
This layer bucketizes predicted values (converted back to logits) and assigns trainable non-negative weights for each bucket, which are updated during training. Weights can be combined with embedding representations derived from calibration features to enhance power with multiple features.
Directly incorporating this layer improves model predictive accuracy significantly and ensures more accurate predictions in production.
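A toy forward pass conveys the mechanism. In this sketch (our own simplified reading, not LinkedIn's implementation), the predicted logit is bucketized, each bucket carries a trainable non-negative weight, and the calibrated logit is a cumulative sum of those weights, which makes the mapping monotone by construction:

```python
import numpy as np

def isotonic_calibrate(logits, bucket_edges, raw_weights, bias=0.0):
    """Isotonic calibration layer sketch: bucketize logits, then map each
    bucket to a cumulative sum of non-negative weights (monotone)."""
    weights = np.exp(raw_weights)                 # exp keeps weights >= 0
    idx = np.searchsorted(bucket_edges, logits)   # bucket index per example
    cum = np.concatenate([[0.0], np.cumsum(weights)])
    calibrated_logit = bias + cum[idx]
    return 1.0 / (1.0 + np.exp(-calibrated_logit))

edges = np.linspace(-4, 4, 16)          # 16 edges -> 17 buckets
rng = np.random.default_rng(3)
logits = np.sort(rng.normal(size=100))  # sorted, to check monotonicity
out = isotonic_calibrate(logits, edges, rng.normal(size=16), bias=-1.0)
```

Because the weights are trainable, this layer can be co-trained with the rest of the network, and the cumulative-sum construction is what preserves the isotonic (order-preserving) property during training.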
Embedding Optimization: Quantization and Vocabulary Compression
Embedding tables, often comprising over 90% of a large model's size, can become bottlenecks due to large vocabulary and embedding dimensions, leading to storage and inference issues.
Vocabulary compression methods are employed to reduce memory usage. The traditional approach uses static hashtables to map string IDs to integers, which can consume up to 30% of memory and are inefficient with continuous training requiring updates.
QR hashing offers a solution by decomposing large matrices into smaller ones using quotient and remainder techniques, preserving embedding uniqueness. This significantly reduces the number of rows needed.
QR hashing is compatible with collision-resistant hashing like MurmurHash, potentially eliminating vocabulary maintenance and resolving the Out-of-Vocabulary (OOV) problem by generating embeddings for every training item ID. This technique achieved a 5x reduction in the number of model parameters for Jobs Recommendations without performance loss.
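A minimal sketch of the QR trick (our own illustration with toy table sizes): two small tables indexed by the quotient and remainder of the ID replace one huge table, and the combination stays unique per ID.

```python
import numpy as np

def qr_embedding_lookup(item_id, q_table, r_table):
    """QR hashing lookup: decompose item_id into (quotient, remainder)
    and combine one row from each small table. Every id in
    [0, q_rows * r_rows) gets a distinct (q, r) pair."""
    r_rows = r_table.shape[0]
    q, r = divmod(item_id, r_rows)
    return q_table[q] * r_table[r]   # elementwise product as the combiner

rng = np.random.default_rng(0)
dim = 8
# A 1,000,000-ID vocabulary covered by 1,000 + 1,000 rows instead of 1,000,000.
q_table = rng.normal(size=(1000, dim))
r_table = rng.normal(size=(1000, dim))
e = qr_embedding_lookup(123_456, q_table, r_table)
```

Other combiners (sum, concatenation) are possible; the parameter saving comes from storing roughly 2*sqrt(V) rows instead of V.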
Embedding table quantization reduces embedding precision and overall model size. For example, 8-bit row-wise min-max or middle-max quantization can reduce table size by over 70% while maintaining performance and inference speed without extra training or calibration data.
Middle-max quantization is preferred as embedding values often follow a normal distribution, concentrating values in the middle and reducing quantization errors. It also ensures reversible integer casting operations. This technique led to a +0.9% CTR relative improvement in Ads CTR prediction, potentially due to smoothing decision boundaries and improving generalization.
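To make the size saving concrete, here is a sketch of the simpler row-wise min-max variant (an assumption-level illustration; the middle-max variant the text prefers shifts the quantization grid toward the center of the value distribution): each row stores uint8 codes plus a per-row minimum and scale.

```python
import numpy as np

def quantize_rowwise_minmax(table):
    """8-bit row-wise min-max quantization: map each embedding row to
    uint8 codes plus a per-row (min, scale) pair."""
    row_min = table.min(axis=1, keepdims=True)
    row_max = table.max(axis=1, keepdims=True)
    scale = (row_max - row_min) / 255.0
    scale = np.where(scale == 0, 1.0, scale)   # guard constant rows
    codes = np.round((table - row_min) / scale).astype(np.uint8)
    return codes, row_min, scale

def dequantize(codes, row_min, scale):
    return codes.astype(np.float32) * scale + row_min

rng = np.random.default_rng(1)
table = rng.normal(size=(1000, 64)).astype(np.float32)
codes, mn, sc = quantize_rowwise_minmax(table)
recon = dequantize(codes, mn, sc)
```

Going from float32 to uint8 is a 4x reduction on the codes themselves (75%, before the small per-row min/scale overhead), consistent with the >70% figure above, and the reconstruction error is bounded by half a quantization step per element.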

Explore/Exploit with Bayesian Methods
Balancing exploration (recommending novel content) and exploitation (leveraging past behavior) is a fundamental challenge. Traditional methods like Upper Confidence Bounds (UCB) and Thompson sampling are difficult to apply efficiently to large deep neural network models.
LinkedIn tackled this using a method similar to the Neural Linear approach: Bayesian linear regression is performed on the weights of the last layer of the neural network, and the resulting posterior over those weights is fed into Thompson sampling.
These updates are done incrementally at the end of each offline training period, allowing the timely capture of new information. This approach has helped manage the exploration vs. exploitation dilemma, improving long-term user engagement.
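A minimal Neural Linear sketch, under our own assumptions (toy data, known closed-form Gaussian posterior; not LinkedIn's production code): the final-layer features phi(x) act as inputs to a Bayesian linear regression, and Thompson sampling draws one weight vector from the posterior to rank candidates.

```python
import numpy as np

def posterior(phi, y, noise_var=1.0, prior_var=10.0):
    """Closed-form Gaussian posterior for linear regression weights."""
    d = phi.shape[1]
    precision = np.eye(d) / prior_var + phi.T @ phi / noise_var
    cov = np.linalg.inv(precision)
    mean = cov @ phi.T @ y / noise_var
    return mean, cov

def thompson_select(candidates_phi, mean, cov, rng):
    """Thompson sampling: draw one plausible weight vector and exploit it."""
    w = rng.multivariate_normal(mean, cov)
    return int(np.argmax(candidates_phi @ w))

rng = np.random.default_rng(2)
true_w = np.array([1.0, -2.0, 0.5])
phi = rng.normal(size=(500, 3))                # stand-in last-layer features
y = phi @ true_w + 0.1 * rng.normal(size=500)  # observed engagement signal
mean, cov = posterior(phi, y)
pick = thompson_select(rng.normal(size=(10, 3)), mean, cov, rng)
```

Because only the last layer is treated Bayesianly, the posterior update is cheap enough to rerun incrementally after each offline training period, as described above.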
Multi-task Learning (MTL)
MTL is crucial for simultaneously optimizing various ranking criteria, such as user engagement metrics, content relevance, and personalization. LinkedIn explored multiple architectures, including Hard Parameter Sharing, Grouping Strategy, MMoE, and PLE.
The Grouping Strategy, where tasks are grouped based on similarity (e.g., 'Like' and 'Contribution' together), showed modest improvement with only slightly increased parameters.
While MMoE and PLE offered significant performance boosts (+1.19%, +1.34% Contributions respectively), they increased parameter count substantially (3x-10x), posing challenges for large-scale online deployment. The Grouping Strategy contributed +0.75% Contributions.
Dwell Time Modeling
Understanding how long members interact with content provides valuable insights. LinkedIn introduced a 'long dwell' signal to detect passive, positive engagement. Challenges included noisy data, difficulty setting static thresholds, and potential bias towards content with longer dwell times.
The solution was a binary classifier predicting if dwell time exceeded a context-dependent percentile (e.g., 90th percentile), determined based on contextual features like ranking position and content type.
This approach operates within a Multi-task multi-class framework, resulting in relative improvements in overall time spent (+0.8%), time spent per post (+1%), and member sessions (+0.2%).
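The labeling rule can be sketched as follows (a hypothetical illustration of the idea: the real system conditions on contextual features such as ranking position and content type, here collapsed into a single toy context ID):

```python
import numpy as np

def long_dwell_labels(dwell_times, contexts, pct=90):
    """Label an impression positive if its dwell time exceeds the
    pct-th percentile within its own context, avoiding one static
    global threshold biased toward naturally long-dwell content."""
    labels = np.zeros(len(dwell_times), dtype=np.int64)
    for c in np.unique(contexts):
        mask = contexts == c
        threshold = np.percentile(dwell_times[mask], pct)
        labels[mask] = (dwell_times[mask] > threshold).astype(np.int64)
    return labels

rng = np.random.default_rng(5)
dwell = rng.exponential(scale=10.0, size=1000)  # toy dwell times in seconds
ctx = rng.integers(0, 4, size=1000)             # 4 toy contexts
labels = long_dwell_labels(dwell, ctx)
```

By construction, roughly 10% of impressions in every context become positives, so content types with long average dwell times do not dominate the label distribution.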
Member History Modeling
Modeling member interactions with platform content uses historical interaction sequences. Item embeddings are learned and concatenated with action embeddings and the embedding of the currently scored item (early fusion).
A Transformer-Encoder processes this sequence, using the max-pooling token as a feature. The last five sequence steps can also be flattened and concatenated as additional features. Ablation studies showed gains from adding encoder layers (largest from zero to one), increasing feedforward dimension, and increasing sequence length.
The optimal latency configuration was two encoder layers, feedforward dimension 1/2x embedding size, and sequence length 50, referred to as TransAct. This technique contributed +1.66%.
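A condensed numpy sketch of the encoder's data flow (illustrative only; the production model is a trained multi-layer Transformer-Encoder with learned embeddings): item, action, and scored-item embeddings are concatenated per step (early fusion), passed through one self-attention layer, and max-pooled into a single feature vector.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def encode_history(item_emb, action_emb, scored_item_emb, Wq, Wk, Wv):
    """One self-attention layer over an early-fused history sequence,
    max-pooled over time into a fixed-size feature."""
    seq_len = item_emb.shape[0]
    scored = np.tile(scored_item_emb, (seq_len, 1))      # broadcast scored item
    x = np.concatenate([item_emb, action_emb, scored], axis=1)  # early fusion
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    attn = softmax(q @ k.T / np.sqrt(k.shape[1]))        # scaled dot-product
    return (attn @ v).max(axis=0)                        # max-pool over steps

rng = np.random.default_rng(7)
seq_len, d = 50, 16                       # sequence length 50, as in the text
item = rng.normal(size=(seq_len, d))
action = rng.normal(size=(seq_len, d))
scored = rng.normal(size=d)
Wq = rng.normal(size=(3 * d, d)) * 0.1
Wk = rng.normal(size=(3 * d, d)) * 0.1
Wv = rng.normal(size=(3 * d, d)) * 0.1
feature = encode_history(item, action, scored, Wq, Wk, Wv)
```

Early fusion means the attention weights can depend on the currently scored item, so the same history is summarized differently for each candidate post.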
Balancing Novelty and Relevance with Large Language Models (LLMs)
While LiRank's deep learning techniques significantly improved relevance, the balance with novelty remained challenging. LLMs offer potential here by leveraging their world knowledge and reasoning capabilities to generate novel and relevant recommendations.
A recent paper entitled “User Feedback Alignment for LLM-powered Exploration” proposes a novel approach combining hierarchical planning with LLM inference-time scaling. This method decouples novelty and user alignment into two specialized LLMs: a novelty model and an alignment model.
The novelty model generates diverse, potentially novel, content suggestions. The alignment model is trained specifically to rate the novelty model's predictions based on observed user feedback, ensuring the content remains relevant and engaging. This allows for the independent optimization of each objective.

Collective User Feedback Alignment
Recommendation systems face challenges in effectively using real-world human feedback, as they rely on noisy implicit signals (like clicks and dwell time) rather than explicit comparative judgments. To address this, the alignment model is trained using collective user feedback gathered from live-traffic interactions with LLM-powered recommendations.
Feedback (e.g., positive playback, like, share, skip) is logged for each predicted cluster and the user's history. This feedback is then aggregated for each ({C1, ..., CK}, Cn) pair (where {C1, ..., CK} is the user's historical cluster sequence and Cn is the predicted novel cluster) to generate an aggregated feedback score L(1,K),n.

Aggregating signals across user clusters helps reduce noise and bias. The aggregated feedback is post-processed (normalized, filtered, rounded) to create training examples.
Both pointwise ({C1, ..., CK}, Cn, L(1,K),n) and pairwise (ranking different Cn for a sequence by score) training examples were tested for training the alignment model. The alignment model, an LLM with a linear projection layer, is trained to score predictions using a cross-entropy loss against the aggregated user engagement.
In live experiments, the pointwise model slightly outperformed the pairwise model and was 2x faster to train, leading to its deployment.
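The aggregation step can be sketched as follows (a hypothetical illustration: the signal names, the positive/negative split, and the simple positive-rate score are our assumptions, not the paper's exact post-processing):

```python
from collections import defaultdict

def aggregate_feedback(events, positive_signals=frozenset({"playback", "like", "share"})):
    """Pool per-user feedback for each (history clusters, predicted cluster)
    pair across users, then reduce it to one score per pair."""
    counts = defaultdict(lambda: [0, 0])          # pair -> [positives, total]
    for history, predicted, signal in events:
        key = (tuple(history), predicted)
        counts[key][1] += 1
        if signal in positive_signals:
            counts[key][0] += 1
    return {k: pos / total for k, (pos, total) in counts.items()}

events = [
    (["C1", "C2"], "C9", "like"),
    (["C1", "C2"], "C9", "skip"),
    (["C1", "C2"], "C9", "share"),
    (["C3"], "C4", "skip"),
]
scores = aggregate_feedback(events)   # e.g. {(('C1','C2'), 'C9'): 2/3, ...}
```

Each resulting (history, predicted cluster, score) triple is exactly the shape of a pointwise training example for the alignment model.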
Inference Scaling with Best-of-N User Alignment
The work introduced inference-time scaling to improve the system's ability to generate relevant novel predictions. Instead of just taking one prediction from the novelty LLM, it repeatedly and independently samples multiple predictions (e.g., 5 times more) from the novelty LLM using a high temperature setting to encourage diversity.
The separate alignment model then acts as a selector, scoring these candidates based on user preference and choosing the best-of-n.
Crucially, this scaling, scoring, and selection happen offline, and the cost is amortized across bulk inference runs. This ensures no additional latency impact on online serving, which is critical for large-scale real-time systems, and addresses the Queries Per Second (QPS) challenge.
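Schematically, the offline loop looks like this (all functions here are hypothetical stand-ins: `sample_novel_cluster` plays the novelty LLM sampled at high temperature, and `alignment_score` plays the alignment model's rating):

```python
import numpy as np

def best_of_n(user_history, sample_novel_cluster, alignment_score, n=5, rng=None):
    """Best-of-n selection: draw n independent candidates offline and
    keep the one the alignment model scores highest."""
    candidates = [sample_novel_cluster(user_history, rng) for _ in range(n)]
    scores = [alignment_score(user_history, c) for c in candidates]
    return candidates[int(np.argmax(scores))]

# Toy stand-ins: clusters are integers, and the alignment model happens
# to prefer clusters near 7 for this user.
def sample_novel_cluster(history, rng):
    return int(rng.integers(0, 20))

def alignment_score(history, cluster):
    return -abs(cluster - 7)

rng = np.random.default_rng(6)
pick = best_of_n([1, 2, 3], sample_novel_cluster, alignment_score, n=25, rng=rng)
```

Because the whole loop runs in bulk offline inference, increasing n trades amortized compute for candidate quality without touching online serving latency.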
This dual-LLM setup avoids the challenge of teaching one model both novelty and relevancy, which can be competing objectives.
By reflecting on the novelty prediction using an LLM aligned with user feedback, the system improves exploration efficiency by demoting predictions that are less likely to satisfy users.
The approach combines the personalization capabilities of traditional recommenders (constrained to items within the novel cluster) with the LLM's novelty-seeking behavior.
Deployment Lessons
Deploying large-scale ranking models comes with significant practical challenges. LinkedIn shared several key lessons learned:
Scaling Training Data Generation
Scaling the Feed training data pipeline from 13% to 100% of sessions caused significant delays due to a join between post labels and features.
The solution involved optimizing the join: exploding only post features/keys, joining with labels, then adding session features in a separate, smaller join, which reduced runtime by 80%. Tuning Spark compression yielded an additional 25% reduction.
Model Convergence
Adding DCNv2 introduced model training divergence issues. These were resolved by increasing the learning rate warm-up (from 5% to 50% of training steps), applying batch normalization to numeric inputs, and using higher learning rates to compensate for fixed training steps when under-fitting was observed.
Different optimizers were needed for specific models; AdaGrad was crucial for models with numerous sparse features, where Adam was ineffective. A generalized learning rate warm-up for larger batch sizes also improved generalization.
Serving Large Models with Memory Constraints
Constrained memory on serving hosts initially hindered deploying multiple large models. An initial external serving strategy for ID embeddings had iteration flexibility and staleness issues. Transitioning to in-memory serving improved engagement metrics and reduced operational costs.
This was enabled by upgrading hardware, tuning garbage collection, and optimizing memory consumption through quantization and ID vocabulary transformation. Minimal Perfect Hashing Function (MPHF) in TF Custom Ops initially reduced vocab lookup memory by 100x but slowed training 3x.
Hashing strings to int32 using Apache Commons Codec and a fastutil map implementation reduced heap size by 93% without training degradation. The QR hashing approach described earlier ultimately eliminated the static hash table without performance drops.
Results and Impact
Integrating deep learning techniques (LiRank) and LLMs has driven significant tangible results at LinkedIn.
The LiRank framework's advancements, including Residual DCN and other modeling techniques, have led to notable improvements across LinkedIn's core products:
- +0.5% member sessions in the Feed.
- +1.76% qualified job applications for Jobs search and recommendations.
- +4.3% for Ads CTR.
- Incremental training resulted in metric boosts and a 96% reduction in training time for both Feed ranking and Ads CTR models.
- Model parallelism reduced training time from 70 to 20 hours (71% reduction).
- The custom Avro Tensor Dataset Loader reduced e2e training time by 50%.
- Offloading last-mile transformations asynchronously reduced e2e training time by 20%.
- Prefetching datasets to GPU reduced e2e training time by 15%.
For LLM-powered exploration, the user feedback alignment approach demonstrated a measurable impact on user experience and engagement. By balancing exploration and relevance through the dual-model system and inference-time scaling:
- Live A/B tests observed an increase in user interests and engagement metrics.
- The system simultaneously achieved high novelty and user satisfaction, described as a "rare combination".
- This resulted in a "significantly improved operating curve for user interest exploration" by ensuring content is new and aligned with user preferences.
Future Directions and Challenges
As large-scale ranking models evolve, overcoming key challenges is essential, particularly regarding scaling efficiency, continuous adaptation, maintaining the novelty/relevance balance, and addressing ethical concerns.
- Scaling Models for Efficiency: As models grow more complex, especially with LLMs, further innovation is needed to balance accuracy with computational efficiency. Optimizing training infrastructure and improving model serving speeds are critical.
- Continuous Learning and Adaptation: User preferences are dynamic, requiring models to adapt continuously. While incremental training is useful, future systems will require more advanced lifelong learning techniques that can adjust in near real-time.
- Balancing Novelty and User Preferences: The exploration vs. exploitation dilemma persists. While LLMs are powerful for novelty, fine-tuning them to handle complex feedback and independently optimize novelty and relevance in real-time is a future challenge.
- Ethics, Bias, and Fairness: Integrating fairness-aware training and diversity-promoting strategies is necessary to ensure equitable and inclusive recommendations and avoid perpetuating stereotypes.
- Multimodal and Multidimensional Models: Future systems will likely integrate text, images, video, and other modalities, requiring models combining LLMs with vision and audio processing for richer, more personalized recommendations.
By addressing these challenges, large-scale personalized recommendation systems can continue to enhance user experience and drive engagement effectively across increasingly diverse content and user bases.