Thanks to Qiang Chen for the review.
Open source simulation repo, Paper pre-print
A recent paper offers a compelling case for rethinking a fundamental aspect of video recommendation systems, particularly for platforms where "time spent" is the ultimate currency. If your work involves keeping users engaged and current CTR-focused models seem to be missing the broader picture of true user value, then the research from Tubi, "Tweedie Regression for Video Recommendation System" by Yan Zheng, Qiang Chen, and Chenglei Niu warrants close examination. This isn't merely an incremental model tweak; it's an argument for revisiting how the ranking problem itself is defined.
For many Video-on-Demand (VOD) services, especially ad-supported ones (AVOD) like Tubi, the primary objective extends beyond simple clicks. While CTR provides a valuable signal, the ultimate economic driver, and arguably a more accurate proxy for deep user engagement and satisfaction, is maximized watch time. Increased watch time translates to more ad inventory, richer data for preference learning, and an indication that users are genuinely deriving value. The challenge, as Tubi's team articulates, is that much of current industry practice defaults to treating recommendation ranking as a classification task: predicting click-through rates (CTR) and optimizing via LogLoss. This can create an inherent disconnect with the core business goal. Tubi's paper proposes a direct, statistically robust solution: reframe video ranking as a regression problem focused on directly predicting user viewing time, and critically, employ a loss function that accurately models the unique, often challenging, distribution of watch time data – the Tweedie loss.
The Quirky Nature of Watch Time: Why Common Losses Stumble
Before delving into the specifics of Tweedie regression, it's important to understand why predicting watch time is a non-trivial task. Plotting raw watch times for items in a recommendation feed typically reveals a distinct statistical signature:
- A Mountain at Zero: A significant fraction of recommended items are never clicked, or if clicked, are abandoned almost immediately. This results in a large probability mass concentrated at `watch_time = 0`.
- A Skewed Landscape for the Watched: For videos that do capture engagement, the watch duration is a continuous, positive variable. However, this distribution is rarely symmetric. Shorter watch times are common, moderate durations less so, and very long watch times form a characteristic long tail.
- Positivity Constraint: Watch time, by definition, cannot be negative.
This profile makes standard loss functions problematic:
- Mean Squared Error (MSE): Assumes Gaussian errors, an assumption clearly violated by the zero-inflation and extreme skew. It's also overly sensitive to outliers in the long tail of watch durations.
- LogLoss (for CTR): This is fundamentally a classification loss. While one can weight samples by watch time (a common heuristic, and indeed Tubi's control group baseline), the model is still indirectly optimizing for duration by proxy of a binary click event, rather than directly learning to predict the continuous watch time value.
This is where the Tweedie distribution provides a more appropriate statistical framework. It's a member of the powerful Exponential Dispersion Model (EDM) family, uniquely capable of modeling data that is simultaneously zero-inflated and positively skewed on the continuous scale.

The intuition behind the Tweedie distribution often involves a compound Poisson-Gamma process:
- Imagine a Poisson process that determines M, the number of discrete "engagement events" or "interest segments" a user experiences with a piece of content. If M is zero (no events occur), the watch time is zero.
- If M > 0, then for each of these M events, there's an associated "intensity" or "duration" `C_i`, drawn from a Gamma distribution.
- The total watch time X is then the sum of these M Gamma-distributed durations.
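The compound process is easy to simulate. Here is a minimal NumPy sketch (my own illustration, not the paper's code; the parameter values are arbitrary) showing how it produces both the spike at zero and the skewed positive tail:

```python
import numpy as np

rng = np.random.default_rng(42)

def sample_watch_time(lam, shape, scale, n):
    """Compound Poisson-Gamma: M ~ Poisson(lam) engagement events, each
    contributing a Gamma(shape, scale) duration; watch time is their sum."""
    m = rng.poisson(lam, n)                     # events per (user, title) pair
    # A sum of m i.i.d. Gamma(shape, scale) draws is Gamma(m * shape, scale).
    return np.where(m > 0, rng.gamma(np.maximum(m, 1) * shape, scale), 0.0)

watch = sample_watch_time(lam=0.8, shape=2.0, scale=5.0, n=100_000)
print(f"exact zeros: {(watch == 0).mean():.3f}")        # ≈ exp(-0.8) ≈ 0.45
print(f"mean of positive part: {watch[watch > 0].mean():.1f}")
```

Note that the mass at zero is exactly the Poisson probability of zero events, exp(-lam), while the positive part is a Gamma mixture: precisely the zero-inflated, right-skewed shape described above.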
The versatility of the Tweedie distribution is controlled by a power parameter p:
- `p = 0`: Degenerates to a Normal distribution.
- `p = 1`: Becomes a Poisson distribution (suitable for count data).
- `1 < p < 2`: This is the key range for modeling watch time. It represents the compound Poisson-Gamma process, exhibiting that crucial mass at zero and continuous, skewed positive values.
- `p = 2`: Becomes a Gamma distribution (continuous positive, but no specific mass at zero).
- `p > 2`: Represents other, more extreme, heavy-tailed distributions.
The Tubi team, through pre-analysis (detailed later), determined that p ≈ 1.5 provided the best empirical fit for their real-world watch time data. This value places it squarely in the compound Poisson-Gamma regime. Their paper includes a compelling visualization of this alignment:

The Tweedie loss function is derived by maximizing the likelihood of observing the data, assuming it follows a Tweedie distribution. For a true watch time y and a predicted mean watch time ŷ, the loss (ignoring scaling constants and specific to the power parameter p) is: Loss_Tweedie(y, ŷ; p) = -y * (ŷ^(1-p) / (1-p)) + (ŷ^(2-p) / (2-p))
When `p = 1.5`, this formula simplifies to: `Loss_Tweedie(y, ŷ; p=1.5) = 2y * ŷ^(-0.5) + 2 * ŷ^(0.5)` (substituting p = 1.5 makes the first denominator 1 - p = -0.5, which flips that term's sign).
The model's objective then becomes to predict ŷ such that this loss is minimized.
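To make that concrete, here is a small NumPy sketch (my own, not from the paper) of the loss together with a numerical check that it bottoms out where the prediction equals the true watch time:

```python
import numpy as np

def tweedie_loss(y, y_hat, p=1.5):
    """Tweedie negative log-likelihood (up to terms constant in y_hat),
    valid for 1 < p < 2; y_hat must be strictly positive."""
    return -y * y_hat ** (1 - p) / (1 - p) + y_hat ** (2 - p) / (2 - p)

# Sanity check: for p = 1.5 the loss is 2*y/sqrt(y_hat) + 2*sqrt(y_hat),
# which is minimized where the prediction matches the true watch time.
y_true = 40.0
grid = np.linspace(1.0, 200.0, 2000)
best = grid[np.argmin(tweedie_loss(y_true, grid))]
print(best)  # ≈ 40, the true watch time
```

Notice also what happens at `y = 0` (an unclicked impression): the loss reduces to `2 * sqrt(y_hat)`, which pushes the prediction toward zero, so the zero-mass of the distribution is handled by the same objective with no special casing.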
Setting the Stage: How This Fits into the Broader ML Landscape (Section II)
Tubi's work is well-grounded within existing research areas:
- Tweedie's Track Record in Insurance (II.A): The insurance industry has a long history of using Tweedie regression to model claim amounts. Insurance claims often exhibit similar characteristics to watch time: many policyholders file zero claims (the mass at zero), while those who do file claims have amounts that are positive and skewed. This cross-domain applicability lends credibility to the choice of distribution.
- Evolution of Learning to Rank (LTR) (II.B): The paper acknowledges the progression of LTR techniques in recommendations, from pointwise methods (like LogLoss for CTR) to pairwise and listwise approaches, and the development of more sophisticated loss functions incorporating Information Retrieval metrics like NDCG. They specifically cite Covington et al. (Google, 2016), whose approach of weighting LogLoss samples by watch time serves as an important and sophisticated baseline for implicitly optimizing viewing duration. Tubi's work aims to demonstrate that a direct regression approach can be superior. They also highlight a specific gap: while short-form video platforms have explored watch time prediction, long-form VOD (Tubi's domain of movies and series) has seen less explicit focus on regression using loss functions tailored to the data's unique structure.
- The Multi-Interest Paradigm (II.C): Research into modeling users with multiple, potentially diverse interests (e.g., using multi-head attention mechanisms in deep learning models) offers a conceptual link. The compound nature of the Tweedie process (summing multiple "Gamma" distributed events) can be interpreted as an implicit way to capture how different facets of a user's interest profile (or even different users on a shared device) contribute to the overall watch time for a single content item.
The "Why": Theoretical Justifications (Section III)
This section of the paper delves into the mathematical and conceptual reasoning that underpins the choice of Tweedie regression.
- Head-to-Head: Tweedie Loss vs. Weighted LogLoss (III.A)
This is arguably the most critical theoretical argument presented. The authors conduct an analytical comparison of their proposed Tweedie loss (with `p = 1.5`) against the strong baseline of viewing-time-weighted LogLoss. Let `y_i` represent the true watch time and `ŷ_i` the model's prediction (interpreted as a click probability for LogLoss and a predicted mean watch time for Tweedie, assumed to be normalized to a comparable range for this analytical comparison). The paper employs Taylor series expansions for terms like `ln(ŷ_i)` (relevant to LogLoss) and `ŷ_i^(-0.5)` (relevant to the Tweedie loss with `p = 1.5`). For positive samples (where a click occurs, causing the `ln(1 - ŷ_i)` term in LogLoss to vanish), their analysis (Equation 6 in the paper) suggests the following approximate forms:

TLoss ~ y_i * [-f(ŷ_i) - f(ŷ_i)^2 - O(f(ŷ_i)^3)]
CLoss (weighted) ~ y_i * [-f(ŷ_i) - f(ŷ_i)^2/2 - O(f(ŷ_i)^3)]

where `f(ŷ_i) = 1 - sqrt(ŷ_i)` is a transformation of the prediction `ŷ_i`.

The key difference emerges in the coefficient of the second-order term `f(ŷ_i)^2`: for Tweedie loss this coefficient is -1, while for weighted LogLoss it is -1/2. This mathematical distinction implies that when `ŷ_i` is less than 1 (as probabilities or normalized values typically are), Tweedie loss exhibits greater asymptotic sensitivity to prediction errors. It penalizes deviations from the true watch time more aggressively, particularly as the prediction error increases. My interpretation of this is that the heightened sensitivity allows a model trained with Tweedie loss to more effectively utilize the rich feature information (x) to learn the continuous nuances of watch time. The model is directly tasked with outputting a continuous value and is judged more stringently and appropriately for its accuracy in doing so, unlike LogLoss, which remains fundamentally a classification loss merely weighted by a continuous external factor.
- A Framework for Decomposing Losses (III.B)
This part of the paper is more conceptual and forward-looking. The authors propose a theoretical framework wherein different online business metrics (such as `Watch_Duration` and `Conversion`, as exemplified in their Equation 7) can be considered as residing in distinct subspaces within a larger Hilbert space. A model's loss function (`L_g` in Equation 8) could, theoretically, be decomposed using a suitable basis (like the Taylor series terms `f_n(x) = x^n`) into a set of coordinates `(C_1g, C_2g, C_3g, ...)`. The ambitious idea presented is that if one could empirically map these loss-function coordinates to observed online A/B test lifts for specific metrics (as suggested by Equation 9: `Metric_Watch_Duration,g = c_g^T ⋅ t`), one could then solve for the "projection coefficients" (t for watch duration, v for conversion). These coefficients would quantify how sensitive each business metric is to changes in the "shape" or characteristics of the loss function. Armed with these coefficients, a new, composite loss function could be engineered to specifically maximize a single, target business objective (e.g., total viewing time) by appropriately weighting the contributions of different elemental loss characteristics. This represents an advanced concept aiming for a highly principled method of multi-objective optimization, or for crafting a loss function acutely tailored to one key metric. While not implemented in their main reported results, it indicates a sophisticated direction for future research in loss function engineering.
- Revisiting the Multi-Interest Analogy (III.C & Table I)
The paper reinforces the rationale for choosing Tweedie by drawing a clear and insightful analogy between actuarial science claim modeling and recommendation system viewing time, an analogy particularly relevant for VOD platforms.
This conceptual alignment is especially pertinent for common VOD scenarios, such as Over-The-Top (OTT) devices (e.g., smart TVs) shared by multiple household members with varying tastes, or a single long-form video (like a movie) that might cater to several distinct interests of a single viewer (e.g., satisfying interests in action, science fiction, and a favorite actor simultaneously). The Tweedie process offers a natural mathematical framework for modeling this aggregation of engagement.

Experiments and Real-World Validation (Section IV)
This section details the empirical work where the theoretical advantages of Tweedie regression are put to the test, yielding compelling results.
Controlled Environment: User Simulation on Synthetic Data (IV.A)
To isolate the impact of the loss function and ensure the reproducibility of their findings, the Tubi team first conducted an extensive user simulation.
Simulation Assumptions (Key Points):
- Each title within the simulation was assigned a predefined, static click probability, drawn from a normal distribution.
- Users were modeled as "featureless" – meaning no individual user profiles or historical features were used. This simplification helps to focus the evaluation purely on the item-level prediction capabilities as influenced by the different loss functions, rather than on the complexities of user modeling.
- Titles were endowed with an inherent "completion intention" probability, as well as distinct completion rate distributions for users classified as "intenders" (those likely to finish) versus "non-intenders."
- Each title was assigned a unique duration, sampled from a normal distribution.
- User browsing behavior was simulated by having users scroll through a ranked list with a certain probability of stopping (abandoning the session) at any given point.
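The assumptions above can be sketched in a few lines. This toy NumPy version is my own paraphrase, not the open-source repo's code, and every parameter value here is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(0)

N_TITLES = 1_000
click_prob = rng.normal(0.05, 0.02, N_TITLES).clip(0.005, 0.5)  # static per-title CTR
intent_prob = rng.uniform(0.1, 0.6, N_TITLES)                   # "completion intention"
duration = rng.normal(90.0, 20.0, N_TITLES).clip(30.0)          # title length (minutes)
P_CONTINUE = 0.9                                                # scroll-on probability per row

def simulate_session(ranking):
    """One featureless user scrolls a ranked list, possibly clicking each
    title, and abandons the session with probability 1 - P_CONTINUE per row."""
    total_watch = 0.0
    for t in ranking:
        if rng.random() < click_prob[t]:
            # Intenders finish most of a title; non-intenders bail out early.
            rate = rng.beta(8, 2) if rng.random() < intent_prob[t] else rng.beta(2, 8)
            total_watch += rate * duration[t]
        if rng.random() > P_CONTINUE:
            break
    return total_watch

watch_times = np.array([simulate_session(rng.permutation(N_TITLES)[:50])
                        for _ in range(5_000)])
print(f"zero-watch sessions: {(watch_times == 0).mean():.2f}")
```

Even this crude version reproduces the target-variable shape from earlier: a large share of sessions end with zero watch time, and the positive sessions are right-skewed.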
Simulation Protocol & Settings:
- The simulation spanned a 13-day period, involving 10,000 simulated users and 1,000 unique titles.
- Days 1-3: Users were presented with a "human-edited" (fixed) ranking. This phase served to collect initial interaction data.
- Days 4-13: The various models under test were responsible for generating the rankings presented to users. Crucially, these models were retrained daily using the newly collected user feedback from the previous day. This iterative retraining process is vital for allowing the models to adapt to temporal dynamics and learn from the consequences of their own previous recommendations.
- Training parameters: Learning rate of 1e-3, 100 epochs per daily training session, with data shuffling employed to mitigate potential biases arising from the order of presentation.
Model Architecture & Loss Functions Compared:
- To ensure a fair comparison, a consistent, relatively simple neural architecture was used for all models: `Title ID -> 16-dimensional Embedding -> FullyConnected(16->8) -> FullyConnected(8->1) -> Scalar Output`.
- This scalar output was then interpreted differently based on the specific loss function being evaluated:
  - Pointwise Model (Click Optimization): Optimized using standard Logistic Loss, `LogLoss[ŷ_i] = -[y'_i ln(ŷ_i) + (1 - y'_i) ln(1 - ŷ_i)]`, where `y'_i` is the binary click label.
  - Watch-Duration Weighted Model: Used the same LogLoss, but each clicked sample (`y'_i = 1`) was weighted by its total actual watch duration `y_i`. This represents a strong and commonly used baseline.
  - Regression Model: Optimized using Mean Squared Error, `MSE[ŷ_i] = (ŷ_i - y_i)^2`, directly predicting the continuous watch time `y_i`.
  - Tweedie Regression Model: Optimized using the proposed Tweedie Loss, `TweedieLoss[y_i, ŷ_i] = -y_i * (ŷ_i^(1-p)/(1-p)) + (ŷ_i^(2-p)/(2-p))` with `p = 1.5`, also directly predicting continuous watch time.
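The four objectives side by side, as a NumPy sketch (a paraphrase of the formulas above, not the repo's implementation; the `EPS` clipping guards are my addition for numerical safety):

```python
import numpy as np

EPS = 1e-7  # numerical guard for logs and negative powers

def log_loss(y_click, p_click):
    """Pointwise CTR objective: binary cross-entropy on the click label."""
    p = np.clip(p_click, EPS, 1 - EPS)
    return -(y_click * np.log(p) + (1 - y_click) * np.log(1 - p))

def weighted_log_loss(y_click, p_click, watch_time):
    """Same LogLoss, but clicked samples are weighted by actual watch time."""
    w = np.where(y_click == 1, watch_time, 1.0)
    return w * log_loss(y_click, p_click)

def mse_loss(watch_time, pred_time):
    """Direct regression on watch time under a Gaussian-error assumption."""
    return (pred_time - watch_time) ** 2

def tweedie_loss(watch_time, pred_time, p=1.5):
    """Direct regression on watch time under a Tweedie(p) assumption."""
    mu = np.maximum(pred_time, EPS)
    return -watch_time * mu ** (1 - p) / (1 - p) + mu ** (2 - p) / (2 - p)
```

The key contrast is in what the scalar output means: for the first two losses it is a click probability, so watch time enters only as a sample weight; for the last two it is the watch time itself, so the loss judges the prediction directly.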
Simulation Results: Each of the four models was executed 10 times to ensure the stability and reliability of the observed results.

The Tweedie Regression Model consistently yielded the highest mean watch duration across the simulation runs.

The paper reported statistically significant lifts in total watch duration for the Tweedie model when compared to the others:
- Approximately +21% vs. the Pointwise (LogLoss) model.
- Approximately +20% vs. the Watch-Duration Weighted LogLoss model.
- Approximately +10% vs. the Regression (MSE) model.

This simulation phase provided strong initial evidence suggesting that direct regression with Tweedie loss is more effective for the explicit goal of maximizing watch duration than both standard CTR optimization techniques and the common heuristic of weighting CTR loss by duration.
Ground Truth: Pre-analysis of Real-world Application (IV.B)
Before committing to a large-scale online A/B test, the Tubi team performed essential due diligence on their production data:
- Confirming Distributional Fit: They analyzed raw viewing times from their platform. After applying Z-score normalization (to standardize the data, making it scale-invariant, and to handle the wide range of raw watch times) and truncating extreme outliers (a common preprocessing step to prevent a few very long sessions from disproportionately influencing initial analysis or visualizations), the resulting empirical distribution of watch times visually confirmed the characteristic Tweedie shape: a pronounced peak at zero and a skewed positive distribution for non-zero watch times.
- Optimizing the Power Parameter p for Tweedie: They conducted a grid search over the Tweedie distribution's parameters (μ - mean, p - power parameter, and φ - dispersion parameter). The goal was to find the parameter set that best fit the empirical distribution of their normalized, truncated real-world viewing times. The goodness-of-fit was quantified using the Kolmogorov-Smirnov (KS) statistic. The optimal value of p that minimized this KS statistic (thus making the theoretical Tweedie distribution "closest" to their observed data) was found to be approximately 1.5. This data-driven selection of p is a critical step, ensuring that the chosen variant of the Tweedie distribution is well-aligned with their specific dataset characteristics.
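A sketch of this kind of fit, entirely my own construction: it samples Tweedie variates via the standard compound Poisson-Gamma parameter mapping, uses synthetic stand-in data in place of Tubi's watch times, and for brevity grid-searches only p while holding μ and φ fixed (the paper searches over all three):

```python
import numpy as np

rng = np.random.default_rng(7)

def sample_tweedie(mu, p, phi, n):
    """Sample Tweedie(mu, p, phi) for 1 < p < 2 via its compound
    Poisson-Gamma representation (standard parameter mapping)."""
    lam = mu ** (2 - p) / (phi * (2 - p))      # Poisson rate for event count M
    alpha = (2 - p) / (p - 1)                  # Gamma shape per event
    scale = phi * (p - 1) * mu ** (p - 1)      # Gamma scale per event
    m = rng.poisson(lam, n)
    # A sum of m i.i.d. Gamma(alpha, scale) draws is Gamma(m * alpha, scale).
    return np.where(m > 0, rng.gamma(np.maximum(m, 1) * alpha, scale), 0.0)

def ks_two_sample(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between
    the two empirical CDFs, evaluated over the pooled sample points."""
    pooled = np.sort(np.concatenate([a, b]))
    cdf_a = np.searchsorted(np.sort(a), pooled, side="right") / len(a)
    cdf_b = np.searchsorted(np.sort(b), pooled, side="right") / len(b)
    return np.max(np.abs(cdf_a - cdf_b))

# Synthetic stand-in for the normalized, truncated empirical watch times.
observed = sample_tweedie(mu=1.0, p=1.5, phi=1.0, n=20_000)

# Grid search: keep the power parameter whose simulated Tweedie sample
# sits closest to the observed data under the KS statistic.
grid = np.round(np.arange(1.1, 2.0, 0.1), 1)
ks = [ks_two_sample(observed, sample_tweedie(1.0, p, 1.0, 20_000)) for p in grid]
best_p = grid[int(np.argmin(ks))]
print(f"best-fitting p: {best_p}")
```

Since the stand-in data is itself drawn at p = 1.5, the search should recover that value; on real data the same loop, extended to μ and φ, gives the data-driven choice of p the paper describes.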
The Decider: Real-world Application Online Experiment (IV.C)
This is where the approach demonstrates its value in a live, production environment.
- A/B Test Setup:
- Control Group: Tubi's then-current best-performing production ranking model. This was a pointwise ranker optimized with LogLoss, where individual training samples were weighted by their actual viewing time. This represents a very strong and widely adopted industrial baseline.
- Treatment Group: The proposed Tweedie Regression model. A key distinction here is that this model used equal weighting for all training samples; the responsibility for accounting for and predicting watch duration was entirely handled by the Tweedie loss function itself.
- Scale and Data: The experiment was conducted on Tubi's live user traffic, encompassing "several hundred million samples" derived from the interactions of "several million unique viewers" over a one-week period. Both the control and treatment groups were trained on an identical dataset in terms of raw features, overall size, and other data characteristics.
- Key Online Evaluation Metrics (Reported in Table III of the paper):

The reported results were:
- Revenue: +0.4% (Statistically Significant Lift)
- Total Viewing Time (device average): +0.15% (Statistically Significant Lift)
- Conversion Rate (defined as the percentage of users who watched content for more than 5 minutes during the experiment period): -0.17% (Statistically Significant DECREASE)
These online A/B test results are profoundly insightful. The Tweedie Regression model, by directly targeting the prediction of watch time, successfully moved the needle on the primary business objectives of revenue and overall user viewing time. The slight, yet statistically significant, decrease in the "conversion" metric (which acts as a proxy for a form of CTR, as users needed to watch at least 5 minutes to "convert" by this definition) is particularly telling. The authors interpret this outcome thoughtfully: "Such divergent behavior in key metrics is uncommon in typical experiments, where viewing time and conversion rate usually rise or fall together. Still, these findings are in accord with our hypothesis that optimizing for revenue or viewing time might require a trade-off in conversion, which correlates more closely with CTR." This suggests that the system, when optimized with Tweedie loss, became more adept at identifying and promoting content that would lead to longer, sustained user engagement, even if it meant that a tiny fraction fewer users initiated such engagement by crossing that specific 5-minute viewing threshold, compared to the control model which was more influenced by CTR-like signals (via weighting). It’s a classic illustration of a model specializing for a complex, high-value objective and demonstrating a willingness to make small, acceptable concessions on simpler, correlated metrics.
Wrapping Up: Conclusions and Future Horizons (Section V)
The work from Tubi provides a compelling demonstration of how a fundamental shift in problem formulation – moving from classification to regression – coupled with the selection of a statistically appropriate loss function (Tweedie) can lead to significant and measurable gains in core business metrics for VOD platforms. They have effectively shown that directly modeling and predicting watch time can be more potent than indirectly influencing it through techniques like weighting CTR-based models.
Key takeaways from this research:
- Align Loss with True Objective & Data Characteristics: The success of this approach underscores the importance of understanding the unique statistical distribution of your target variable (in this case, watch time) and aligning your modeling approach (regression with Tweedie loss) directly with your primary business goal (maximizing viewing time and, consequently, revenue).
- Moving Beyond CTR-centric Heuristics: While weighting LogLoss by duration is a useful and common heuristic, directly regressing on the target continuous variable with a loss function tailored to its distribution can unlock superior performance.
- Embrace and Understand Nuanced Metric Movements: Optimizing for a primary, complex objective might lead to non-intuitive or even slightly negative movements in secondary or simpler correlated metrics. A deep understanding of these trade-offs is crucial for making sound product and engineering decisions.
- The Enduring Power of Foundational Statistics: Tweedie regression is not a novel deep learning architecture; it's a robust statistical tool. This paper serves as an excellent example of how applying established statistical methods with fresh insight to modern, large-scale machine learning problems can yield significant improvements.
The authors suggest potential future avenues for research, including further exploration of their proposed "decomposition framework" for more sophisticated multi-objective optimization, or investigating the integration of specialized ranking losses within Mixture-of-Expert architectures to better handle diverse data patterns or multiple objectives simultaneously.