Rethinking Machine Learning in the Era of AI Product Development
Zachary Lipton (CMU, Abridge)
Zachary kicked things off with a thought-provoking session on how the paradigm for building AI products, like Abridge's conversational clinical note tools, has fundamentally shifted. He argued that the traditional "data -> model -> eval -> deploy" sequence is being inverted. Now, development often starts with prototyping capabilities (sometimes without labeled data), followed by evaluation, limited deployment, and then model training.
Key reflections:
- AI is the Product: Traditional ML training is less applicable when the AI itself, not just a component, is the core offering.
- The "Ship of Theseus" Design: Prototypes often look vastly different from MVPs, which in turn differ from the north-star solution, with less formalism guiding success. The cycle is now: model -> product -> data -> model.
- Distribution Shift Reconsidered: The old formalism around distribution shift struggles with the dynamic nature of AI products, new lexicons, and constantly evolving models.
- The Evolving AI Role: Doing AI work is becoming less like a pure statistician or engineer and more akin to a manager, dealing with "fuzzy rituals for performance evaluation."
Shoutout to Zack for doing this presentation with no slides!
Learning to Recommend via Generative Optimization
Adith Swaminathan (Netflix ML)
Adith from Netflix detailed how Large Foundation Models (LFMs) can enhance recommender systems by ingesting world knowledge and interpreting complex user feedback. However, LFMs alone aren't enough; they need integration with item catalogs and user histories. The current manual tuning of prompts and orchestration code is inefficient.
Adith's proposed solution is Trace, an open-source project for end-to-end generative optimization of these parameters.
- Trace as "PyTorch for AI workflows": It enables designers to operate at a meta-level, designing optimizers that refine agent performance iteratively.
- Optimizable Computation Graphs: Designing workflows that yield optimizable DAGs is crucial for effective learning and feedback attribution between sub-agents.
- Impressive Gains: Using feedback optimization via Trace showed a 20% improvement on the target benchmark.
- Inference Scaling: Trace is viewed as an "inference compute scaling" technique, adjacent to approaches like Chain-of-Thought, LLM-Modulo, and multi-agent orchestration.
- Challenges: Objective misalignment (ensuring workflow optimization aligns with learning objectives) and engineering robust workflows and feedback mechanisms.
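To make the "PyTorch for AI workflows" analogy concrete, here's a toy sketch of the underlying idea: text parameters (like prompts) live in a computation graph, feedback is attributed back to them, and an optimizer rewrites them iteratively. Note this is purely illustrative and not Trace's actual API; all class and method names here are hypothetical.

```python
# Illustrative sketch only: NOT Trace's actual API, just a toy model of
# generative optimization -- trainable text parameters that receive
# feedback (like gradients) and get rewritten by an optimizer.

class TextParam:
    """A trainable piece of text (e.g. a prompt) in the workflow graph."""
    def __init__(self, value):
        self.value = value
        self.feedback = []

    def backward(self, feedback):
        # Attribute feedback to this parameter, analogous to grads on a tensor.
        self.feedback.append(feedback)

class PromptOptimizer:
    """Toy optimizer: rewrites parameters based on accumulated feedback.
    A real generative optimizer would ask an LLM to propose the rewrite."""
    def __init__(self, params):
        self.params = params

    def step(self):
        for p in self.params:
            if p.feedback:
                # Stand-in for an LLM-proposed edit conditioned on feedback.
                p.value = p.value + " | revised after: " + p.feedback[-1]
                p.feedback.clear()

prompt = TextParam("Recommend three titles for this user.")
opt = PromptOptimizer([prompt])

# One "training" iteration: run the workflow, score it, propagate feedback.
prompt.backward("output ignored the user's watch history")
opt.step()
print(prompt.value)
```

The point of the sketch is the separation of concerns: the designer operates at the meta-level (choosing what is optimizable and how feedback flows), while the optimizer handles the iterative refinement.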

Graph Transformers in Practice: Kumo’s Approach to Personalization at Scale
Hema Raghavan (Kumo)
Hema presented Kumo's approach to using Graph Transformers directly on relational data, bypassing much of the traditional manual feature engineering for personalization and risk detection. The core idea is that the subgraph around an entity can be sequentialized (like a tree) without information loss, allowing Graph Transformers to attend across multiple columns, tables, and hops.
Kumo's platform simplifies this:
- Register your relational dataset.
- Define your predictive target.
- Let the GNN learn the features, capturing deeper context for real-time predictions.
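To illustrate the sequentialization idea (this is a toy sketch, not Kumo's implementation, and the entity names are made up): the subgraph around an entity can be linearized depth-first into a bracketed token sequence that preserves the tree structure losslessly, giving a transformer something to attend over across tables and hops.

```python
# Toy sketch of linearizing the subgraph around an entity in relational
# data into a token sequence a transformer can consume. Not Kumo's code.

def sequentialize(graph, root, depth=2):
    """Depth-first linearization of the subgraph around `root` into tokens."""
    tokens = [root]
    if depth == 0:
        return tokens
    for neighbor in graph.get(root, []):
        tokens.append("(")
        tokens.extend(sequentialize(graph, neighbor, depth - 1))
        tokens.append(")")
    return tokens

# Hypothetical relational data: a user linked to orders, orders to products.
graph = {
    "user:42": ["order:1", "order:2"],
    "order:1": ["product:milk"],
    "order:2": ["product:yogurt"],
}

# The brackets preserve the tree structure, so no hand-engineered
# aggregate features are needed before the model sees the data.
print(sequentialize(graph, "user:42"))
```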
In the talk they provided a short demo of how the platform works.
Synthetic Evaluations & GenAI Application Development for Finance
Edgar Meij (Bloomberg)
Edgar from Bloomberg discussed the critical role of evaluation in building AI applications for the multifaceted financial decision-making process. Given the domain-specific and critical thinking skills involved, robust evaluation is paramount from ideation to post-release monitoring.
- LLMs in Evaluation: LLMs offer promise for faster, easier, and potentially more accurate judgments and annotations, leading to the emergence of "synthetic evaluation" paradigms.
- Bloomberg's Focus: They are strategically moving towards a more agentic infrastructure, focusing on document Q&A and summarization within the Bloomberg terminal.
See more here: https://www.bloomberg.com/company/press/bloomberg-launches-gen-ai-summarization-for-news-content/
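The synthetic-evaluation pattern can be sketched minimally as an LLM-as-judge harness: a judge model grades outputs against a rubric, standing in for some manual annotation. The judge below is a stub (in practice it would be an LLM call), and the rubric format is hypothetical.

```python
# Minimal sketch of "synthetic evaluation": an LLM grades model outputs
# against a rubric. The judge here is a stub standing in for an LLM call.

def llm_judge(question, answer, rubric):
    """Stub judge. A real implementation would prompt an LLM with the
    rubric and parse a structured score from its response."""
    # Toy heuristic standing in for the model's judgment.
    return 1.0 if rubric["required_term"] in answer.lower() else 0.0

def evaluate(samples, rubric):
    """Average judge score over a set of (question, answer) samples."""
    scores = [llm_judge(s["q"], s["a"], rubric) for s in samples]
    return sum(scores) / len(scores)

samples = [
    {"q": "Summarize the earnings call.", "a": "Revenue grew 12% YoY."},
    {"q": "Summarize the filing.", "a": "No figures were given."},
]
rubric = {"required_term": "revenue"}
print(evaluate(samples, rubric))  # fraction of answers passing the rubric
```

The hard part, as the talk stressed, is keeping such judges calibrated across the whole lifecycle, from ideation to post-release monitoring.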
Putting the 'You' in YouTube: Better Personalization through Larger Models
Lexi Baugher (YouTube)
Lexi detailed YouTube's multi-faceted approach to leveraging larger models for their planet-sized personalization challenge, inspired by the success of LLMs.
- Approach 1: Scale Traditional Recommender Models:
- Knowledge distillation (with adaptations like auxiliary distillation) is key for managing inference costs. Distillation improves more as the teacher model gets bigger.
- TPU efficiency through quantization (e.g., bfloat16 -> int8 for a 20% speedup).
- Acknowledging the "knowledge gap" challenge: at what point is a teacher model too big to effectively teach a student?
- Approach 2: Delegate Planning to LLMs:
- LLMs, with their strengths in text and multimodal understanding, can handle topic planning, leading to higher quality exploration and discovering connections user feedback models missed.
- Approach 3: Generative Retrieval:
- Using Transformer models to complete sequences of items, moving from video IDs to "Semantic IDs" (learned representations), showed a 30% recall increase in areas like beauty. This helps with generalization while allowing other models to be used.
- Paper here
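The distillation objective behind Approach 1 can be sketched generically (this is the standard temperature-scaled formulation, not YouTube's specific setup): the student matches the teacher's softened distribution, which transfers ranking information beyond hard labels.

```python
import numpy as np

# Generic sketch of the knowledge-distillation loss: cross-entropy between
# the teacher's and student's temperature-softened distributions.

def softmax(logits, temperature=1.0):
    z = logits / temperature
    z = z - z.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy from teacher soft targets to student predictions."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return -np.sum(p_teacher * np.log(p_student + 1e-12))

teacher = np.array([4.0, 1.0, 0.5])   # large model's scores over items
student = np.array([3.0, 1.5, 0.2])   # small serving model's scores
print(distillation_loss(student, teacher))
```

The "knowledge gap" question from the talk maps directly onto this loss: when the teacher's distribution is far outside what the student's capacity can represent, matching it stops helping.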
From Many Models to Few: Instacart's LLM-Driven Approach to Search and Discovery
Tejaswi (Instacart)
Tejaswi shared Instacart's journey of replacing multiple bespoke deep learning models for query understanding, retrieval, and ranking with a few powerful LLMs.
- Challenges of Conventional Search: Difficulties with query interpretation (e.g., broad queries like "snacks") and data sparsity for tail queries (e.g., "unsweetened plant-based yogurt").
- LLM for Query Understanding: A single LLM replaced multiple models for tasks like spell correction and mapping queries to ~6k product categories, leveraging world knowledge for better tail query performance.
- LLMs for Product Discovery: Generating inspirational content and relevant substitute/complementary items, though content evaluation proved harder than anticipated.
- Key Insight: Combining the world knowledge of LLMs with domain-specific knowledge (e.g., from search logs) has been incredibly fruitful.
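The query-understanding pattern above can be sketched as an LLM proposing categories that are then validated against the fixed taxonomy, so the model cannot invent categories. Everything here is hypothetical: the category names, the stub's behavior, and the validation scheme are illustrative, not Instacart's implementation.

```python
# Hedged sketch: one model maps a raw query to entries in a fixed category
# taxonomy, with outputs filtered against the taxonomy. The LLM is stubbed.

TAXONOMY = {"snacks", "yogurt", "plant-based yogurt", "chips", "produce"}

def llm_categorize(query):
    """Stub for an LLM call that proposes candidate categories."""
    proposals = {
        "snacks": ["snacks", "chips", "candy aisle"],  # "candy aisle" invalid
        "unsweetened plant-based yogurt": ["plant-based yogurt", "yogurt"],
    }
    return proposals.get(query, [])

def categorize(query):
    # Keep only proposals that exist in the real taxonomy.
    return [c for c in llm_categorize(query) if c in TAXONOMY]

print(categorize("snacks"))
print(categorize("unsweetened plant-based yogurt"))
```

The tail-query win comes from the proposal step: the LLM's world knowledge covers "unsweetened plant-based yogurt" even when search logs are sparse, while the taxonomy filter keeps outputs grounded in the actual catalog.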
Fireside Chat with Kevin Scott (Microsoft CTO) & Elizabeth Stone (Netflix CTO)

This panel was one of my highlights for the day. It was an engaging discussion offering high-level perspectives on AI's trajectory and Kevin Scott's approach to leading 100k engineers at Microsoft. Unfortunately it was so engaging that I didn't take as many notes, but here's some of what I did write down:
- AI as an Enabler: AI can help manage and prioritize the overwhelming list of tasks and opportunities organizations face.
- The Future of Personalization: It will shift from standard retrieval from candidate sets to proactively "going and finding" what the user needs.
- Innovation & Leadership: When asked about communicating a bold future without sounding delusional, Kevin emphasized "show, don't tell." He also encouraged teams not to be intimidated by what giants like Google or Anthropic are doing, praising efforts like "Cursor and Windsurfer" as examples of companies making "awesome stuff."
Domain Adapting Open Weight Models to Unlock Spotify Catalog Understanding
Divita Vohra & Jacqueline Wood (Spotify)
Spotify's talk focused on making open-weight LLMs "domain-aware" by grounding them in Spotify's unique catalog.
- Entities as Tokens: They introduce structured representations of catalog entities (artists, episodes, audiobooks) using "semantic tokenization" (discretizing embeddings via techniques like LSH into "semantic IDs") and adding these to a fine-tuned LLaMA model's vocabulary.
- Use Cases: This unlocks playlist sequencing, cold-start video recommendations, personalized podcast experiences, recommendation explanations, and semantic search within the catalog.
- Promptable & Steerable: Fine-tuning LLaMA with user histories and goals allows for recommendations that can be steered by user instructions.
- Learnings: There's a clear trade-off between model generalization and semantic ID performance, and optimal training strategies/ID spaces are tightly coupled and task-dependent. They used a Llama 3.2 1B model for their experiments.
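Semantic tokenization via LSH can be sketched with random hyperplanes (one of the discretization techniques mentioned): an entity embedding becomes a short bit signature, which can then be added as a token to the LLM's vocabulary. The dimensions and token format below are illustrative, not Spotify's actual scheme.

```python
import numpy as np

# Toy sketch of "semantic tokenization": random-hyperplane LSH hashes an
# embedding into a bit signature used as a new vocabulary token.

rng = np.random.default_rng(0)
EMB_DIM, NUM_BITS = 8, 6
hyperplanes = rng.standard_normal((NUM_BITS, EMB_DIM))

def semantic_id(embedding):
    """Hash an embedding to a token like '<sem_101100>' via sign bits."""
    bits = (hyperplanes @ embedding > 0).astype(int)
    return "<sem_" + "".join(map(str, bits)) + ">"

track_a = rng.standard_normal(EMB_DIM)
track_b = track_a + 0.01 * rng.standard_normal(EMB_DIM)  # near-duplicate

# Nearby embeddings usually collide into the same semantic token, so the
# fine-tuned LLM sees one shared vocabulary item for similar entities.
print(semantic_id(track_a), semantic_id(track_b))
```

The generalization trade-off from the talk shows up here too: more bits give finer-grained IDs but a larger, sparser vocabulary for the model to learn.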
Evolution of Netflix Recommendations: Unleashing the Power of Multi-task and Foundation Models
Yang Li & Ko-Jen Hsiao (Netflix)
The final Netflix talk detailed their journey to address the scalability challenges of maintaining numerous bespoke personalization algorithms for their "Lolomo" (list of lists of movies) homepage.
- The Challenge of Many Models: Slowed innovation, difficulty transferring improvements, and complexity in feature updates across many pipelines.
- Solution 1: "Hydra" Multi-Task Learning (MTL) Models:
- Consolidating diverse ranking signals and models into a single, shared model that performs multiple tasks (e.g., ranking different rows, videos, games). This simplifies the system and allows for easier integration of new business needs (like live content). They opted for an approach where different tasks are different objectives within the shared model.
- Solution 2: Integration with a Foundation Model (FM):
- Inspired by LLMs, Netflix is building a central FM that learns shared member preferences and item insights from all available data. These insights are then efficiently disseminated across downstream applications (homepage, search, messaging).
- Benefits: Simplification, faster innovation, increased leverage, and reduced redundancy.
- Practical Challenges: Handling diverse inputs, balancing tasks, avoiding negative transfer, and infrastructure considerations for cost-efficient inference.
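The Hydra-style consolidation can be sketched schematically (this is not Netflix's architecture; dimensions and task names are made up): one shared trunk encodes the member/context, and lightweight per-task heads score different surfaces, so adding a new business need means adding a head rather than a whole pipeline.

```python
import numpy as np

# Schematic sketch of multi-task consolidation: a shared trunk plus
# per-task heads, standing in for many formerly separate models.

rng = np.random.default_rng(0)
INPUT_DIM, SHARED_DIM = 16, 8

W_shared = rng.standard_normal((SHARED_DIM, INPUT_DIM))
task_heads = {
    "rank_rows":   rng.standard_normal(SHARED_DIM),
    "rank_videos": rng.standard_normal(SHARED_DIM),
    "rank_games":  rng.standard_normal(SHARED_DIM),
}

def forward(features, task):
    """Shared trunk + per-task head; a new task only adds one head."""
    shared = np.tanh(W_shared @ features)  # representation reused by all tasks
    return float(task_heads[task] @ shared)

features = rng.standard_normal(INPUT_DIM)
scores = {task: forward(features, task) for task in task_heads}
print(scores)
```

The practical challenges listed above (balancing tasks, negative transfer) are exactly the risks of the shared trunk: every task now pulls on the same representation.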
I particularly liked the discussion around using the foundation model as a source of inputs for downstream models. This is something we're building out at Shaped.
If you haven't seen it, here's the blog post about the foundation model that was posted in March: https://netflixtechblog.com/foundation-model-for-personalized-recommendation-1a0bd8e02d39
Overall Themes & Final Thoughts
Across the board, a few key themes emerged:
- LFMs/LLMs are Foundational: They are becoming integral to understanding users, content, and even optimizing the recommendation systems themselves.
- Domain Adaptation is Crucial: General models need to be infused with specific domain knowledge (Spotify's semantic tokens, Instacart's use of search logs) to be truly effective.
- The Rise of Meta-Optimization & Agentic Workflows: Systems like Trace are pioneering how we optimize complex, non-differentiable AI agent workflows.
- Model & System Consolidation: A clear trend towards unifying many specialized models into fewer, more powerful multi-task or foundational models (Netflix's Hydra, Instacart).
- Evaluation is Evolving: From Zachary Lipton's "fuzzy rituals" to Bloomberg's synthetic evaluations, how we measure success in AI products is becoming more nuanced and product-centric.
- Scaling & Efficiency Remain Paramount: Whether it's YouTube's distillation and quantization or Netflix's infrastructure considerations for MTL models, making these powerful systems work efficiently at scale is a constant focus.
The 2025 Netflix Personalization, Search and Recommendation conference was a fantastic look into the bleeding edge of the field. It’s clear that we are in a period of rapid transformation, with new tools, architectures, and even new ways of thinking about ML development emerging at an incredible pace. The future of personalized experiences looks incredibly dynamic and powerful!
