TL;DR
- BM25 fails on multi-word intent queries. Vector retrieval fixes it. Running both together (hybrid search) is the right architecture — but fusing them without rebuilding your stack is the hard part.
- Most ranking models conflate personalization with relevance. Browse behavior dominates training data, so query signals get marginalized. You need query-item relevance and user-item preference as explicitly separate signals.
- Real-time eligibility — delivery zones, geo-dynamic availability — is the constraint that kills every vendor conversation. No static index can own it. The architecture that works layers an indexed approximation with a per-query pass-in filter from your own service.
- Availability (merchant online/offline, inventory state) is a separate signal that requires continuous streaming ingestion, not batch updates.
- All four stages — retrieve, filter, score, reorder — need to execute within a single request under 100ms. Every seam between systems is a latency cost and a maintenance burden.
The Problem Every Marketplace Search Team Eventually Hits
Here’s a pattern we see consistently across on-demand delivery platforms, local commerce marketplaces, and grocery apps: the search team is good. They’ve run experiments. They’ve tuned Elasticsearch to a degree that most engineers never reach. They understand hybrid search conceptually and want to implement it. And yet, they can’t adopt a retrieval vendor — not because vendors aren’t technically capable, but because their business logic makes clean retrieval integration structurally impossible with most systems.
The constraint varies by platform — real-time delivery eligibility, geo-dynamic availability, live inventory — but the underlying problem is the same. The retrieval system and the operational data that determines what can actually be shown to a user right now live in fundamentally different time domains. No amount of indexing fully bridges that gap.
This post is a detailed breakdown of why marketplace search is genuinely hard, what the industry’s best engineering teams have learned building these systems at scale, and what a retrieval and ranking architecture that handles all of it actually looks like.
Why Marketplace Search Is Structurally Different From E-Commerce Search
Most search systems assume a stable catalog. A product exists, it has attributes, you index it and retrieve against it. The catalog changes slowly enough that a 15-minute batch sync works fine.
Marketplace and on-demand platforms don’t work this way. The eligible catalog at any given moment — the set of merchants or items that can actually be surfaced to a specific user — is a function of real-time operational state the retrieval system can’t own. On a food delivery platform, what’s deliverable to your address is computed by a logistics model factoring in driver positions, restaurant capacity, and estimated delivery time. That model took years to build and isn’t something you’d replicate inside a retrieval system. On a grocery platform, item availability changes continuously across thousands of store locations. On a local services platform, merchant status depends on staffing and operational mode, not just catalog membership.
This creates a hard architectural constraint: any retrieval system that can’t incorporate real-time business logic without sacrificing latency or correctness is a non-starter, regardless of how good its semantic relevance is.
The teams building these platforms know this. It’s why vendor conversations keep dying on the same question: how does your system respect our eligibility rules?
The Four Root Causes of Marketplace Search Failure
1. BM25 Fails on Multi-Word and Natural Language Queries
Elasticsearch’s BM25 is excellent at keyword matching. For single-term queries — a cuisine type, a brand name, a category — it performs well. The failure is specific and predictable: multi-word, intent-rich queries get scored by token frequency rather than meaning. Modifier tokens that carry the user’s actual intent get outweighed by the primary noun.
A search for “spicy vegetarian ramen” scores high on “ramen” and may surface dishes that are neither spicy nor vegetarian. A search for “high protein lunch under 800 calories” returns every merchant with “lunch” in their metadata regardless of nutritional content. Teams respond with tuning — synonym expansion, custom analyzers, field boosting, scripted scoring — which helps at the margins but doesn’t address the root cause: BM25 has no model of semantic meaning. “Vegan noodle bowl with chili oil” and “spicy vegetarian ramen” will never be close in a BM25 index, regardless of how you tune it.
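The token-overlap failure is easy to demonstrate. Below is a minimal textbook Okapi BM25 scorer in plain Python (the documents are invented for illustration). The semantically closest document scores exactly zero because it shares no tokens with the query:

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs, k1=1.5, b=0.75):
    """Textbook Okapi BM25: the score is driven entirely by token overlap."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = Counter(t for d in docs for t in set(d))  # document frequency per term
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query_tokens:
            if tf[t] == 0:
                continue  # no token overlap: the term contributes nothing
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

docs = [
    "spicy miso ramen with tofu".split(),        # matches the intent
    "classic tonkotsu ramen house".split(),      # matches only the head noun
    "vegan noodle bowl with chili oil".split(),  # semantically closest, zero overlap
]
scores = bm25_scores("spicy vegetarian ramen".split(), docs)
print(scores)  # the vegan noodle bowl scores 0.0 despite matching the intent
```

No amount of parameter tuning changes the last score: a term that never appears in a document contributes nothing, which is exactly the gap vector retrieval closes.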
This is a documented, production problem at scale. DoorDash’s engineering team published a detailed account of this exact failure mode. They built a dedicated LLM-based query understanding layer specifically because BM25 failed on dish-level semantic queries. Their improved retrieval substantially increased the trigger rate for their popular dish carousels — and because better retrieval produced a richer candidate set, they were able to retrain their ranker against improved data, yielding a confirmed 1.6% lift in order conversion as a downstream effect. The retrieval failure was the root cause; the ranking gain was only unlocked once retrieval was fixed.
Uber Eats encountered this at the zero-result extreme: a user searching for “tan tan noodles” with no restaurant offering that exact dish would get no results under BM25. Their fix was a semantic query expansion system built on a food knowledge graph and representation learning, which expanded the query to related concepts and retrieved semantically relevant restaurants that token matching made invisible.
The right retrieval architecture is hybrid: BM25 for exact-match and navigational queries, vector ANN for semantic and natural language queries, candidates from both fused with configurable weights per query type. The engineering complexity is not running either system in isolation — it’s running both together and tuning the fusion ratio without rebuilding infrastructure every time you want to experiment.
The urgency is increasing. LLM-powered interfaces have changed what users expect from a search box: they now routinely type full sentences, expecting intent-aware responses. Platforms whose retrieval can't handle natural language don't just return poor results; they teach learned helplessness. A user who gets a bad result once adapts their behavior permanently and stops using search for anything exploratory. This behavioral shift is nearly invisible in standard search metrics because the denominator, users who attempt natural language queries, shrinks alongside the numerator.
2. Ranking Models Conflate Personalization With Query Relevance
Many marketplace teams have a ranking model in production. It’s trained on clicks and orders, it captures genuine user preference, and it works well for browse. The problem is that the same model gets used for search — and in that context, personalization signals dominate in a way that effectively erases query intent.
A model trained jointly on browse and search interactions learns that user history is a far stronger predictor of engagement than query-item relevance. The query signal exists as an input feature, but its effective weight collapses to near zero during training because personalization features have higher variance across the full mixed dataset. The result: a user with a history of Italian orders who searches for Japanese food gets Italian results — because that’s what the training data says this user clicks on.
This is a training data distribution problem, not a model architecture problem. A sophisticated model that formally accepts query text as a feature can functionally ignore it if the personalization features dominate the gradient signal.
Uber Eats documented exactly this failure in their recommender system. A personalized relevance model trained primarily on engagement data would determine that a user who repeatedly ordered ramen should continue to see ramen results — regardless of their actual search query. Their solution required explicit multi-objective optimization to give query-item relevance signals appropriate independent weight.
The correct fix is treating query-item relevance and user-item preference as separately weighted signals, not a single model where personalization can override intent:
```
final_score = α × query_relevance(query, item) + β × user_preference(user, item)
```
Where α is explicitly calibrated to be meaningful for intentful search contexts, not collapsed toward zero by a joint training objective that browse behavior dominates.
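A toy worked example (the numbers are invented) shows why the calibration matters. With α collapsed toward zero, stored preference overrides the query; with a meaningful α, intent wins:

```python
def final_score(alpha, beta, query_relevance, user_preference):
    """Separately weighted blend: query-item relevance vs. user-item preference."""
    return alpha * query_relevance + beta * user_preference

# A user with heavy Italian history searches for Japanese food.
italian  = {"query_relevance": 0.1, "user_preference": 0.9}
japanese = {"query_relevance": 0.9, "user_preference": 0.2}

# Joint-training failure mode: alpha collapsed, personalization wins.
print(final_score(0.05, 0.95, **italian) > final_score(0.05, 0.95, **japanese))   # True

# Calibrated for intentful search contexts: query relevance wins.
print(final_score(0.7, 0.3, **japanese) > final_score(0.7, 0.3, **italian))       # True
```

The two weight settings produce opposite orderings over the same items, which is why α must be set by the search context rather than learned from a browse-dominated training mix.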
3. Real-Time Eligibility Filtering Breaks Every Naive Retrieval Architecture
This is the constraint that ends most vendor conversations. On an on-demand delivery platform, the eligible catalog per user is computed in real time from a logistics model — driver positions, restaurant capacity, estimated delivery times, and geo precision fine enough that adjacent addresses can have meaningfully different eligible sets. That computation is proprietary, too complex to replicate externally, and too dynamic to pre-cache completely.
The consequence is that eligibility filtering has to happen inside one API call. Pushing per-user eligible sets into the index ahead of time is infeasible at this precision and update rate, and post-filtering an over-fetched candidate list either destroys recall or blows the latency budget. Naive retrieval architectures break on exactly this constraint.
Uber Eats encountered the geo-precision version of this problem when scaling their delivery search. Their H3-based hexagonal delivery zone mapping was efficient for lookups, but upstream ingestion labeled stores as “nearby” or “far” without passing actual ETA data. Their ranker treated a 5-minute and a 30-minute delivery as identical candidates, surfacing popular distant chains over closer, faster options. The fix required restructuring ingestion to propagate ETA signals and updating the ranking layer to use them — a multi-team project that stemmed from a single data pipeline gap.
4. Availability Is a Separate Real-Time Signal That Needs Its Own Ingestion Path
Distinct from delivery eligibility, item and merchant availability changes at a different layer and at high frequency: merchants go offline, items sell out, capacity fills. This signal can’t be treated as a batch catalog update.
Instacart documented this failure in detail. Their Elasticsearch-based search infrastructure received billions of writes per day from price and availability changes. The indexing load degraded read performance to the point where stale data corrections could take days to propagate — in a catalog where prices and stock status change multiple times daily, this directly caused dead-end searches. They ultimately migrated off Elasticsearch entirely, citing the structural incompatibility between their availability write workload and Elasticsearch’s indexing model. The migration produced an 80% reduction in indexing costs and eliminated the stale availability problem.
Their solution to the hot-path availability problem is instructive: calling their Real-Time Availability ML model via RPC during retrieval added too much latency. Instead, they pre-store availability scores in a queryable database joined at retrieval time, with a lazy refresh mechanism that updates scores when an item appears in results and the cached score exceeds a staleness threshold.
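A minimal synchronous sketch of that lazy-refresh pattern follows. The class name, threshold, and in-process dict are all illustrative; the production version described by Instacart is a queryable store with asynchronous refresh:

```python
import time

class AvailabilityCache:
    """Serve pre-stored availability scores; refresh only when an item
    surfaces in results and its cached score is older than the threshold."""

    def __init__(self, score_model, staleness_s=300):
        self.score_model = score_model   # expensive ML model (an RPC in production)
        self.staleness_s = staleness_s
        self._cache = {}                 # item_id -> (score, fetched_at)

    def score(self, item_id, now=None):
        now = time.time() if now is None else now
        entry = self._cache.get(item_id)
        if entry is None or now - entry[1] > self.staleness_s:
            # Lazy refresh, triggered only by the item appearing in results.
            entry = (self.score_model(item_id), now)
            self._cache[item_id] = entry
        return entry[0]
```

At retrieval time the ranker joins `score(item_id)` against each candidate; items nobody retrieves never pay the refresh cost, which is what keeps the model off the hot path.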
The broader lesson: availability signals require a continuously-updating ingestion path, decoupled from the retrieval index update cycle, with a serving architecture that allows the availability layer to be queried at retrieval speed.
The Retrieval Architecture That Handles All Four
The pipeline that works for a marketplace has four stages executing within a single request. Every stage maps directly to one of the problems above.
| Stage | What It Does | Problem It Solves |
|---|---|---|
| 1. Retrieve | BM25 + vector ANN in parallel, candidates fused with configurable weights | Multi-word and natural language query failure |
| 2. Filter | Real-time eligibility + availability applied inline at query time | Eligibility correctness; stale availability |
| 3. Score | Query-aware cross-encoder + personalized CTR/conversion model, separately weighted | Ranker conflating relevance and personalization |
| 4. Reorder | Diversity across the full result set | Carousel over-concentration; tail-catalog suppression |
The total round-trip across all four stages must target under 100ms. This is the latency budget. Every architectural decision — which embedding model, how to integrate the eligibility service, how much candidate over-fetch — gets made against this number.
Stage 1: Hybrid Retrieval
Shaped runs BM25 (via Tantivy, a Rust-native search library in the Lucene lineage, ~5-6x faster than standard Lucene) and vector ANN (via LanceDB, deployed distributed) simultaneously. Candidates from both pools are fused in a single query with fully configurable blend weights. A brand-name navigational query weights BM25 heavily; a natural language query like “something warm and filling for a cold evening” weights vector retrieval heavily.
The vector embedding model is selectable from any Hugging Face encoder and fine-tunable on your domain catalog. In-domain fine-tuning on a food or grocery catalog meaningfully outperforms a generic sentence transformer — the embedding space learns the semantic relationships specific to your vertical.
```sql
-- Hybrid retrieval: BM25 + vector in a single query, weights configurable
SELECT *
FROM text_search(query='$query_text', mode='vector',
                 text_embedding_ref='catalog_embedding', limit=100,
                 name='semantic'),
     text_search(query='$query_text', mode='lexical', fuzziness=2, limit=100,
                 name='keyword')
ORDER BY score(
    expression='0.6 * retrieval.semantic + 0.3 * retrieval.keyword + 0.1 * click_through_rate',
    input_user_id='$user_id'
)
LIMIT 20
```
Stage 2: Eligibility and Availability Filtering
Shaped supports two integration patterns for real-time eligibility, designed to be layered based on your eligibility system’s architecture:
Pass-in filter: Your system computes the eligible set at request time (by calling your own eligibility service) and passes it as a WHERE clause directly to the Shaped query. Retrieval executes over the eligible set only — not the full catalog.
```sql
-- Eligibility enforced at query time via pass-in filter
SELECT *
FROM text_search(query='$query_text', mode='vector',
                 text_embedding_ref='catalog_embedding', limit=50)
WHERE item_id IN ($eligible_ids)
ORDER BY score(expression='click_through_rate', input_user_id='$user_id')
LIMIT 10
```
Indexed approximation + post-filter: For cases where the eligible set is too large to pass per-request, Shaped maintains a geo-regional eligibility approximation via streaming connectors (Kafka, Kinesis — data lands within ~30 seconds of change events). Retrieval narrows candidates using the approximation; your real-time eligibility service applies an exact final filter over the reduced set. This pattern avoids the over-retrieval problem of pure post-filtering and the payload problem of pure pre-filtering.
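The layered pattern reduces to a small control flow. Here is a sketch; every function name is a stand-in for one of your own services, not a Shaped API:

```python
def eligible_search(query, user, retrieve, approx_region_of, exact_eligible,
                    k=10, overfetch=5):
    """Stage 1: narrow candidates with the indexed geo approximation.
    Stage 2: apply the exact real-time eligibility filter over the small set."""
    region = approx_region_of(user)                       # coarse, streamed, ~30s fresh
    candidates = retrieve(query, region=region, limit=k * overfetch)
    eligible_ids = exact_eligible(user, [c["id"] for c in candidates])
    return [c for c in candidates if c["id"] in eligible_ids][:k]
```

Over-fetching by a small factor bounds the recall loss from the exact filter without paying the payload cost of passing the full eligible set on every request.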
Availability signals — merchant online/offline state, item inventory — use the same streaming ingestion path and update continuously.
Stage 3: Query-Aware Re-Ranking
Shaped’s scoring layer supports explicitly separated query-item relevance and user-item preference signals — not a single joint model where personalization can dominate:
```sql
-- Query relevance and personalization as separately weighted signals
ORDER BY score(
    expression='0.5 * colbert_v2(query=$query_text, item=item)
              + 0.3 * click_through_rate
              + 0.2 * conversion_rate',
    input_user_id='$user_id',
    input_interactions_item_ids='$interaction_item_ids'
)
```
The colbert_v2() or cross_encoder() term scores query-item relevance directly and independently. The CTR and conversion terms come from models trained on your interaction data. The blend weights are explicit and tunable — changing them is a config update, not a model retrain.
Shaped trains CTR and conversion models automatically when you connect an interaction table. Evaluation runs against held-out test sets with standard IR metrics (NDCG@k, Recall@k, Precision@k), segmented by user cohort and query type.
Stage 4: Result Diversity
For carousel-based home pages, the REORDER BY diversity() step prevents result concentration — ensuring the same merchant type doesn’t dominate across multiple carousels, and that tail-catalog merchants receive proportional exposure within score-competitive ranges.
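Mechanically, a diversity reorder is often a greedy pass that penalizes categories already selected. The sketch below is not Shaped's diversity() implementation, just the generic shape of the technique, with invented items:

```python
def diversify(ranked, category_of, penalty=0.3):
    """Greedy reorder: each pick subtracts a penalty proportional to how many
    items of the same category have already been selected."""
    pool = dict(ranked)   # item_id -> relevance score
    picked_per_cat = {}
    out = []
    while pool:
        best = max(pool, key=lambda i: pool[i]
                   - penalty * picked_per_cat.get(category_of(i), 0))
        out.append(best)
        del pool[best]
        cat = category_of(best)
        picked_per_cat[cat] = picked_per_cat.get(cat, 0) + 1
    return out

cats = {"pizza_a": "pizza", "pizza_b": "pizza", "thai_a": "thai"}
ranked = [("pizza_a", 0.95), ("pizza_b", 0.90), ("thai_a", 0.80)]
print(diversify(ranked, cats.get))  # ['pizza_a', 'thai_a', 'pizza_b']
```

The penalty only reorders within score-competitive ranges: a strongly relevant item still surfaces, but near-ties break toward categories not yet shown.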
Why a Unified System Is the Actual Differentiator
Running all four stages as a unified system — rather than four separately owned tools — is where the real engineering value is. Every seam between systems is a latency cost, an operational failure mode, and a maintenance surface.
Building the equivalent in-house means: custom scripted scoring functions in Elasticsearch for hybrid fusion, separate vector index management, a distinct ML pipeline and feature store for re-ranking, and a custom serving layer to tie the eligibility integration together. The total system is expensive to build, operationally fragile, and slow to experiment with. Every change to fusion weights, embedding models, or ranking signals requires touching multiple systems.
Shaped packages retrieval, filtering, scoring, and reranking in a single declarative query. Changing a blend weight, swapping an embedding model, or adding a filter is a config update — not a multi-service deployment. For teams that want to move fast on experimentation — different fusion ratios, different encoders, different ranking signal combinations — this is the difference between a one-hour change and a two-week engineering project.
The Real Business Cost of Getting This Wrong
The business case for fixing marketplace search isn’t about search quality metrics in isolation. It’s about a compounding behavioral loop that’s hard to detect and harder to reverse.
Users who try natural language queries and get irrelevant results don't keep trying; they learn that the search box only handles simple inputs and adapt permanently. At scale, search traffic concentrates toward navigational queries (the easiest case, which BM25 handles acceptably) while exploratory-intent users self-select out.
The effect is nearly invisible. As fewer users attempt natural language search, the numerator (successful NL searches) and the denominator (total NL attempts) fall together, so the success rate looks stable in dashboards even as the funnel narrows. You cannot measure what users stopped trying.
The engineering investment question also isn’t just about initial build time. A full hybrid retrieval + query-aware re-ranking + eligibility integration stack carries ongoing ownership costs: tuning fusion weights as the catalog evolves, retraining ranking models as user behavior shifts, debugging latency regressions across a multi-stage pipeline, keeping eligibility integration current as the logistics model changes. These are continuous maintenance commitments, not one-time projects. Every sprint spent on retrieval infrastructure is a sprint not spent on features that differentiate the product.
What a Meaningful Evaluation Looks Like
The only useful evaluation of a retrieval system for a marketplace is against your own data. Public benchmarks like MS MARCO don’t have your catalog, your query distribution, your user interaction patterns, or your eligibility constraints. DoorDash’s engineering team made this point directly: the nuances of their popular dish retrieval only became clear once evaluated against their actual production query logs.
An effective proof of concept with Shaped involves:
- Ingesting a sample of your merchant catalog and interaction history
- Configuring the engine to your target hybrid blend, embedding model, and ranking signals
- Evaluating against your known failure cases — the multi-word queries where current retrieval breaks, the eligibility edge cases, the carousels where result quality is already known to be poor
- Seeing the delta on queries that matter to your product, not a benchmark number on a public dataset
We bring the retrieval infrastructure and ML expertise. You bring the domain knowledge and the failure cases. That combination produces a meaningful signal on whether this architecture fits your constraints — typically within two to three weeks of starting.
Book a technical evaluation with the Shaped engineering team →
Related Reading
- How Shaped handles real-time ranking for marketplace feeds →
- ShapedQL documentation: hybrid retrieval and scoring expressions →
- Case study: improving search recall for a local commerce platform →
Shaped is a real-time retrieval database for search, recommendations, and ranking. Define retrieval engines declaratively, connect data sources via 20+ connectors including Kafka and Kinesis, and serve queries via REST with a sub-100ms latency target. See the docs →
Primary keyword: marketplace search architecture
Secondary keywords: hybrid search food delivery, real-time eligibility filtering search, BM25 vector search on-demand, retrieval ranking pipeline ecommerce, on-demand delivery search relevance
Meta description: Why BM25, naive eligibility filtering, and joint ranking models fail for on-demand delivery and marketplace search — and the four-stage retrieval architecture that handles all of it, with examples from DoorDash, Uber Eats, and Instacart.