Sub-100ms Discovery: Why Retrieval Speed is the Agent Bottleneck

If your retrieval takes 2 seconds and your LLM takes 3, your agent is unusable. Learn why sub-100ms retrieval is critical for production agents and how Shaped's Redis-backed fast_tier architecture delivers ranked queries at scale without sacrificing accuracy.

Quick Answer: Why 2 Seconds Feels Like Forever

An AI agent’s response time is the sum of its parts: retrieval latency + LLM inference + any post-processing. If your retrieval layer takes 2 seconds to find relevant context, and your LLM takes 3 seconds to generate a response, your user waits 5 seconds. That’s unusable.

Users abandon agents that take longer than 2 seconds to respond. Google found that a 500ms delay in search results led to a 20% drop in traffic. For conversational agents, the threshold is even lower — anything over 1-2 seconds feels broken.

Key Takeaways:

  • Retrieval is often the bottleneck — LLM inference is getting faster (sub-second with optimized models), but retrieval hasn’t kept pace
  • Sub-100ms retrieval unlocks real-time agents — When context retrieval is near-instant, the LLM becomes the only meaningful latency
  • Speed vs accuracy is a false trade-off — Redis-backed architectures with approximate nearest neighbor (ANN) indexes deliver both
  • The four-stage pipeline adds up — Retrieve, filter, score, reorder — each stage contributes latency that compounds
  • Production scale requires dedicated infrastructure — General-purpose vector DBs hit latency ceilings under load

Time to read: 22 minutes | Includes: 6 code examples, 3 architecture diagrams, 2 comparison tables


Table of Contents

  1. The Latency Math Problem
  2. Why Retrieval is the Bottleneck
  3. The Four-Stage Pipeline Tax
  4. Part 1: The Traditional Approach
  5. Part 2: The Shaped Way — Fast Tier Architecture
  6. Speed vs Accuracy: Not a Trade-off
  7. Comparison Table
  8. FAQ

The Latency Math Problem

Imagine a customer support agent that helps users track orders. A customer asks:

User: “Where is my order?”

Here’s what happens behind the scenes:

1. Agent receives query → 0ms baseline
2. Retrieval layer searches for order context → 1,800ms
3. LLM generates response with retrieved context → 2,500ms
4. Agent returns response → Total: 4,300ms

4.3 seconds. The user has already opened a new tab to check their order manually.

The User Experience Threshold

Research on response time tolerance is consistent across studies:

| Latency Range | User Perception |
|---|---|
| < 100ms | Feels instant |
| 100–300ms | Perceptible delay, but acceptable |
| 300ms–1s | Noticeable lag, user stays engaged |
| 1–3s | User attention wavers, feels slow |
| > 3s | User abandons, considers the system broken |

For conversational agents, the threshold is even stricter. Chat interfaces set an expectation of real-time responsiveness. A 5-second delay feels like the agent crashed.

The Latency Budget

If you have a 2-second budget for total response time (the upper end of what users will tolerate), here’s how it breaks down:

Total budget: 2,000ms

LLM inference (GPT-4 Turbo, optimized): 800-1,200ms
Post-processing (formatting, safety checks): 50-100ms

Remaining for retrieval: 700-1,150ms

If your retrieval layer takes 1,500ms, you’ve already blown the budget before the LLM even starts.
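The budget arithmetic above can be sketched directly. The ranges are the article’s illustrative figures, not measurements:

```python
# Sketch: remaining retrieval budget once LLM inference and post-processing
# are accounted for. Figures are the article's illustrative ranges.

TOTAL_BUDGET_MS = 2000

llm_inference_ms = (800, 1200)   # GPT-4 Turbo, optimized
post_processing_ms = (50, 100)   # formatting, safety checks

def retrieval_budget(total, llm, post):
    """Return (best, worst) ms left for retrieval."""
    best = total - llm[0] - post[0]    # everything else at its fastest
    worst = total - llm[1] - post[1]   # everything else at its slowest
    return best, worst

best, worst = retrieval_budget(TOTAL_BUDGET_MS, llm_inference_ms, post_processing_ms)
print(f"Retrieval budget: {worst}-{best}ms")  # Retrieval budget: 700-1150ms
```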


Why Retrieval is the Bottleneck

LLMs Are Getting Faster

LLM inference latency has dropped dramatically:

  • GPT-3.5 Turbo: 500-800ms for typical responses
  • GPT-4 Turbo (optimized): 800-1,200ms
  • Claude 3 Haiku: 300-600ms
  • Llama 3 (self-hosted, optimized): 200-500ms

With techniques like speculative decoding, batching, and quantization, inference latency continues to improve. Sub-second LLM responses are now standard for production-optimized deployments.

Retrieval Hasn’t Kept Pace

Meanwhile, retrieval latency for vector databases remains stubbornly high:

Typical vector DB latencies (under load):

  • Pinecone (serverless tier): 150-500ms per query
  • Weaviate (cloud): 200-800ms per query
  • Qdrant (self-hosted): 100-400ms per query
  • Chroma (embedded): 50-300ms per query

These are per-query latencies for a single vector search. Production agents often require:

  • Multiple retrieval calls (search orders, search products, search knowledge base)
  • Filtering by user ID, date range, or other attributes
  • Re-ranking with cross-encoders or scoring models
  • Hybrid search (vector + keyword)

Each additional operation compounds latency.

The Real-World Impact

Here’s what a “simple” customer support query actually looks like:

Query: “Where is my order for the blue dress I bought last week?”

Required retrievals:

  1. Search user’s orders (vector search on order history) → 250ms
  2. Filter orders by date range (last 7 days) → 50ms
  3. Search products matching “blue dress” (vector search on catalog) → 300ms
  4. Cross-reference order items with product search → 100ms
  5. Re-rank by relevance + recency → 150ms

Total retrieval latency: 850ms

Add LLM inference (1,000ms) and you’re at 1,850ms. That’s before any post-processing, safety checks, or retries.

Why Traditional Vector DBs Are Slow

1. Network round trips

Vector DBs are separate services. Every query is a network call:

Agent → Network → Vector DB → Network → Agent

Even on low-latency networks, this adds 10-50ms per round trip. Multi-stage queries (retrieve, filter, re-rank) multiply this overhead.

2. Cold starts and connection pooling

Serverless vector DBs suffer from cold start latency (100-500ms for the first query after idle). Connection pooling helps, but pool exhaustion under load forces new connection handshakes.

3. Index structure overhead

HNSW (Hierarchical Navigable Small World) indexes — the most common ANN algorithm — require multiple graph traversals. At scale (millions of vectors), even optimized HNSW queries take 100-300ms.

4. Compute-storage separation

Cloud vector DBs separate compute from storage (S3, GCS). Fetching vectors from remote storage adds latency, especially for large result sets.


The Four-Stage Pipeline Tax

Production retrieval isn’t a single vector search. It’s a multi-stage pipeline:

The four-stage retrieval pipeline: Retrieve, Filter, Score, Reorder

Each stage adds latency.

Query: “red floral dresses under $100”

Stage 1: Retrieve candidates (vector search)

  • Embed query: 20ms
  • Search product embedding index: 200ms
  • Stage 1 total: 220ms

Stage 2: Filter

  • Apply price filter (< $100): 30ms
  • Apply stock availability filter: 20ms
  • Stage 2 total: 50ms

Stage 3: Score

  • Retrieve user’s purchase history: 100ms
  • Score with personalization model: 80ms
  • Stage 3 total: 180ms

Stage 4: Reorder

  • Apply diversity (avoid showing same brand 3x): 40ms
  • Promote trending items: 30ms
  • Stage 4 total: 70ms

Total pipeline latency: 520ms
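The per-stage figures above can be totaled up in a few lines (numbers are the article’s illustrative values, not measurements):

```python
# Sketch: summing the four-stage pipeline figures from the walkthrough above.
stages = {
    "retrieve": 20 + 200,   # embed query + ANN search
    "filter":   30 + 20,    # price filter + stock filter
    "score":    100 + 80,   # purchase-history lookup + personalization model
    "reorder":  40 + 30,    # diversity + trending boost
}

total_ms = sum(stages.values())
print(f"Pipeline total: {total_ms}ms")  # Pipeline total: 520ms
for name, ms in stages.items():
    print(f"  {name}: {ms}ms ({ms / total_ms:.0%} of pipeline)")
```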

This is for a single search. If the agent needs to search both products and reviews, double it. Add another retrieval for user’s order history? Now you’re at 1,500ms+.


Part 1: The Traditional Approach (Pinecone / Weaviate)

The standard architecture for agent retrieval uses a managed vector database as a separate service. You embed your data, upsert vectors to the vector DB, and query it at runtime.

Architecture

[Agent Query]
[Embedding Service]
Embed query text (20-50ms)
[Network Round Trip]
HTTP/gRPC call to vector DB (10-50ms)
[Vector DB: Pinecone / Weaviate / Qdrant]
Stage 1: ANN search on HNSW index (100-300ms)
Stage 2: Apply metadata filters (20-80ms)
Stage 3: Fetch full records from storage (50-150ms)
Stage 4: Return results
[Network Round Trip]
Return to agent (10-50ms)
[Agent]
LLM call with retrieved context

Total retrieval latency: 210-680ms (best case, single query, no re-ranking)

Implementation

Step 1: Index documents in Pinecone

# index_documents_pinecone.py
from pinecone import Pinecone
from sentence_transformers import SentenceTransformer

# Initialize (pinecone client v3+ API; the old pinecone.init/environment style is removed)
pc = Pinecone(api_key="your-api-key")
index = pc.Index("product-catalog")

# Embed and upsert documents
embedder = SentenceTransformer('all-MiniLM-L6-v2')

products = [
    {"id": "prod-001", "name": "Navy Blue Floral Dress", "price": 89.99, "category": "Dresses"},
    {"id": "prod-002", "name": "Red Maxi Dress", "price": 124.99, "category": "Dresses"},
    # ... thousands more
]

vectors = []
for product in products:
    embedding = embedder.encode(product['name']).tolist()
    vectors.append({
        "id": product['id'],
        "values": embedding,
        "metadata": {
            "name": product['name'],
            "price": product['price'],
            "category": product['category']
        }
    })

# Upsert in batches
batch_size = 100
for i in range(0, len(vectors), batch_size):
    index.upsert(vectors=vectors[i:i+batch_size])

Step 2: Query at runtime

# agent_retrieval_pinecone.py
import time
from pinecone import Pinecone
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer('all-MiniLM-L6-v2')
index = Pinecone(api_key="your-api-key").Index("product-catalog")

def retrieve_context(query: str, filters: dict = None, top_k: int = 10):
    """
    Retrieve relevant products for agent context.
    """
    start = time.time()

    # Embed query
    query_embedding = embedder.encode(query).tolist()
    embed_time = time.time() - start

    # Query Pinecone
    query_start = time.time()
    results = index.query(
        vector=query_embedding,
        top_k=top_k,
        filter=filters,  # e.g., {"price": {"$lt": 100}}
        include_metadata=True
    )
    query_time = time.time() - query_start

    total_time = time.time() - start

    print(f"Embedding: {embed_time*1000:.0f}ms | Query: {query_time*1000:.0f}ms | Total: {total_time*1000:.0f}ms")

    products = []
    for match in results['matches']:
        products.append({
            'id': match['id'],
            'name': match['metadata']['name'],
            'price': match['metadata']['price'],
            'score': match['score']
        })

    return products


# Usage
query = "red floral dress under $100"
filters = {"price": {"$lt": 100}}
products = retrieve_context(query, filters=filters, top_k=20)

# Typical output:
# Embedding: 45ms | Query: 285ms | Total: 330ms

Step 3: Multi-stage pipeline with re-ranking

# multi_stage_retrieval.py
import time

def multi_stage_retrieval(query: str, user_id: str):
    """
    Four-stage pipeline: retrieve, filter, score, reorder.
    The helper functions (get_user_preferences, get_user_history,
    score_with_personalization, apply_diversity) stand in for your
    own data lookups and scoring services.
    """
    start = time.time()

    # Stage 1: Retrieve candidates (vector search)
    stage1_start = time.time()
    candidates = retrieve_context(query, top_k=100)  # ~330ms
    stage1_time = time.time() - stage1_start

    # Stage 2: Filter (apply business rules)
    stage2_start = time.time()
    in_stock = [p for p in candidates if p.get('in_stock', True)]
    user_prefs = get_user_preferences(user_id)  # +150ms for DB query
    filtered = [p for p in in_stock if p.get('brand') in user_prefs.get('brands', [])]
    stage2_time = time.time() - stage2_start

    # Stage 3: Score (personalization model)
    stage3_start = time.time()
    user_history = get_user_history(user_id)  # +200ms for DB query
    scored = score_with_personalization(filtered, user_history)  # +120ms model inference
    stage3_time = time.time() - stage3_start

    # Stage 4: Reorder (diversity, trending)
    stage4_start = time.time()
    reordered = apply_diversity(scored)  # +60ms
    stage4_time = time.time() - stage4_start

    total_time = time.time() - start

    print(f"Stage 1 (retrieve): {stage1_time*1000:.0f}ms")
    print(f"Stage 2 (filter):   {stage2_time*1000:.0f}ms")
    print(f"Stage 3 (score):    {stage3_time*1000:.0f}ms")
    print(f"Stage 4 (reorder):  {stage4_time*1000:.0f}ms")
    print(f"Total:              {total_time*1000:.0f}ms")

    return reordered[:10]

# Typical output:
# Stage 1 (retrieve): 330ms
# Stage 2 (filter):   180ms
# Stage 3 (score):    320ms
# Stage 4 (reorder):   60ms
# Total:              890ms

What You’re Operating

| Component | What It Is | Latency Contribution |
|---|---|---|
| Embedding service | SentenceTransformer model (CPU/GPU) | 20–50ms per query |
| Vector DB (Pinecone) | Managed service, network round trip | 10–50ms (network) + 100–300ms (ANN search) |
| Metadata filters | Applied during vector search | +20–80ms |
| User preference lookup | PostgreSQL query for user data | +100–200ms |
| Personalization model | Separate service or in-process | +80–150ms |
| Diversity reordering | In-process algorithm | +40–80ms |

Total: 370-910ms for a single multi-stage query.

The cost:

  • Latency: 370-910ms per query (compounds with multiple retrievals)
  • Infrastructure: Managed vector DB ($0.10-0.50 per million queries) + embedding service + personalization service
  • Code to maintain: ~600 lines (embedding pipeline, query orchestration, filtering logic, re-ranking)
  • Operational complexity: Monitor vector DB health, manage connection pools, handle cold starts

Part 2: The Shaped Way — Fast Tier Architecture

Shaped’s fast_tier deployment uses Redis-backed storage with pre-computed indexes to deliver fast retrieval. The entire four-stage pipeline (retrieve, filter, score, reorder) runs in a single, optimized query.

Architecture

Shaped fast_tier architecture: single API call replacing four-stage pipeline

Key differences:

  • No embedding service — Embeddings are pre-computed and stored in Redis
  • No network round trips — The entire pipeline runs in-process
  • No separate scoring service — Model predictions are materialized and cached
  • No external database lookups — User history and preferences are indexed in Redis

Implementation

Step 1: Define engine with fast_tier

# fast_product_engine.yaml
version: v2
name: product_search_fast
data:
  item_table:
    name: products_enriched
    type: table
  interaction_table:
    name: user_views
    type: table
encoder:
  name: text-embedding-3-small
  provider: openai
  columns:
    - name: product_name
      weight: 0.6
    - name: image_description
      weight: 0.4
training:
  models:
    - name: personalization_model
      policy_type: elsa
      strategy: early_stopping
deployment:
  data_tier: fast_tier  # ← Redis-backed storage
  server:
    worker_count: 4
  online_store:
    interaction_max_per_user: 50
    interaction_expiration_days: 30
Deploy the engine:

shaped create-engine --file fast_product_engine.yaml

What this does:

  • data_tier: fast_tier provisions Redis-backed storage for all indexes
  • Pre-computes embeddings for all products (done at indexing time, not query time)
  • Materializes personalization model predictions in Redis
  • Caches user interaction history in the online store (in-memory)

Step 2: Query the engine

# agent_retrieval_shaped.py
import requests
import time

SHAPED_API_KEY = "your-api-key"

def retrieve_context(query: str, user_id: str, filters: dict = None, limit: int = 10):
    """
    Single API call for full retrieval pipeline.
    """
    start = time.time()

    response = requests.post(
        "https://api.shaped.ai/v2/rank",
        headers={"x-api-key": SHAPED_API_KEY},
        json={
            "engine_name": "product_search_fast",
            "query": query,
            "user_id": user_id,
            "candidates": {
                "table": "products_enriched",
                "filter": filters  # e.g., "price < 100 AND in_stock = true"
            },
            "limit": limit,
            "use_model": "personalization_model"
        }
    )

    latency = (time.time() - start) * 1000
    print(f"Total retrieval latency: {latency:.0f}ms")

    results = response.json()

    products = []
    for result in results['results']:
        products.append({
            'id': result['product_id'],
            'name': result['product_name'],
            'price': result['price'],
            'score': result['_score']
        })

    return products

# Usage
query = "red floral dress under $100"
user_id = "user-9472"
filters = "price < 100 AND category = 'Dresses' AND in_stock = true"
products = retrieve_context(query, user_id, filters=filters, limit=20)

# Typical output:
# Total retrieval latency: 42ms

Latency breakdown (internal):

Stage 1 (retrieve): ANN search on Redis-backed index → 12ms
Stage 2 (filter):   Apply filters on indexed metadata → 6ms
Stage 3 (score):    Fetch materialized model scores   → 14ms
Stage 4 (reorder):  Diversity algorithm                → 8ms
──────────────────────────────────────────────────────────
Total:                                                   40ms

Step 3: Multi-query agent workflow

# agent_multi_query.py
import time

def agent_search(user_query: str, user_id: str):
    """
    Agent needs context from multiple sources.
    """
    start = time.time()

    # Query 1: Search products
    products = retrieve_context(
        query=user_query,
        user_id=user_id,
        filters="category = 'Dresses' AND in_stock = true",
        limit=10
    )  # ~42ms

    # Query 2: Search user's order history
    orders = retrieve_context(
        query=user_query,
        user_id=user_id,
        filters=f"user_id = '{user_id}'",
        limit=5
    )  # ~38ms

    # Query 3: Search similar products
    similar = retrieve_context(
        query=products[0]['name'] if products else user_query,
        user_id=user_id,
        limit=5
    )  # ~41ms

    total_retrieval = (time.time() - start) * 1000
    print(f"Total multi-query retrieval: {total_retrieval:.0f}ms")

    context = {
        'products': products,
        'orders': orders,
        'similar': similar
    }

    return context

# Typical output:
# Total retrieval latency: 42ms
# Total retrieval latency: 38ms
# Total retrieval latency: 41ms
# Total multi-query retrieval: 121ms

Compare this to traditional:

  • Traditional multi-query: 330ms × 3 = 990ms
  • Shaped fast_tier: ~40ms × 3 ≈ 121ms (measured above)

Roughly 8x faster.
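The three retrievals in agent_search also run sequentially; when they don’t depend on each other (in the example above, the third query uses the first’s result, so the sketch simplifies it to be independent), they can be issued concurrently to cut wall-clock time further. A minimal sketch using ThreadPoolExecutor, with retrieve_context replaced by a fixed-delay stub so it runs without a live API:

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Stand-in for the retrieve_context function above: a fixed delay
# instead of a live API call.
def retrieve_context_stub(query, user_id, filters=None, limit=10):
    time.sleep(0.04)  # simulate the ~40ms fast_tier round trip
    return [{"query": query, "filters": filters, "limit": limit}]

def agent_search_concurrent(user_query: str, user_id: str):
    # Submit all three retrievals at once; wall-clock time is roughly
    # the slowest single call rather than the sum of all three.
    with ThreadPoolExecutor(max_workers=3) as pool:
        futures = {
            "products": pool.submit(retrieve_context_stub, user_query, user_id,
                                    "category = 'Dresses' AND in_stock = true", 10),
            "orders": pool.submit(retrieve_context_stub, user_query, user_id,
                                  f"user_id = '{user_id}'", 5),
            "similar": pool.submit(retrieve_context_stub, user_query, user_id, None, 5),
        }
        return {name: f.result() for name, f in futures.items()}

start = time.time()
context = agent_search_concurrent("red floral dress", "user-9472")
elapsed_ms = (time.time() - start) * 1000
print(f"Concurrent retrieval: {elapsed_ms:.0f}ms")  # ~40-50ms instead of ~120ms
```

The trade-off: concurrency only helps when queries are independent, and it multiplies instantaneous load on the retrieval service.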


Speed vs Accuracy: Not a Trade-off

A common misconception: “faster retrieval = worse accuracy.” This is false. Shaped’s fast_tier uses the same ANN algorithms (HNSW) as standalone vector DBs, but optimized for in-process execution with Redis-backed storage.

Why Shaped is Faster Without Sacrificing Accuracy

1. Elimination of network overhead

Traditional vector DBs require network calls between services. Every query travels:

Agent → Embedding Service (network hop #1)
Embedding Service → Vector DB (network hop #2)
Vector DB → Scoring Service (network hop #3)
Scoring Service → Agent (network hop #4)

Even on low-latency networks, each hop adds 5-20ms. Shaped runs the entire pipeline in-process — zero network overhead between stages.

2. Pre-computed embeddings

Embedding models (even fast ones like text-embedding-3-small) take 15-30ms per query. Shaped pre-computes and caches embeddings for all documents. At query time, the agent query is the only text that needs embedding — a one-time 15ms cost, not repeated for every document.
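The pre-computation idea can be illustrated with a toy index. Here embed() is a hash-based stand-in, not a real embedding model: documents are embedded once at indexing time, and only the query text is embedded per request, regardless of catalog size:

```python
import hashlib

EMBED_CALLS = 0

def embed(text: str) -> list[float]:
    """Toy deterministic 'embedding' so the sketch runs without a model."""
    global EMBED_CALLS
    EMBED_CALLS += 1
    digest = hashlib.sha256(text.encode()).digest()
    return [b / 255 for b in digest[:8]]

# Indexing time: pay the embedding cost once per document.
documents = ["Navy Blue Floral Dress", "Red Maxi Dress", "Green Wrap Dress"]
doc_index = {doc: embed(doc) for doc in documents}

# Query time: only the query is embedded, however many documents exist.
def search(query: str) -> str:
    q = embed(query)  # the only embedding call on the hot path
    dot = lambda v: sum(a * b for a, b in zip(q, v))  # similarity stand-in
    return max(doc_index, key=lambda d: dot(doc_index[d]))

search("red floral dress")
search("blue dress")
print(EMBED_CALLS)  # 3 docs at index time + 2 queries = 5
```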

3. Materialized model predictions

Personalization models typically add 80-150ms per query when run on-demand. Shaped materializes model predictions in Redis during indexing. Retrieval fetches pre-computed scores directly — no model inference at query time.

4. Redis-backed indexes

Redis provides sub-millisecond GET operations (0.1-1ms). HNSW graph traversal in-memory takes 10-30ms. Combined: 15-50ms for the full ANN search, compared to 100-300ms for disk-backed vector DBs that fetch data from S3/GCS.

Accuracy remains identical — the same HNSW algorithm with the same ef parameter produces the same recall. The speed gain comes from architectural optimization, not algorithm shortcuts.


Comparison: Traditional vs Shaped

| Component | Traditional (Pinecone) | Shaped (fast_tier) |
|---|---|---|
| Single vector query | 150–500ms* | 30–100ms* |
| Multi-stage pipeline (4 stages) | 370–910ms* | 40–100ms* |
| Multi-query (3 retrievals) | 450–1,500ms* | 90–300ms* |
| Embedding at query time | Yes (+20–50ms per query) | No (pre-computed) |
| Network round trips | 2–4 per query | 1 per query (HTTP to Shaped API) |
| Separate scoring service | Yes (+80–150ms) | No (materialized in Redis) |
| Cold start latency | 100–500ms (serverless) | 0ms (always warm) |
| Infrastructure cost | Pinecone: ~$0.10–0.50 per 1M queries + embedding service | Shaped: $42 per 1M queries (includes all stages) |
| Code to maintain | ~600 lines (orchestration + embedding + scoring) | ~40 lines (YAML config + API call) |
| Time to production | 3–6 weeks | < 7 days |

*Latencies based on typical production deployments under moderate load. Actual performance varies based on data size, query complexity, and infrastructure configuration.


FAQ

Q: How does Shaped achieve low latency?

A: Four optimizations: (1) Redis-backed storage for sub-millisecond data access, (2) pre-computed embeddings (no embedding service call at query time), (3) materialized model predictions cached in Redis, (4) in-process execution of the full four-stage pipeline with no network round trips between stages.

Q: Does fast_tier sacrifice accuracy for speed?

A: No. Shaped uses the same ANN algorithms (HNSW) as standalone vector DBs. The speed gain comes from eliminating network overhead and materializing intermediate results, not from reducing index quality or changing the algorithm.

Q: What’s the cost of fast_tier?

A: Shaped pricing is usage-based: $42 per million queries, $5/hour for GPU training/encoding, and $0.75-2.25/GB for data storage. Standard plan starts at $500/month minimum. This includes the entire pipeline (retrieval + filtering + scoring + reordering) in a single API call. Traditional stacks require separate costs for vector DB + embedding service + scoring service.

Q: Can I use fast_tier with real-time data?

A: Yes. fast_tier works with all Shaped connectors (Kafka, Kinesis, Segment, etc.). New data is indexed within 30 seconds and immediately available for fast queries.

Q: How does this compare to self-hosting Qdrant or Weaviate?

A: Self-hosted vector DBs can achieve 100-200ms latency with careful tuning, but you still have network overhead between services, need to manage embedding pipelines, and run separate scoring/re-ranking. Shaped consolidates all of this into a single, optimized system.

Q: What happens if my query load spikes?

A: Shaped auto-scales based on load. You can configure auto-scaling thresholds in the deployment block. Queries are routed to healthy instances with sub-10ms overhead.

Q: Does fast_tier work for hybrid search (vector + keyword)?

A: Yes. Shaped supports hybrid search with BM25 lexical search combined with vector similarity. Both run in-process with the same low-latency profile.
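Shaped’s internal fusion method isn’t detailed here; as a generic illustration of hybrid fusion, reciprocal rank fusion (RRF) is one common way to merge a lexical (BM25) ranking with a vector-similarity ranking:

```python
# Sketch: reciprocal rank fusion (RRF), a generic way to combine two ranked
# lists. This illustrates hybrid fusion in general, not Shaped's internals.
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked ID lists; k dampens the weight of top ranks (60 is the common default)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking   = ["prod-002", "prod-001", "prod-007"]   # keyword match order
vector_ranking = ["prod-001", "prod-005", "prod-002"]   # embedding similarity order

fused = rrf([bm25_ranking, vector_ranking])
print(fused)  # ['prod-001', 'prod-002', 'prod-005', 'prod-007']
```

Items ranked well by both signals (here prod-001 and prod-002) rise to the top, while items seen by only one ranker still survive into the fused list.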


Conclusion

Retrieval latency is the bottleneck in production agents. LLM inference is getting faster, but if your retrieval layer takes 2 seconds, your agent is unusable.

The traditional approach — managed vector DB + separate embedding service + external scoring — compounds latency at every stage. A “simple” multi-stage query takes 370-910ms. Multi-query workflows (common in production) hit 1,000ms+.

Shaped’s fast_tier architecture eliminates the bottleneck: Redis-backed storage, pre-computed embeddings, materialized model predictions, and in-process execution deliver retrieval in 30-100ms. The entire four-stage pipeline (retrieve, filter, score, reorder) runs unified with no network hops between stages.

If your agent feels slow, check your retrieval layer first. Chances are, it’s the bottleneck.
