Context Window Optimization: Why Ranking, Not Stuffing, Is the Scaling Law for Agents

Your agent's biggest bottleneck isn't the model. It's what you put in the context window. Attention cost scales quadratically with input length, doubling your context doesn't double your cost, it quadruples it. Learn why retrieval ranking is the most important infrastructure problem in production agents, and how to solve it with multi-retriever pipelines.

Feb 27, 2026

min read

Shaped Team

Quick Answer: Why Your Agent Is Slow and Expensive

The defining problem of production agents isn’t model intelligence. It’s context window management. Attention cost scales quadratically with input length, doubling your context doesn’t double your cost, it quadruples it. Every extra chunk you stuff into the prompt makes everything slower, more expensive, and paradoxically, less accurate.

The fix isn’t a bigger context window. It’s a smarter one. Instead of passing 200 unranked chunks to your LLM, you pass 10 ranked results, selected by ML models trained on your data. Less context. Better answers. Lower cost.

Key Takeaways:

Attention is quadratic, 2x the context tokens = 4x the compute cost. This is the fundamental constraint, and no model architecture change is fixing it soon.
Stuffing the context window degrades quality, LLMs struggle with irrelevant context. More ≠ better. Research consistently shows that retrieval precision matters more than recall for downstream task performance.
Cosine similarity is not ranking, vector search returns “similar” results. Trained ranking models return useful results. These are different things.
Real-time signals matter more than model depth, for retrieval, the scaling law isn’t about parameter count. It’s about data freshness. What a user interacted with 30 seconds ago matters more than a 10% better embedding model.
Multi-retriever pipelines solve it, fan out to vector search, lexical search, collaborative filtering, and trending simultaneously. Score with ML ensembles. Return the top 10 from 10,000 candidates. One API call. Under 50ms.

Time to read: 18 minutes | Includes: 5 code examples, 2 architecture diagrams, 1 comparison table

The Quadratic Wall
Why Stuffing Doesn’t Scale
The Retrieval Scaling Law
Part 1: The Naive Approach, Stuff Everything
Part 2: The Shaped Way, Rank Everything
Real-Time Behavioral Signals: The Missing Dimension
Comparison Table
FAQ

The Quadratic Wall

Here’s the math that governs every production agent deployment.

The self-attention mechanism in transformers computes attention scores between every pair of tokens in the input. For an input of length n, this requires O(n²) operations. In practice:

Context Tokens	Relative Compute Cost	Relative Latency
1,000	1x	~200ms
4,000	16x	~800ms
16,000	256x	~3.2s
64,000	4,096x	~12s+
128,000	16,384x	~30s+

These numbers are approximate and vary by model architecture, but the relationship holds: doubling the context quadruples the cost.

This isn’t a temporary limitation. It’s fundamental to how attention works. There are ongoing research efforts to reduce this (linear attention, sparse attention, ring attention), but in production today, with Claude, GPT-4, Gemini, you’re paying the quadratic tax on every token you pass in.

This means every irrelevant chunk in your context window is actively harmful. It:

Costs more compute, you’re paying quadratic cost for information the model doesn’t need
Increases latency, time-to-first-token scales with input length
Degrades quality, research on “lost in the middle” shows LLMs struggle to use information buried in long contexts
Reduces throughput, longer contexts mean fewer concurrent requests per GPU

The implication is counterintuitive: the best way to improve your agent is to give it less context, not more. But it has to be the right less.

Why Stuffing Doesn’t Scale

The default architecture for most RAG pipelines today looks like this:

User Query

↓

Vector Search
(top-k=200, cosine similarity)

↓

Stuff all 200 chunks
into system prompt

↓

LLM generates answer

This works in demos. It breaks in production. Here’s why.

Problem 1: Same results for every user

Vector search returns chunks by semantic similarity to the query. It doesn’t know who’s asking. A junior developer and a CTO asking the same question get the same 200 chunks, even though they need completely different levels of detail, different documents, and different context.

Problem 2: No learned relevance

Cosine similarity measures distance in embedding space. It does not measure usefulness. A chunk can be semantically similar to the query and completely unhelpful for the task. There’s no feedback loop, the system never learns which retrieved chunks actually led to good answers.

Problem 3: Retrieval and reranking are separate systems

To improve on raw vector search, teams bolt on a reranker (Cohere, Jina, a cross-encoder). Now you have:

A vector database (Pinecone, Weaviate, Qdrant)
A reranker API
Custom filtering logic
Maybe a BM25 index for keyword matching
Maybe a feature store for user context
Glue code to stitch it all together

Each system has its own latency, its own failure mode, its own deployment. The “simple” RAG pipeline is now five services deep.

Problem 4: No diversity control

Vector search returns the most similar results. Similar results are often redundant. If your top 10 chunks are all from the same document section, you’ve wasted 9 context slots on near-duplicate information. The LLM gets one perspective repeated 10 times instead of 10 diverse perspectives.

The enterprise reality

These problems compound at enterprise scale. We’ve spoken with engineering leaders at Fortune 500 companies who describe the same pattern: they deployed agents internally, usage exploded, and now they’re staring at inference bills that are growing faster than the value the agents provide.

The bottleneck isn’t the model. It’s what goes into the context window. Every enterprise deploying hundreds of AI workflows is discovering the same thing: context window management is the #1 infrastructure problem.

The Retrieval Scaling Law

In classic NLP and computer vision, scaling laws are about parameters. Bigger model = better performance. This is the Chinchilla/Kaplan framing that’s driven the race to trillion-parameter models.

But for retrieval, and for production agents that depend on retrieval, the scaling law is different. It’s about data freshness and ranking quality, not model depth.

Consider two retrieval systems:

System A: State-of-the-art embedding model (1B parameters), cosine similarity, batch-updated index, no personalization.

System B: Good embedding model (110M parameters), trained ranking model (LightGBM) on interaction data, real-time streaming index, per-user scoring.

System B wins. Every time. In every production deployment we’ve seen.

Why? Because the ranking model has learned what’s useful from actual user behavior. It knows that when user X asks about “authentication,” they usually need the API reference, not the conceptual overview. It knows that a document updated 2 hours ago is more relevant than one updated 6 months ago for this particular query. It knows that chunks from the “troubleshooting” section have a 3x higher helpfulness rate than chunks from “introduction.”

The embedding model doesn’t know any of this. It just measures semantic distance.

The retrieval scaling law: performance scales with signal freshness and ranking sophistication, not embedding model size.

This is the insight that separates production retrieval systems from demo RAG pipelines. Google figured this out 20 years ago for web search, raw text similarity was never enough. You need PageRank, click-through models, freshness signals, personalization. The same applies to agent context, except the data is bespoke to each company, not constant across the internet. Which makes it harder.

Part 1: The Naive Approach, Stuff Everything

Here’s what the standard RAG pipeline looks like in production code. We’ll walk through exactly why it breaks.

Architecture

User Query
→
Embed Query
(OpenAI API)
→
Vector DB
(Pinecone)
↓
top-k=200 chunks
↓
LLM (Claude)
generates answer
←
Stuff all 200 chunks
into prompt

Implementation

# naive_rag.py
import openai
import pinecone
import anthropic

# Step 1: Embed the query
def get_embedding(text: str) -> list[float]:
    response = openai.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return response.data[0].embedding

# Step 2: Retrieve top-k from vector DB
def retrieve_chunks(query: str, top_k: int = 200) -> list[dict]:
    embedding = get_embedding(query)
    
    results = pinecone_index.query(
        vector=embedding,
        top_k=top_k,
        include_metadata=True
    )
    
    return [
        {
            "content": match.metadata["content"],
            "source": match.metadata["source"],
            "score": match.score  # cosine similarity — NOT relevance
        }
        for match in results.matches
    ]

# Step 3: Stuff everything into the prompt
def answer_question(query: str) -> str:
    chunks = retrieve_chunks(query, top_k=200)
    
    # Concatenate all chunks into one massive context block
    context = "\n\n---\n\n".join([
        f"[{chunk['source']}]\n{chunk['content']}"
        for chunk in chunks
    ])
    
    # This context block can easily be 50,000+ tokens
    response = anthropic.messages.create(
        model="claude-sonnet-4-5-20250929",
        max_tokens=1024,
        system=f"Answer the user's question using ONLY the context below.\n\n{context}",
        messages=[{"role": "user", "content": query}]
    )
    
    return response.content[0].text

What you’re paying

Let’s do the math for a single query with 200 retrieved chunks averaging 250 tokens each:

Component	Tokens	Cost (Claude Sonnet)
System prompt + context	~50,000	~$0.15
User query	~50	negligible
Output	~500	~$0.0075
Total per query	~50,550	~$0.16

At 10,000 queries/day, that’s $1,600/day or $48,000/month in LLM inference alone. And that’s one agent.

Now what if you sent 10 ranked results instead of 200 unranked chunks?

Component	Tokens	Cost (Claude Sonnet)
System prompt + context	~2,500	~$0.0075
User query	~50	negligible
Output	~500	~$0.0075
Total per query	~3,050	~$0.015

That’s $150/day or $4,500/month. A 10x cost reduction, and the answers are better because the LLM isn’t drowning in irrelevant context.

The question is: how do you pick the right 10?

Part 2: The Shaped Way, Rank Everything

Shaped replaces the “retrieve 200, stuff everything” pattern with a multi-retriever pipeline that retrieves from multiple sources, filters, scores with trained ML models, and reorders for diversity, all in one API call, under 50ms.

Architecture

User Query
+ user_id + history
↓
SHAPED ENGINE
Vector Search
(50 candidates)
Lexical Search
(50 candidates)
Similarity
Collab (50)
Trending
Popular (50)
↓
FILTER
(rules + metadata)
SCORE
(ML model ensemble)
↓
REORDER (diversity)
↓
10 ranked results · <50ms p99
↓
LLM (Claude)
generates answer
receives 10 ranked chunks instead of 200 random ones

Implementation

Step 1: Define your engine

# agent_context_engine.yaml
data:
  item_table:
    name: knowledge_chunks        # Your docs, indexed with item_id
    type: table
  user_table:
    name: agent_users             # User profiles with user_id
    type: table
  interaction_table:
    type: query
    query: |
      SELECT user_id, item_id, created_at,
             was_helpful AS label  # Track which context helped the agent
      FROM context_feedback

index:
  embeddings:
    - name: text_embedding
      columns: [content, title, source]
      model: bge-large-en-v1.5
  lexical:
    - name: keyword_index
      columns: [content, title]

training:
  models:
    - name: helpfulness_score     # Learns what context is useful
      policy_type: lightgbm
    - name: elsa_collab           # Collaborative signal from interactions
      policy_type: elsa
      strategy: early_stopping

deployment:
  data_tier: fast_tier            # Redis-backed for <50ms latency

What this does:

interaction_table tracks which retrieved chunks actually helped the agent (thumbs up/down, task completion, user feedback). This is the signal that trains the ranking model.
helpfulness_score is a LightGBM model trained on this interaction data. It learns patterns like “chunks from the API reference section are 3x more helpful for user_8f3a than chunks from the overview.”
elsa_collab is a collaborative filtering model. It learns that users who found chunk A helpful also found chunk B helpful, even if A and B aren’t semantically similar.
fast_tier deploys the engine on Redis-backed infrastructure for sub-50ms query latency.

Step 2: Query with ShapedQL

-- Multi-retriever hybrid search with learned ranking
SELECT *
FROM
  -- Semantic search
  text_search(query='$query_text', mode='vector',
             text_embedding_ref='text_embedding', limit=50,
             name='vector_search'),
  -- Keyword search
  text_search(query='$query_text', mode='lexical',
             fuzziness=2, limit=50,
             name='lexical_search'),
  -- Items similar to what this user found helpful before
  similarity(embedding_ref='text_embedding',
            encoder='interaction_round_robin',
            input_user_id='$user_id', limit=50)
WHERE source_type IN ('doc', 'tool_output', 'memory')
ORDER BY score(
  expression='0.4 * retrieval.vector_search
    + 0.2 * retrieval.lexical_search
    + 0.4 * helpfulness_score',
  input_user_id='$user_id',
  input_interactions_item_ids='$interaction_item_ids'
)
REORDER BY diversity(diversity_lookback_window=15)
LIMIT 10

One query does what previously required five services:

Vector search, semantic similarity (replaces Pinecone query)
Lexical search, keyword matching (replaces BM25/Elasticsearch)
Collaborative similarity, “users like you found these helpful” (replaces custom feature engineering)
ML scoring, blends retrieval scores with a trained helpfulness model (replaces reranker API)
Diversity reorder, ensures the 10 results cover different topics, not 10 variations of the same chunk (replaces custom dedup logic)

Step 3: Wire it into your agent

import requests
import anthropic

SHAPED_API_KEY = "your-api-key"

def get_ranked_context(query: str, user_id: str, recent_docs: list[str]) -> list[dict]:
    """
    Retrieve ranked context from Shaped — 10 results from 10,000+ candidates.
    """
    response = requests.post(
        "https://api.shaped.ai/v2/engines/agent_context_engine/query",
        headers={
            "Content-Type": "application/json",
            "x-api-key": SHAPED_API_KEY
        },
        json={
            "query": """
              SELECT *
              FROM text_search(query='$query_text', mode='vector',
                      text_embedding_ref='text_embedding', limit=50, name='vs'),
                   text_search(query='$query_text', mode='lexical',
                      fuzziness=2, limit=50, name='ls'),
                   similarity(embedding_ref='text_embedding',
                      encoder='interaction_round_robin',
                      input_user_id='$user_id', limit=50)
              ORDER BY score(
                expression='0.4 * retrieval.vs + 0.2 * retrieval.ls
                  + 0.4 * helpfulness_score',
                input_user_id='$user_id',
                input_interactions_item_ids='$interaction_item_ids')
              REORDER BY diversity(diversity_lookback_window=15)
              LIMIT 10
            """,
            "parameters": {
                "query_text": query,
                "user_id": user_id,
                "interaction_item_ids": recent_docs
            },
            "return_metadata": True
        }
    )
    return response.json()["results"]


def answer_question(query: str, user_id: str, recent_docs: list[str]) -> str:
    """
    Answer a question with ranked context — 10 results, not 200.
    """
    results = get_ranked_context(query, user_id, recent_docs)

    # 10 ranked chunks ≈ 2,500 tokens instead of 50,000
    context = "\n\n".join([
        f"[{r['source']}] (relevance: {r['score']:.2f})\n{r['content']}"
        for r in results
    ])

    client = anthropic.Anthropic()
    response = client.messages.create(
        model="claude-sonnet-4-5-20250929",
        max_tokens=1024,
        system=f"Answer the user's question using the ranked context below. "
               f"Context is ordered by relevance — prioritize earlier items.\n\n{context}",
        messages=[{"role": "user", "content": query}]
    )

    return response.content[0].text

What changed

	Naive RAG	Shaped
Chunks sent to LLM	200 (unranked)	10 (ranked by trained ML)
Context tokens	~50,000	~2,500
Cost per query	~$0.16	~$0.015
Retrieval latency	~200ms (1 source)	<50ms (4 sources)
Personalized	No	Yes, per-user scoring
Learns from usage	No	Yes, continuous retraining
Diversity control	No	Yes, dedup + topic spread
Services to operate	3-5	1 API

Real-Time Behavioral Signals: The Missing Dimension

Here’s something that memory tools like Mem0, Letta, and Zep don’t capture: what’s happening right now.

Memory tools store and manage information from past conversations. They’re good at “this user prefers concise answers” or “we discussed the Q3 roadmap yesterday.” That’s valuable context.

But they don’t know what the user interacted with 30 seconds ago. They don’t know that across the platform, a specific document just spiked in views because of a production incident. They don’t know that users similar to this one have been clicking on “migration guide” content all morning.

These are real-time behavioral signals, and they’re the most predictive features for retrieval ranking.

Shaped’s connectors include real-time streaming sources (Kafka, Segment, Amplitude) that update interaction data continuously. When you define an interaction_table in your engine, Shaped tracks:

What this user interacted with in the last 30 seconds, last minute, last 5 minutes
What similar users are interacting with right now (collaborative signals)
What’s trending across the platform (popularity signals)
Recency of the content itself (freshness signals)

These signals feed directly into the ranking models. The helpfulness_score model retrains on a schedule, and the _derived_popular_rank column updates in real time.

This is the retrieval scaling law in action: keeping the model small but getting the data there faster. A LightGBM model with fresh signals beats a transformer reranker with stale data.

# Real-time connector example
data:
  interaction_table:
    name: agent_interactions
    type: table
    connector:
      type: segment               # Real-time streaming
      source_id: your_segment_source
      # Events appear in Shaped within 30 seconds

Combined with a query that uses these signals:

SELECT *
FROM text_search(query='$query_text', mode='vector',
                 text_embedding_ref='text_embedding', limit=50),
     column_order(columns='_derived_popular_rank ASC', limit=50)
ORDER BY score(
  expression='0.6 * helpfulness_score + 0.4 * elsa_collab',
  input_user_id='$user_id',
  input_interactions_item_ids='$interaction_item_ids'
)
LIMIT 10

The _derived_popular_rank retriever surfaces content that’s trending right now. The elsa_collab model captures what similar users are finding useful. The interaction_item_ids parameter passes the user’s recent interactions so the scoring model can personalize in real time.

Memory tools remember the past. Shaped ranks the present.

Comparison Table

	Stuff Everything (Naive RAG)	Vector DB + Reranker	Memory Layer (Mem0/Letta/Zep)	Shaped
What it does	Retrieve top-k, stuff into prompt	Retrieve + rerank	Store and manage agent memories	Multi-retriever ranked retrieval
Context tokens	50,000+	5,000-10,000	Varies	~2,500 (10 ranked results)
Cost per query	~$0.16	~$0.04-0.08	Depends on memory size	~$0.015
Ranking	Cosine similarity	Similarity + cross-encoder	Similarity + basic reranking	Trained ML models + ensembles
Personalized	No	No	User-scoped memory	Per-user ML scoring
Real-time signals	No	No	From conversations only	From all interaction data
Learns	No	No	From conversations	Continuous retraining on all data
Diversity	No	No	No	Yes, reorder step
Retrieval latency	50-200ms	200-500ms	200ms-1.4s	<50ms p99
Services to run	2 (vector DB + LLM)	3-5	2-3	1 API

Note: These approaches are not mutually exclusive. Shaped complements memory layers. Use Mem0, Letta, or Zep for conversation management (summarization, conflict resolution, long-term memory). Use Shaped for ranking everything your agent retrieves, documents, tools, memories, and knowledge. You can even index your memory store’s outputs as items in a Shaped engine and rank them alongside your other data sources.

FAQ

Q: Why not just use a bigger context window?

A: Because of the quadratic cost. Gemini and Claude now support 1M+ token windows, but using them is expensive and slow. A 200K context query can take 30+ seconds and cost $1+. More importantly, research shows LLMs perform worse with large amounts of irrelevant context, they get confused by noise, miss key details buried in the middle, and produce lower-quality answers. The better approach: send less context, but make it the right context.

Q: How does the helpfulness_score model know what’s “helpful”?

A: You define it through the interaction_table. Track signals like: did the user thumbs-up the response? Did they complete the task? Did they ask a follow-up (indicating the first answer was insufficient)? Did they click through to the source document? These signals become the label column that the LightGBM model trains on. Over time, it learns which documents, sections, and content types are most useful for which users and queries.

Q: What if I don’t have interaction data yet?

A: Start without the training block. Shaped works as a hybrid search engine out of the box, vector search + lexical search + filtering + diversity reorder. This alone is a significant upgrade over single-source vector search. As you accumulate interaction data, add the training models. Ranking quality improves continuously.

Q: How is this different from a reranker like Cohere or Jina?

A: A reranker takes a list of candidates and re-scores them using a cross-encoder. Shaped does that and more: it handles the initial retrieval (from multiple sources), filtering, scoring (with models trained on your data, not generic models), and diversity reordering, all in one system. A reranker is one step in the pipeline. Shaped is the whole pipeline.

Q: Can I use Shaped with my existing vector database?

A: Yes. Use the candidate_ids retriever to pass pre-retrieved IDs from your existing vector DB into Shaped for scoring and reranking. Or migrate your data into Shaped’s connectors and let it handle retrieval end-to-end. Both work.

Q: What about tool selection for agents?

A: Index your tool catalog as items in a Shaped engine. Each tool’s description, parameters, and usage examples become searchable fields. When the agent needs to select tools, query Shaped with the user’s intent, it returns the top 5 tools ranked by semantic match and past usage patterns. This scales to hundreds of tools without stuffing all descriptions into the context.

Q: Does this work for multi-agent architectures?

A: Yes. Each agent can query the same Shaped engine with different parameters. Agent A retrieves context for code review, Agent B retrieves context for customer support, Agent C retrieves context for data analysis, all from the same underlying data, ranked differently based on each agent’s interaction patterns.

Conclusion

The scaling law for production agents isn’t about bigger models or longer context windows. It’s about what goes into the context window, ranked by learned signals, personalized to the user, drawn from multiple retrieval sources, and updated in real time.

The math is simple: 10 ranked results in 2,500 tokens outperforms 200 unranked chunks in 50,000 tokens. Every time. At 10x lower cost. With better answers.

Context window management is the infrastructure problem of the agent era. Solve it, and everything downstream improves: accuracy, latency, cost, user satisfaction. Ignore it, and you’re paying quadratic cost for linear (or negative) value.

Context Window Optimization: Why Ranking, Not Stuffing, Is the Scaling Law for Agents

Quick Answer: Why Your Agent Is Slow and Expensive

Table of Contents

The Quadratic Wall

Why Stuffing Doesn’t Scale

Problem 1: Same results for every user

Problem 2: No learned relevance

Problem 3: Retrieval and reranking are separate systems

Problem 4: No diversity control

The enterprise reality

The Retrieval Scaling Law

Part 1: The Naive Approach, Stuff Everything

Architecture

Implementation

What you’re paying

Part 2: The Shaped Way, Rank Everything

Architecture

Implementation

What changed

Real-Time Behavioral Signals: The Missing Dimension

Comparison Table

FAQ

Conclusion

Get up and running with one engineer in one sprint

Related Posts

10 Best Practices in Data Ingestion: A Scalable Framework for Real-Time, Reliable Pipelines

5 Best APIs for Adding Personalized Recommendations to Your App in 2025

Action is All You Need: Dual-Flow Generative Ranking Network for Recommendation