The Challenge: Context-Aware AI Agents in Production
You've built a RAG-powered AI agent. It answers questions brilliantly—until it recommends the same restaurant twice, suggests meat dishes to a vegan user, or ignores that your VIP customer has a $10,000 budget while a first-time visitor has $100.
The problem? Most RAG systems are stateless. They retrieve relevant documents but ignore who the user is, what they've done, and what they actually care about.
In this guide, we'll explore:
- How production teams currently handle user state in RAG systems
- The architectural tradeoffs of traditional vs. unified approaches
- A practical implementation guide for stateful agents
- When you need behavioral personalization vs. simple context
How Most Teams Build Stateful RAG Today
Based on production architectures from companies building AI agents in 2026, here's the common pattern:
The Traditional Multi-Component Stack
Vector Storage Layer (Long-term Memory)
- Popular choices: Pinecone, Weaviate, Qdrant, Milvus, Chroma
- What it stores: Document embeddings, product catalogs, knowledge bases
- Strength: Fast semantic search at scale
- Limitation: No understanding of user behavior or preferences
Session/Profile Storage (Short-term Memory)
- Popular choices: Redis, PostgreSQL with JSONB, MongoDB, DynamoDB
- What it stores: User preferences, session state, interaction history
- Strength: Fast key-value lookups
- Limitation: Disconnected from retrieval logic
Reranking Layer (Relevance Intelligence)
- Popular choices: Cohere Rerank, LangChain contextual compression, custom models
- What it does: Reorders results based on semantic similarity to query
- Strength: Improves precision without reindexing
- Limitation: Primarily linguistic—limited behavioral awareness
Orchestration
- Popular choices: LangChain, LlamaIndex, custom Python/TypeScript
- What it does: Fetches user state, queries vectors, reranks, prompts LLM
- Limitation: You maintain all synchronization logic
Architecture Diagram: Traditional Approach
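In sketch form, the stack described above wires together like this:

```
User query
   │
   ▼
Orchestrator (LangChain / LlamaIndex / custom code)
   ├─► 1. Fetch user state ──► Redis / Postgres (session & profile store)
   ├─► 2. Semantic search ───► Vector DB (Pinecone / Weaviate / Qdrant)
   ├─► 3. Rerank results ────► Reranker (Cohere Rerank / custom model)
   └─► 4. Assemble prompt ───► LLM
```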
Three Production Challenges with Traditional Architectures
1. The Synchronization Tax
Problem: User state lives separately from retrieval logic.
Real scenario:
- User clicks on a luxury hotel at 2:37 PM
- You must manually:
  - Update Redis with the interaction
  - Query vector DB with exclusion filter (`NOT hotel_id = 'clicked_item'`)
  - Somehow bias future results toward "luxury" tier
Code reality:
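A minimal sketch of what that orchestration code ends up looking like (the Redis keys, Pinecone index name, and `tier` metadata field are illustrative assumptions, not a prescribed schema):

```python
import time

import redis
from pinecone import Pinecone

r = redis.Redis()
index = Pinecone(api_key="YOUR_API_KEY").Index("hotels")  # illustrative index

def handle_click(user_id: str, hotel_id: str, query_embedding: list[float]):
    # 1. Manually record the interaction in the session store.
    r.rpush(f"user:{user_id}:clicks", hotel_id)
    r.hset(f"user:{user_id}:profile", "last_click_at", time.time())

    # 2. Re-query the vector DB, excluding everything clicked so far.
    #    The exclusion logic is yours to write and keep in sync.
    clicked = [c.decode() for c in r.lrange(f"user:{user_id}:clicks", 0, -1)]
    results = index.query(
        vector=query_embedding,
        top_k=20,
        filter={"hotel_id": {"$nin": clicked}},
        include_metadata=True,
    )

    # 3. "Bias toward luxury" has no first-class home, so it becomes
    #    hand-rolled re-scoring over metadata.
    tier_boost = {"luxury": 1.2, "mid": 1.0, "budget": 0.9}
    return sorted(
        results.matches,
        key=lambda m: m.score * tier_boost.get((m.metadata or {}).get("tier"), 1.0),
        reverse=True,
    )
```

Every one of those steps is a place for state to drift out of sync.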
2. Linguistic vs. Behavioral Mismatch
Problem: Rerankers optimize for semantic similarity, not user intent.
Example:
- Query: "dinner recommendations"
- Vector search returns: Italian, Japanese, Steakhouse (all semantically relevant)
- User history: Clicked 5 vegan restaurants, never clicked meat-heavy options
- Cohere Rerank: Still surfaces Steakhouse because the description semantically matches "dinner"
Why? Traditional rerankers are trained on linguistic datasets (MS MARCO, Natural Questions), not behavioral data (clicks, purchases, time-on-page).
3. Positional Bias in Context Windows
Problem: LLMs suffer from "lost in the middle" effects.
When you textify user state and prepend it to the prompt:
```
System: User is vegan, prefers budget-friendly, lives in Brooklyn.

[49 restaurant documents inserted here]

User query: "Where should I eat tonight?"
```
If the perfect match (vegan Brooklyn spot) is document #31, the LLM may ignore it in favor of documents at positions 1-5 or 45-49.
Alternative Approach: Unified Retrieval with Behavioral Ranking
The Concept
Instead of separating vector search, user state, and reranking into distinct systems, some modern platforms combine them into a unified retrieval layer that natively understands both content and behavior.
Key difference: Ranking is done using behavioral ML models (collaborative filtering, two-tower networks, learning-to-rank) trained on actual user interactions, not just semantic similarity.
Architecture Comparison
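Summarizing the two designs:

| | Traditional stack | Unified retrieval |
|---|---|---|
| Components | Vector DB + session store + reranker + orchestration glue | Single retrieval layer over content and behavior |
| Ranking signal | Semantic similarity, optionally reranked | Behavioral models (collaborative filtering, two-tower, learning-to-rank) plus semantic similarity |
| User state | Fetched separately, synchronized by hand | Native to the retrieval layer, updated from event streams |
| Seen-item exclusions | Hand-rolled filters in application code | Declarative filters in the retrieval query |
| Main failure mode | Synchronization bugs and stale state | Dependency on one platform/model |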
Implementation Guide: Building a Stateful Agent
We'll use Shaped as the example unified platform, but the concepts apply to building custom solutions.
Step 1: Define Your Data Schema
Connect three types of data using YAML schema files:
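For example, the interaction table might look like this (field names and types are illustrative; check Shaped's schema reference for the exact format), with similar files for users and items:

```yaml
# events.yaml -- illustrative dataset schema
name: events
schema_type: CUSTOM
column_schema:
  user_id: String
  item_id: String
  event_type: String   # click, view, purchase
  created_at: DateTime
```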
Create the tables via CLI:
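Assuming the standard Shaped CLI (`pip install shaped`); subcommand names may differ by version:

```bash
shaped create-dataset --file events.yaml
shaped create-dataset --file users.yaml
shaped create-dataset --file restaurants.yaml
```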
Step 2: AI Enrichment (Optional but Powerful)
Use AI Views to materialize high-level user intent from raw interaction logs. Create a view definition:
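Something along these lines (the AI View syntax below is a hypothetical sketch; consult Shaped's AI Views documentation for the real format):

```yaml
# user_intent_view.yaml -- hypothetical view definition
name: user_intent_summary
source: events
group_by: user_id
window: 30d
prompt: |
  Summarize this user's dining intent from their recent interactions:
  cuisine affinity, price sensitivity, and dietary restrictions.
```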
Create the view:
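Registering it would look something like this (the exact CLI verb for views is an assumption):

```bash
shaped create-view --file user_intent_view.yaml
```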
Output example:
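Illustrative output for the running example:

```json
{
  "user_id": "user_123",
  "intent_summary": "Prefers vegan and budget-friendly restaurants in Brooklyn; most active on weekend evenings."
}
```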
Why this matters: Instead of passing 20 raw click events to your LLM prompt, you pass a concise summary. Reduces tokens, improves signal.
Step 3: Train a Behavioral Ranking Model
Define a ranking model that learns the relationship between user state and items:
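For example (connector IDs and fetch queries are illustrative; see Shaped's model schema docs for the exact structure):

```yaml
# restaurant_ranking.yaml -- illustrative model definition
model:
  name: restaurant_ranking
connectors:
  - type: Dataset
    id: events
    name: events
  - type: Dataset
    id: restaurants
    name: restaurants
fetch:
  events: |
    SELECT user_id, item_id, created_at,
           CASE event_type WHEN 'purchase' THEN 3 ELSE 1 END AS label
    FROM events
  items: |
    SELECT item_id, name, cuisine, price_tier
    FROM restaurants
```

Note the `label` weighting: a purchase is encoded as a stronger signal than a click, echoing the pitfalls section below.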
Create the engine:
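Assuming the standard model-creation command (the CLI may say `model` where this guide says "engine"):

```bash
shaped create-model --file restaurant_ranking.yaml
```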
What this does:
- Learns that users who clicked luxury Italian restaurants are likely to engage with similar venues
- Automatically handles cold-start (new users) by falling back to content similarity
- Updates in real-time as new interactions stream in
Check engine status:
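```bash
# Flag names may vary by CLI version
shaped view-model --model-name restaurant_ranking
```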
Wait for status to be ACTIVE (may take 30 minutes to several hours depending on data size).
Step 4: Real-Time State Updates
When a user interacts with an item, stream the event to Shaped. If using real-time connectors (Kafka, Kinesis, Segment), events are automatically streamed. For custom events, insert via API:
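For example, from Python application code (the endpoint path and payload shape are assumptions based on Shaped's dataset-insert API; verify against the current docs):

```python
import os
import time

import requests

SHAPED_API_KEY = os.environ["SHAPED_API_KEY"]

def record_interaction(user_id: str, item_id: str, event_type: str) -> None:
    # Stream a single interaction into the 'events' dataset.
    resp = requests.post(
        "https://api.shaped.ai/v1/datasets/events/insert",
        headers={"x-api-key": SHAPED_API_KEY},
        json={"data": [{
            "user_id": user_id,
            "item_id": item_id,
            "event_type": event_type,
            "created_at": int(time.time()),  # timestamp format is an assumption
        }]},
        timeout=10,
    )
    resp.raise_for_status()

record_interaction("user_123", "restaurant_456", "click")
```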
Or via REST API:
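The same call with curl (endpoint path as assumed above):

```bash
curl -X POST https://api.shaped.ai/v1/datasets/events/insert \
  -H "x-api-key: $SHAPED_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"data": [{"user_id": "user_123", "item_id": "restaurant_456", "event_type": "click", "created_at": 1718000000}]}'
```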
What happens automatically:
- Engine indexes the new interaction within 30 seconds (real-time connectors)
- User's latent representation updates for personalized queries
- Popularity scores recalculate
- No manual cache invalidation needed
Step 5: Querying with Behavioral + Semantic Ranking
Use ShapedQL to combine filters, semantic search, and behavioral ranking:
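A query of roughly this shape (the ShapedQL syntax below is a hypothetical sketch; refer to Shaped's query documentation for the real grammar):

```sql
-- Hypothetical ShapedQL sketch
SELECT item_id, name, score
FROM rank(
  model      => 'restaurant_ranking',    -- behavioral ranking
  user_id    => 'user_123',
  text_query => 'dinner recommendations' -- semantic signal
)
WHERE price_tier <= 2                               -- hard business filter
  AND item_id NOT IN interactions('user_123')       -- declarative exclusion
LIMIT 10
```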
Execute via CLI:
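For simple retrievals, the rank command covers this (flag names may vary by version, and full ShapedQL may use a dedicated query command):

```bash
shaped rank --model-name restaurant_ranking --user-id user_123 --limit 10
```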
Or via REST API:
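The equivalent rank call over REST (endpoint and parameter names are assumptions based on Shaped's rank API):

```bash
curl -X POST https://api.shaped.ai/v1/models/restaurant_ranking/rank \
  -H "x-api-key: $SHAPED_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"user_id": "user_123", "text_query": "dinner recommendations", "limit": 10}'
```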
Key benefits:
- Deterministic exclusions: Seen items are filtered inside the retrieval query itself, so they can't slip back into results
- Behavioral ranking: User affinity score is learned from actual engagement
- Semantic fallback: Text similarity handles novel queries
- Single query: No multi-hop fetches
Step 6: Integration with Your Agent
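A minimal sketch of how the unified retrieval call slots into an agent loop (the rank endpoint shape and response fields are assumptions carried over from Step 5):

```python
import os

import requests

SHAPED_API_KEY = os.environ["SHAPED_API_KEY"]
RANK_URL = "https://api.shaped.ai/v1/models/restaurant_ranking/rank"

def retrieve(user_id: str, query: str, limit: int = 5) -> dict:
    # One call replaces fetch-state -> vector search -> rerank.
    # Response shape (ids / scores / metadata) is an assumption; verify
    # against Shaped's rank API docs.
    resp = requests.post(
        RANK_URL,
        headers={"x-api-key": SHAPED_API_KEY},
        json={"user_id": user_id, "text_query": query, "limit": limit},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()

def answer(user_id: str, query: str, llm) -> str:
    # `llm` is any callable mapping a prompt string to a completion.
    results = retrieve(user_id, query)
    context = "\n".join(f"- {item_id}" for item_id in results.get("ids", []))
    prompt = (
        "These options are already personalized and filtered for this user:\n"
        f"{context}\n\nUser question: {query}"
    )
    return llm(prompt)
```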
When Do You Need Behavioral Personalization?
Not every RAG system needs behavioral ranking. Here's a decision framework:
Use Traditional RAG (Vector + Rerank) When:
- ✅ Queries are knowledge-seeking ("What is GDPR?")
- ✅ Users don't have interaction history (documentation search)
- ✅ Personalization is rule-based ("Show items in user's region")
- ✅ You have <10K items
Use Stateful RAG (Behavioral + Semantic) When:
- ✅ Queries are preference-driven ("What should I watch/buy/eat?")
- ✅ You have rich interaction data (clicks, purchases, time-on-page)
- ✅ Users expect personalization ("more like this")
- ✅ Catalog is large (>100K items) and dynamic
Performance Benchmarks
Based on production use cases:
Common Pitfalls to Avoid
- Over-personalizing too early: Start with good semantic search. Add behavioral ranking when you have >1000 users with >5 interactions each.
- Ignoring cold-start: Always have a fallback for new users (popularity, content similarity).
- Not excluding seen items: This is the #1 user complaint. Make exclusions declarative rather than burying them in conditional logic.
- Treating all interactions equally: A purchase is worth more than a click. Weight your events.
- Forgetting to A/B test: Measure business metrics (retention, conversion), not just relevance scores.
FAQ: Stateful RAG Architecture
Q: Can I build this with open-source tools?
A: Yes. Use PostgreSQL (pgvector + JSONB for state), Milvus/Weaviate (vectors), and train a custom two-tower model with TensorFlow Recommenders or PyTorch. Requires more engineering effort but fully possible.
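A toy sketch of the two-tower idea in PyTorch (dimensions and data are placeholders):

```python
import torch
import torch.nn as nn

class TwoTower(nn.Module):
    """Minimal two-tower model: user and item embeddings are trained so the
    dot product scores positively-interacted pairs higher."""
    def __init__(self, n_users: int, n_items: int, dim: int = 64):
        super().__init__()
        self.user_emb = nn.Embedding(n_users, dim)
        self.item_emb = nn.Embedding(n_items, dim)

    def forward(self, user_ids: torch.Tensor, item_ids: torch.Tensor) -> torch.Tensor:
        return (self.user_emb(user_ids) * self.item_emb(item_ids)).sum(-1)

model = TwoTower(n_users=1_000, n_items=5_000)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

# One toy training step on random (user, item, clicked?) triples.
users = torch.randint(0, 1_000, (256,))
items = torch.randint(0, 5_000, (256,))
labels = torch.randint(0, 2, (256,)).float()
loss = loss_fn(model(users, items), labels)
opt.zero_grad()
loss.backward()
opt.step()
```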
Q: How much interaction data do I need?
A: Minimum ~1000 users with ~5-10 interactions each to train a meaningful behavioral model. Below that, stick with rules and semantic search.
Q: Does this work for B2B SaaS agents?
A: Absolutely. Instead of "clicked restaurant," track "viewed documentation page," "used feature," "asked question about X." Same principles apply.
Q: What about privacy/GDPR?
A: User interaction data should be pseudonymized (hash user IDs). Most platforms allow data deletion via API. Shaped specifically offers GDPR-compliant data handling.
Q: Can I use this with RAG frameworks like LangChain?
A: Yes. You can replace LangChain's vector store retriever with a custom retriever that calls your behavioral ranking API.
Next Steps
- Audit your current stack: Map out where user state lives, how you fetch it, and how you synchronize it with retrieval.
- Instrument interactions: Start logging clicks, time-on-page, purchases. Even simple events unlock personalization.
- Run an A/B test: Compare semantic-only ranking vs. behavioral ranking on a small segment.
- Explore platforms: Try Shaped's free trial with $300 credits or build a proof-of-concept with open-source tools.
Conclusion: From Stateless to Stateful RAG
The future of AI agents isn't just better prompts or smarter LLMs—it's context-aware retrieval. By moving user state and behavioral ranking into your retrieval layer, you build agents that:
- Remember user preferences without prompt engineering
- Surface high-signal content in <100ms
- Respect business constraints deterministically
- Improve automatically as users interact
Whether you choose a unified platform or build it yourself, the principle remains: Stateful beats stateless when users expect personalization.