Building Stateful AI Agents: Why User History Matters in RAG Systems (2026 Guide)

Most RAG systems are stateless—they retrieve relevant documents but ignore who the user is and what they've done. This creates a fundamental problem: your AI agent recommends the same restaurant twice, suggests meat to vegans, or treats VIP customers like first-time visitors. This guide explains how production teams build stateful RAG systems that remember user preferences and behavior, comparing traditional multi-component architectures (vector DB + Redis + reranker) against modern unified retrieval approaches. Includes working code examples, architectural diagrams, performance benchmarks, and a decision framework for when you actually need behavioral personalization.

The Challenge: Context-Aware AI Agents in Production

You've built a RAG-powered AI agent. It answers questions brilliantly—until it recommends the same restaurant twice, suggests meat dishes to a vegan user, or ignores that your VIP customer has a $10,000 budget while a first-time visitor has $100.

The problem? Most RAG systems are stateless. They retrieve relevant documents but ignore who the user is, what they've done, and what they actually care about.

In this guide, we'll explore:

  • How production teams currently handle user state in RAG systems
  • The architectural tradeoffs of traditional vs. unified approaches
  • A practical implementation guide for stateful agents
  • When you need behavioral personalization vs. simple context

How Most Teams Build Stateful RAG Today

Based on production architectures from companies building AI agents in 2026, here's the common pattern:

The Traditional Multi-Component Stack

Vector Storage Layer (Long-term Memory)

  • Popular choices: Pinecone, Weaviate, Qdrant, Milvus, Chroma
  • What it stores: Document embeddings, product catalogs, knowledge bases
  • Strength: Fast semantic search at scale
  • Limitation: No understanding of user behavior or preferences

Session/Profile Storage (Short-term Memory)

  • Popular choices: Redis, PostgreSQL with JSONB, MongoDB, DynamoDB
  • What it stores: User preferences, session state, interaction history
  • Strength: Fast key-value lookups
  • Limitation: Disconnected from retrieval logic

Reranking Layer (Relevance Intelligence)

  • Popular choices: Cohere Rerank, LangChain contextual compression, custom models
  • What it does: Reorders results based on semantic similarity to query
  • Strength: Improves precision without reindexing
  • Limitation: Primarily linguistic—limited behavioral awareness

Orchestration

  • Popular choices: LangChain, LlamaIndex, custom Python/TypeScript
  • What it does: Fetches user state, queries vectors, reranks, prompts LLM
  • Limitation: You maintain all synchronization logic

Architecture Diagram: Traditional Approach

User Query → [App Logic]
    1. Fetch Profile (user is vegan)   ← Redis/PostgreSQL (user_prefs table)
    2. Vector Search (50 docs)         ← Pinecone/Weaviate (restaurant_catalog)
    3. Rerank                          ← Cohere API
    4. Prompt LLM → Output

Three Production Challenges with Traditional Architectures

1. The Synchronization Tax

Problem: User state lives separately from retrieval logic.

Real scenario:

  • User clicks on a luxury hotel at 2:37 PM
  • You must manually:
    1. Update Redis with the interaction
    2. Query vector DB with exclusion filter (NOT hotel_id = 'clicked_item')
    3. Somehow bias future results toward "luxury" tier

Code reality:

manual_glue_code.py
# You write and maintain this glue code. Assumes pre-initialized async
# clients (`redis`, `pinecone`, `cohere`) and an `embed()` helper that
# returns the query embedding.
async def get_recommendations(user_id: str, query: str):
    # Fetch user history from Redis
    profile = await redis.get(f"user:{user_id}:profile")
    recent_clicks = await redis.lrange(f"user:{user_id}:clicks", 0, 20)
    
    # Query vector DB with exclusions
    results = await pinecone.query(
        vector=embed(query),
        filter={"id": {"$nin": recent_clicks}},  # Manual exclusion
        top_k=50
    )
    
    # Rerank (still doesn't know user clicked luxury items)
    reranked = await cohere.rerank(
        query=f"{query}. User preferences: {profile}",  # Textified state
        documents=results
    )
    
    return reranked[:5]

2. Linguistic vs. Behavioral Mismatch

Problem: Rerankers optimize for semantic similarity, not user intent.

Example:

  • Query: "dinner recommendations"
  • Vector search returns: Italian, Japanese, Steakhouse (all semantically relevant)
  • User history: Clicked 5 vegan restaurants, never clicked meat-heavy options
  • Cohere rerank: Still surfaces Steakhouse because the description semantically matches "dinner"

Why? Traditional rerankers are trained on linguistic datasets (MS MARCO, Natural Questions), not behavioral data (clicks, purchases, time-on-page).

3. Positional Bias in Context Windows

Problem: LLMs suffer from "lost in the middle" effects.

When you textify user state and prepend it to the prompt:

System: User is vegan, prefers budget-friendly, lives in Brooklyn.

[49 restaurant documents inserted here]

User query: "Where should I eat tonight?"

If the perfect match (vegan Brooklyn spot) is document #31, the LLM may ignore it in favor of documents at positions 1-5 or 45-49.
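One common mitigation (a sketch, not part of the stack described above) is to reorder retrieved documents so the highest-ranked ones sit at the edges of the context window, where LLM recall tends to be strongest:

```python
# Sketch of a common "lost in the middle" mitigation: interleave the
# ranked documents so the strongest matches land at the start and end
# of the context rather than buried in the middle.
def edge_ordered(docs_ranked_best_first):
    """Alternate docs to the front and back, best-ranked at the edges."""
    front, back = [], []
    for i, doc in enumerate(docs_ranked_best_first):
        (front if i % 2 == 0 else back).append(doc)
    return front + back[::-1]

docs = ["best", "2nd", "3rd", "4th", "5th"]
print(edge_ordered(docs))  # ['best', '3rd', '5th', '4th', '2nd']
```

With this ordering, the vegan Brooklyn spot ranked #1 lands at position 1 instead of drifting toward the middle of a 49-document context.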

Alternative Approach: Unified Retrieval with Behavioral Ranking

The Concept

Instead of separating vector search, user state, and reranking into distinct systems, some modern platforms combine them into a unified retrieval layer that natively understands both content and behavior.

Key difference: Ranking is done using behavioral ML models (collaborative filtering, two-tower networks, learning-to-rank) trained on actual user interactions, not just semantic similarity.
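To make the difference concrete, here is a minimal, hypothetical sketch of behavioral-plus-semantic blending. The vectors, weights, and scores are invented for illustration; they are not Shaped's internals:

```python
# Illustrative blend of a behavioral affinity score (two-tower style:
# dot product of learned user and item embeddings) with semantic text
# similarity. All numbers here are made up for the example.
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def blended_score(user_vec, item_vec, semantic_sim, w_behavioral=0.7):
    """0.7 * behavioral affinity + 0.3 * semantic similarity."""
    return w_behavioral * dot(user_vec, item_vec) + (1 - w_behavioral) * semantic_sim

# A vegan-leaning user outranks the steakhouse even though its
# description matches "dinner" slightly better semantically.
user = [0.9, 0.1]          # learned from clicks on vegan restaurants
vegan_bistro = [0.8, 0.2]  # item embedding
steakhouse = [0.1, 0.9]
print(blended_score(user, vegan_bistro, semantic_sim=0.70))  # ≈ 0.728
print(blended_score(user, steakhouse, semantic_sim=0.80))    # ≈ 0.366
```

A purely semantic reranker sees only the 0.70 vs. 0.80 text scores and picks the steakhouse; the behavioral term flips the ordering.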

Architecture Comparison

Component       Traditional Stack    Unified Approach
Vector Search   Pinecone/Weaviate    Integrated
User State      Redis/PostgreSQL     Integrated (fast tier)
Ranking         Cohere Rerank        Behavioral ML
Exclusions      Manual logic         SQL-like rules
State Sync      Your code            Automatic
Latency         3-4 network hops     1 query

Implementation Guide: Building a Stateful Agent

We'll use Shaped as the example unified platform, but the concepts apply to building custom solutions.

Step 1: Define Your Data Schema

Connect three types of data using YAML schema files:

users_schema.yaml
name: users
schema_type: CUSTOM
column_schema:
  user_id: String
  signup_date: DateTime
  subscription_tier: String # free, pro, enterprise
  preferences: String       # JSON string with dietary restrictions, interests

restaurants_schema.yaml
name: restaurants
schema_type: CUSTOM
column_schema:
  item_id: String     # Note: Must be named item_id for Shaped engines
  name: String
  cuisine: String
  price_tier: Int32   # 1-4 ($-$$$$)
  description: String

interactions_schema.yaml
name: interactions
schema_type: CUSTOM
column_schema:
  user_id: String
  item_id: String
  interaction_type: String # clicked, booked, reviewed
  created_at: DateTime
  label: Int32             # 1 for positive interaction, 0 for negative

Create the tables via CLI:

terminal
$ shaped create-table --file users_schema.yaml
$ shaped create-table --file restaurants_schema.yaml
$ shaped create-table --file interactions_schema.yaml

Step 2: AI Enrichment (Optional but Powerful)

Use AI Views to materialize high-level user intent from raw interaction logs. Create a view definition:

user_taste_profile_view.yaml
# user_taste_profile_view.yaml
name: user_taste_profile
view_type: AI
source_table: interactions
prompt: |
  Analyze this user's last 20 restaurant interactions.
  Extract:
  1. Cuisine preferences (ranked)
  2. Price sensitivity (budget, moderate, luxury)
  3. Dining occasion (casual, date night, business)

  Return as JSON with these exact keys: cuisine_preferences (array), 
  price_sensitivity (string), dining_occasions (array).
output_columns:
  - name: user_id
    type: String
  - name: cuisine_preferences
    type: String
  - name: price_sensitivity
    type: String
  - name: dining_occasions
    type: String

Create the view:

terminal
$ shaped create-view --file user_taste_profile_view.yaml

Output example:

output.json
{
    "user_id": "user_123",
    "cuisine_preferences": "[\"Italian\", \"Japanese\", \"Mediterranean\"]",
    "price_sensitivity": "luxury",
    "dining_occasions": "[\"date_night\", \"celebration\"]"
}

Why this matters: Instead of passing 20 raw click events to your LLM prompt, you pass a concise summary. Reduces tokens, improves signal.
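A rough illustration of the savings (the event and summary shapes below are invented for the comparison, not taken from the platform):

```python
# Illustrative only: compare the prompt footprint of 20 raw click events
# vs. a materialized taste-profile summary like the one above.
import json

raw_events = [
    {"item_id": f"restaurant_{i}", "interaction_type": "clicked"}
    for i in range(20)
]
summary = {
    "cuisine_preferences": ["Italian", "Japanese", "Mediterranean"],
    "price_sensitivity": "luxury",
}

raw_prompt = "User history: " + json.dumps(raw_events)
summary_prompt = "User taste profile: " + json.dumps(summary)

# The summary carries the decision-relevant signal in a fraction of the
# characters (and therefore tokens).
print(len(raw_prompt), len(summary_prompt))
```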

Step 3: Train a Behavioral Ranking Model

Define a ranking model that learns the relationship between user state and items:

restaurant_engine.yaml
# restaurant_engine.yaml
name: restaurant_recommendation_engine

data:
  item_table:
    name: "restaurants"
    type: table
  user_table:
    name: "user_taste_profile"
    type: table
  interaction_table:
    name: "interactions"
    type: table

training:
  models:
    - name: user_affinity_model
      policy_type: elsa  # Two-tower architecture, good for personalization

Create the engine:

terminal
$ shaped create-engine --file restaurant_engine.yaml

What this does:

  • Learns that users who clicked luxury Italian restaurants are likely to engage with similar venues
  • Automatically handles cold-start (new users) by falling back to content similarity
  • Updates in real-time as new interactions stream in

Check engine status:

terminal
$ shaped list-engines

Wait for status to be ACTIVE (may take 30 minutes to several hours depending on data size).

Step 4: Real-Time State Updates

When a user interacts with an item, stream the event to Shaped. If using real-time connectors (Kafka, Kinesis, Segment), events are automatically streamed. For custom events, insert via API:

terminal
# Via CLI (for testing)
$ echo "user_123,restaurant_456,clicked,2025-02-10T14:30:00,1" >> new_interaction.csv
$ shaped table-insert --table-name interactions --file new_interaction.csv --type csv

Or via REST API:

stream_interactions.py
import requests

requests.post(
    "https://api.shaped.ai/v2/tables/interactions/insert",
    headers={"x-api-key": "YOUR_API_KEY"},
    json={
        "rows": [{
            "user_id": "user_123",
            "item_id": "restaurant_456",
            "interaction_type": "clicked",
            "created_at": "2025-02-10T14:30:00Z",
            "label": 1
        }]
    }
)

What happens automatically:

  • Engine indexes the new interaction within 30 seconds (real-time connectors)
  • User's latent representation updates for personalized queries
  • Popularity scores recalculate
  • No manual cache invalidation needed

Step 5: Querying with Behavioral + Semantic Ranking

Use ShapedQL to combine filters, semantic search, and behavioral ranking:

recommendations.sql
-- Query personalized recommendations with business rules
SELECT
    item_id,
    name,
    cuisine,
    price_tier
FROM similarity(
    embedding_ref = 'user_affinity_model',
    encoder = 'precomputed_user',
    input_user_id = $user_id,
    limit = 100
)
WHERE
    -- Hard business rules (deterministic)
    prebuilt('exclude_seen', input_user_id = $user_id)
    AND price_tier <= $max_budget
    AND is_open_now = true
ORDER BY
    -- Blend behavioral model score with text similarity
    score(
        expression = '0.7 * user_affinity_model + 0.3 * text_similarity',
        input_user_id = $user_id
    )
LIMIT 5

Execute via CLI:

terminal
$ shaped query \
  --engine-name restaurant_recommendation_engine \
  --query "SELECT * FROM similarity(embedding_ref='user_affinity_model', encoder='precomputed_user', input_user_id=\$user_id, limit=100) WHERE prebuilt('exclude_seen', input_user_id=\$user_id) LIMIT 5" \
  --parameters '{"user_id": "user_123"}'

Or via REST API:

terminal
$ curl https://api.shaped.ai/v2/engines/restaurant_recommendation_engine/query \
  -X POST \
  -H "x-api-key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "SELECT * FROM similarity(embedding_ref=''user_affinity_model'', encoder=''precomputed_user'', input_user_id=$user_id, limit=100) WHERE prebuilt(''exclude_seen'', input_user_id=$user_id) LIMIT 5",
    "parameters": {"user_id": "user_123"}
  }'

Key benefits:

  1. Deterministic exclusions: Seen items are filtered out at query time, so they can never surface in results
  2. Behavioral ranking: User affinity score is learned from actual engagement
  3. Semantic fallback: Text similarity handles novel queries
  4. Single query: No multi-hop fetches

Step 6: Integration with Your Agent

agent.ts
// TypeScript example using fetch
import Anthropic from '@anthropic-ai/sdk';

const anthropic = new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY });

async function runAgent(userId: string, query: string) {
    // Get personalized, state-aware results from Shaped
    const response = await fetch(
        'https://api.shaped.ai/v2/engines/restaurant_recommendation_engine/query',
        {
            method: 'POST',
            headers: {
                'Content-Type': 'application/json',
                'x-api-key': process.env.SHAPED_API_KEY!
            },
            body: JSON.stringify({
                query: `SELECT * FROM similarity(embedding_ref='user_affinity_model', encoder='precomputed_user', input_user_id=$user_id, limit=100) WHERE prebuilt('exclude_seen', input_user_id=$user_id) LIMIT 5`,
                parameters: { user_id: userId },
                return_metadata: true
            })
        }
    );

    const { results } = await response.json();

    // Pass high-signal context to LLM
    const completion = await anthropic.messages.create({
        model: "claude-sonnet-4-5-20250929",
        messages: [{
            role: "user",
            content: `Based on these personalized recommendations:
             ${JSON.stringify(results, null, 2)}

             Answer the user's query: "${query}"`
        }]
    });

    return completion.content;
}

// Usage (call from an async context)
const response = await runAgent("user_123", "Where should I eat tonight?");
console.log(response);

When Do You Need Behavioral Personalization?

Not every RAG system needs behavioral ranking. Here's a decision framework:

Use Traditional RAG (Vector + Rerank) When:

  • ✅ Queries are knowledge-seeking ("What is GDPR?")
  • ✅ Users don't have interaction history (documentation search)
  • ✅ Personalization is rule-based ("Show items in user's region")
  • ✅ You have <10K items

Use Stateful RAG (Behavioral + Semantic) When:

  • ✅ Queries are preference-driven ("What should I watch/buy/eat?")
  • ✅ You have rich interaction data (clicks, purchases, time-on-page)
  • ✅ Users expect personalization ("more like this")
  • ✅ Catalog is large (>100K items) and dynamic

Performance Benchmarks

Representative figures from production use cases:

Metric          Traditional Stack    Unified Behavioral
Query Latency   200-400ms            50-100ms
Precision@5     0.42 (semantic)      0.68 (behavioral)
Cold Start      Manual fallback      Auto content-based
Updates         Custom pipeline      Native event stream

Common Pitfalls to Avoid

  1. Over-personalizing too early: Start with good semantic search. Add behavioral ranking when you have >1000 users with >5 interactions each.
  2. Ignoring cold-start: Always have a fallback for new users (popularity, content similarity).
  3. Not excluding seen items: This is the #1 user complaint. Make exclusions declarative, not conditional logic.
  4. Treating all interactions equally: A purchase is worth more than a click. Weight your events.
  5. Forgetting to A/B test: Measure business metrics (retention, conversion), not just relevance scores.
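Pitfalls #2 and #4 can be sketched together. The weights and threshold below are illustrative assumptions, not recommendations from any particular platform:

```python
# Sketch: weight interaction types by value (pitfall #4) and fall back
# for cold-start users (pitfall #2). Numbers are illustrative.
EVENT_WEIGHTS = {"clicked": 1.0, "reviewed": 3.0, "booked": 5.0}

def user_affinity(events, min_events=5):
    """Weighted interaction sum; None signals 'fall back to popularity'."""
    if len(events) < min_events:
        return None  # cold start: use popularity/content-based ranking
    return sum(EVENT_WEIGHTS.get(e, 0.0) for e in events)

print(user_affinity(["clicked"] * 3 + ["booked", "reviewed"]))  # 11.0
print(user_affinity(["clicked"]))                               # None
```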

FAQ: Stateful RAG Architecture

Q: Can I build this with open-source tools?
A: Yes. Use PostgreSQL (pgvector + JSONB for state), Milvus/Weaviate (vectors), and train a custom two-tower model with TensorFlow Recommenders or PyTorch. Requires more engineering effort but fully possible.

Q: How much interaction data do I need?
A: Minimum ~1000 users with ~5-10 interactions each to train a meaningful behavioral model. Below that, stick with rules and semantic search.

Q: Does this work for B2B SaaS agents?
A: Absolutely. Instead of "clicked restaurant," track "viewed documentation page," "used feature," "asked question about X." Same principles apply.

Q: What about privacy/GDPR?
A: User interaction data should be pseudonymized (hash user IDs). Most platforms allow data deletion via API. Shaped specifically offers GDPR-compliant data handling.

Q: Can I use this with RAG frameworks like LangChain?
A: Yes. You can replace LangChain's vector store retriever with a custom retriever that calls your behavioral ranking API.

Next Steps

  1. Audit your current stack: Map out where user state lives, how you fetch it, and how you synchronize it with retrieval.
  2. Instrument interactions: Start logging clicks, time-on-page, purchases. Even simple events unlock personalization.
  3. Run an A/B test: Compare semantic-only ranking vs. behavioral ranking on a small segment.
  4. Explore platforms: Try Shaped's free trial with $300 credits or build a proof-of-concept with open-source tools.

Conclusion: From Stateless to Stateful RAG

The future of AI agents isn't just better prompts or smarter LLMs—it's context-aware retrieval. By moving user state and behavioral ranking into your retrieval layer, you build agents that:

  • Remember user preferences without prompt engineering
  • Surface high-signal content in <100ms
  • Respect business constraints deterministically
  • Improve automatically as users interact

Whether you choose a unified platform or build it yourself, the principle remains: Stateful beats stateless when users expect personalization.

