Reading progress

Why grep is beating your Vector DB

Information retrieval for AI agents is shifting from a static, human-designed pipeline into a dynamic control plane that agents operate themselves. This post lays out that vision: why naive RAG is giving way to iterative, heterogeneous retrieval; how the stack is splitting into ephemeral sandboxes, specialized storage primitives, and a unified orchestration layer; and what an agent-native data control plane needs to look like so agents can provision and evolve their own retrieval infrastructure, query by query.

Why grep is beating your Vector DB

I’ve been thinking a lot lately about how information retrieval is fundamentally shifting for AI agents.

If you look at the trenches of AI engineering right now, the data layer is a bit of a mess. For the last couple of years, the industry has been running on naive RAG. You take your documents, slice them into chunks, push them through a frozen embedding model like text-embedding-3, dump them into a vector database, and perform a cosine similarity search whenever a user asks a question.

That worked, until agents got smarter and context windows exploded.

Today, an agent typically has a suite of retrieval tools and decides which ones to invoke based on the incoming query. A RAG pipeline is no longer the whole solution; it’s just one tool in the kit, sitting alongside BM25 text indices, web search, or direct API calls.

When I map this to a traditional search pipeline, two major shifts stand out:

  1. The shift in Query Understanding: In early RAG, we pushed query understanding into the vector store itself. It would return the top-k similar candidates, and the generative model would act as a summarizer.
  2. The rise of Iterative Retrieval: In modern agentic retrieval (especially since the release of models like Claude Sonnet 4.5, Sept ‘25), the model handles the query understanding upfront. It decomposes the intent and orchestrates multiple retrieval calls across various systems. The results are then reranked together. This process is often iterative: the agent judges the retrieved results and decides whether it needs to go back for more. While this is slower and higher-latency, it acts as the ultimate reranker. If the goal is the right answer, the agentic loop is hard to beat.
Retrieval Then vs Now: Traditional RAG versus iterative agentic retrievalSide-by-side comparison. On the left, a linear five-step traditional RAG pipeline: Query, Embed, Vector DB top-k, LLM summarize, Answer. On the right, an iterative multi-tool agentic loop: Query, Agent decomposes intent, parallel retrieval across Vector, BM25, grep/AST, and APIs, Rerank, an “enough context?” judge step, and then either Answer or loop back for more retrieval.Retrieval: Then vs NowFrom a single-shot vector lookup to an iterative, multi-tool loopTRADITIONAL RAGsingle retrieval call · linearQueryEmbedVector DB · top-kLLM · summarizeAnswer1 tool · 1 call · no feedbackITERATIVE AGENTIC RETRIEVALmulti-tool · rerank · judge loopQueryAgent · decompose intentVectorBM25grep/ASTAPIsRerankEnough context?yesAnswerno · loopN tools · N calls · judge-driven feedback

Because of this shift, we’re seeing a demand for a highly heterogeneous data layer. Vector stores were supposed to be the “long-term memory” for AI, but it turns out semantic understanding is just one component of the stack.

In fact, grep is arguably the dominant retrieval system right now. If you look at the internals of Cursor or Claude Code, they rely heavily on grep for codebase retrieval. This makes perfect sense: code is inherently lexical. When you need to find a specific variable, function name, or file path, exact string matching is exactly what you want. Coupled with a Language Server Protocol for finding references and traversing code structure, even for massive codebases, code search agents don’t get as much from semantic indexes.

So why is grep winning? It’s not necessarily that it’s better than a vector store, it’s just frictionless. For local file use cases, it requires no indexing and thrives on simplicity. Furthermore, if we are honest about the long tail of retrieval use-cases, they aren’t enterprise-scale document lakes. A massive portion of agentic memory is actually just small sets of markdown files, like a local Claude Code setup or an OpenAPI spec. For these use cases, raw grep is perfectly sufficient. It’s a portable tool that any agent with CLI access can use. It is the ultimate retrieval starting point.

Setting up a traditional vector database for an agentic application is an operations nightmare. To get even a simple “memory” working, an engineer has to:

  • Choose an embedding model: Weighing the latency of local models vs. the cost and rate limits of API-based ones.
  • Design a chunking strategy: Making arbitrary, brittle decisions about character counts, recursive headers, and overlap percentages.
  • Build a sync pipeline: Ensuring that every time a file changes in a GitHub repo or a Slack message is sent, the vector store updates without creating duplicates or stale indices.

grep sidesteps all of this. It’s deterministic, instant, and, with 1M+ token context windows, you can just dump the raw results in and let the model’s attention mechanism do the heavy lifting.

That all being said, raw grep obviously isn’t a silver bullet. Autonomous agents frequently write bad regex or accidentally search through massive node_modules folders, blowing out their context window with garbage. Furthermore, relying on massive context windows is expensive and slow. Forcing a frontier model to parse 500,000 tokens of raw file output iteratively will burn through your API credits and spike your latency, while increasing the risk of “lost in the middle” phenomena.

This is where advanced retrieval steps back in. Pruning a search space down to highly relevant, targeted pieces of data using a vector store or search engine remains crucial for cost, scale, and latency control.

Agents as Data Architects

Right now, we are comfortable with agents operating at the query layer, an engineer sets up BM25, a vector store, and AST tools, and the agent simply routes between them. But this still relies on a human making static, upfront decisions about the data infrastructure.

I’m a big believer in self-replicating software: the idea that agents shouldn’t just use software to solve tasks; they should build the software required to solve them. When you apply this perspective to agentic retrieval, the logical conclusion is that agents will eventually be the ones choosing, provisioning, and maintaining their own data infrastructure.

Historically, humans choose one data infra, index the data once, and the system is locked in. It is fundamentally stateless. But if you give an agent a sandbox where it is responsible for solving the retrieval problem end-to-end, the data infrastructure becomes stateful and highly dynamic.

Imagine an agent dropped into a massive new codebase. For the first few queries, it might just use grep. But as it works, it notices it’s repeatedly searching for relational dependencies that are eating up its context window. Instead of throwing an error or blindly continuing to grep, the agent writes a script to parse the AST, extracts the relationships, and spins up a temporary knowledge graph in its sandbox. Or, if it gets asked broad conceptual questions about a repository’s documentation, it might autonomously decide to embed those specific markdown files and spin up a lightweight local vector index.

In this world, the index is no longer a static artifact. Every query that comes in defines how the internal data system is improved. The agent uses its knowledge of previous queries, and the bottlenecks it faced, to continuously refine how data is stored for subsequent queries.

Information retrieval is moving from a static pipeline to a dynamic decision tree, where the agent builds and provisions the right tool on the fly based on the data:

When the agent sees…It reaches for…Because
Local files or a few markdown notesgrepNo indexing, instant, deterministic
Code references & call sitesAST / LSP indexBeats fuzzy search for code
Exact-term recall over many docsBM25Keyword recall at scale
Fuzzy, conceptual questionsVector DB (Pinecone, Weaviate)Semantic similarity
Multi-hop entity relationshipsKnowledge graph (Neo4j)Relational traversal

Just as human engineers graduate from grep when a project becomes too vast or complex, agents will graduate too. The only difference is, they’ll be the ones writing the migration.

The Three Layers of Agentic Retrieval Infrastructure

For agents to actually operate as data architects navigating this decision tree, the underlying infrastructure must shift from rigid, pre-configured silos to flexible building blocks.

(To be clear, I am separating out external search APIs like Exa and verticalized SaaS like Glean or Algolia. While incredibly powerful, they serve as complete “black box” endpoints or human-facing applications. When we talk about agents constructing and managing their own dynamic memory, we are talking strictly about the infrastructure layer.)

The three layers of agentic retrieval infrastructureA vertical stack diagram. At the top sits the Agent / LLM. Layer 3, highlighted as the control plane, is Unified Orchestration and Routing, featuring Shaped, Mem0, and unified API gateways. Layer 2 is Specialized Storage Primitives, featuring Pinecone, Weaviate, Elasticsearch, OpenSearch, and Neo4j. Layer 1 is Ephemeral Sandboxes, featuring DuckDB, SQLite, in-memory FAISS, AST/LSP, and grep. At the bottom, Data Sources accessed via MCP: Shopify, Slack, Amplitude, GitHub, internal wikis, and APIs.The Three Layers of Agentic Retrieval InfrastructureAn agent on top, heterogeneous data on the bottom, a control plane in betweenAgent / LLMLAYER 3 · UNIFIED ORCHESTRATION & ROUTINGthe control plane · declarative, identity-aware, observableauth · RBAC · ETL · cache · observabilityShapedMem0Unified API GatewaysLAYER 2 · SPECIALIZED STORAGE PRIMITIVESpersistent, massive-scale · the anchor nodes of agent memoryPinecone · WeaviateElasticsearch · OpenSearchNeo4jLAYER 1 · EPHEMERAL SANDBOXESper-session scratchpads · spun up and torn down on demand via MCPDuckDBSQLiteIn-mem FAISSAST / LSPgrepvia MCPDATA SOURCESraw data the control plane extracts, syncs, and permissionsShopifySlackAmplitudeGitHubInternal WikisAPIs

When you look at this emerging infra stack today, it breaks down into three distinct components that agents must weave together:

1. Ephemeral Sandboxes (DuckDB, SQLite, Local AST, in-memory vector DBs). Not every piece of data needs to be committed to permanent storage. This layer serves as the agentic equivalent of a scratchpad. When an agent enters a new repository to resolve an issue, it needs the ability to spin up temporary infrastructure via protocols like MCP (Model Context Protocol). It might dump a subset of relational data into DuckDB or build an in-memory vector index of recent logs, execute a series of complex queries for that single session, and simply tear it down when the task is complete. It is high-speed, low-stakes, temporary memory.

2. Specialized Storage Primitives (Pinecone, Elasticsearch, Neo4j). When an agent needs long-term, massive-scale memory, it turns to the heavy lifters. If an agent is tasked with diagnosing an issue by referencing a historical index of every customer support ticket ever resolved, ephemeral sandboxes won’t cut it. It routes the data into highly performant, specialized databases. Whether it’s Pinecone for semantic similarity, Elasticsearch for massive lexical search, or Neo4j for deep relationship traversal, these systems require strict schema and cluster management, serving as the “anchor” nodes of an agent’s persistent memory.

3. Unified Orchestration & Routing (Shaped, Mem0, unified API gateways). Giving an agent access to a local SQLite scratchpad and a massive remote Pinecone cluster is useless if the underlying data state becomes a chaotic mess. This layer is the connective tissue of the agentic stack. It abstracts away the operational nightmare of keeping heterogeneous databases in sync, manages semantic caching to cut down on redundant LLM costs, and provides a single routing interface. Rather than forcing developers, or the agents themselves, to manually wire together five different databases, this orchestration layer provides the clean primitives necessary to execute the agentic decision tree seamlessly.

You might be wondering: if the agent is the data architect, shouldn’t it be the orchestration layer?

Yes and no. The agent handles the cognitive orchestration — it makes the architectural decisions about which database to use, when to spin it up, and how to query it. But we do not want to waste expensive, slow frontier-model compute on the mechanical drudgery of data plumbing.

Crucially, specialized storage primitives are just storage; they don’t handle the data pipelines or access permissioning. If an agent decides to build an index using live data from Shopify, Amplitude, or a secured internal wiki, we don’t want the LLM writing raw OAuth flows, managing SaaS API rate limits, or manually enforcing user-level access controls (RBAC) on every query.

This third layer acts as the control plane. It abstracts away the operational nightmare of extracting data from external platforms, keeping heterogeneous databases in sync, managing semantic caching, and strictly enforcing security boundaries. The agent can simply declare “keep this Shopify data synced to this local index for this authorized user,” and the infrastructure handles the rest. It gives the agent the clean, secure API primitives it needs to execute its architecture without getting bogged down in boilerplate integrations.

Building the Agentic Retrieval Plane

If we accept that agents will be the ones making architectural decisions, it fundamentally changes how we design this third layer, the unified data abstraction and ETL routing. Historically, data pipelines were built with human engineers in mind. We have tools like Fivetran or Airbyte, which are heavily reliant on GUI dashboards, static configurations, and rigid schemas. But an autonomous agent doesn’t want to click through a dashboard, and it certainly doesn’t want to be locked into a static pipeline it can’t modify when the data context shifts. So, what does the ideal technology or solution look like for this routing layer? An agent-native data control plane needs to possess a few core characteristics:

Anatomy of an agent-native data control planeA flow diagram. An Agent or LLM sends a declarative intent, formatted as JSON via MCP, into a routing layer with four pillars: 01 Declarative Provisioning (parse spec and compile pipeline), 02 Late-Binding Transforms (chunk, embed, hot-swap logic), 03 Identity-Aware Proxy (user token enforcing RLS and RBAC), and 04 Observability plus Feedback (cache hits, tokens, latency, judge signals). The layer executes against three sinks: Ephemeral Sandbox (Layer 1), Specialized Stores (Layer 2), and External Sources (raw data). A dashed telemetry feedback loop runs from Observability back up to the Agent to guide the next retrieval decision.Anatomy of an Agent-Native Data Control PlaneA retrieval intent flowing through the routing layer, with telemetry looping back to the agentAgent / LLMdeclarative intent · JSON / MCPROUTING LAYER · CONTROL PLANEdeclarative · late-binding · identity-aware · observable01DeclarativeProvisioningparse spec →compile pipeline02Late-BindingTransformschunk · embed ·hot-swap logic03Identity-AwareProxyuser token →RLS · RBAC04Observability+ Feedbackcache · tokens ·latency · judgeexecutes againstEphemeral SandboxDuckDB · SQLite · FAISS · ASTLAYER 1Specialized StoresPinecone · Elasticsearch · Neo4jLAYER 2External SourcesShopify · Slack · GitHub · wikisRAW DATAtelemetry · feedback loopAgent declares intent · plane compiles & executes · telemetry returns to guide the next decision

1. Declarative, API-First Provisioning. The layer must be entirely controllable via code, ideally integrating seamlessly with standard protocols like MCP. The agent shouldn’t have to write the execution logic. It should be able to pass a JSON payload that essentially says: “Connect to this specific Shopify store, extract the last 30 days of inventory data, apply a recursive markdown chunker, and sink it into my ephemeral DuckDB instance.” The routing layer acts as the compiler for this intent, executing the pipeline securely in the background.

2. Dynamic and Late-Binding Transformations. In a traditional pipeline, chunking strategies and embedding models are decided upfront. But in an agentic world, the agent learns from its queries. The ideal routing layer allows the agent to dynamically inject custom transformation logic into the pipeline on the fly. If the agent notices it is missing critical relationship context in its vector searches, it can ping the routing layer API to update the pipeline: “Re-index these documents, but this time, run them through an LLM extraction step first to pull out key entities, and sink that into Neo4j.”

3. Identity-Aware by Default. Security and permissioning cannot be an afterthought left to the agent. A frontier model cannot be trusted to independently enforce Row-Level Security. The ideal routing layer handles identity as a first-class primitive. When the agent initiates a retrieval query, it passes the user’s token to the routing layer, which acts as a secure proxy. The layer ensures that whether the data is coming from a live Slack API or a cached Pinecone index, the agent only receives the chunks that specific user is authorized to see.

4. Built-in Observability and Feedback Loops. If agents are going to continuously improve their own data infrastructure based on incoming queries, they need telemetry. The routing layer must expose observability metrics back to the agent in a machine-readable format. The agent needs to be able to query the routing layer to ask: “What was the cache hit rate for the last 100 queries?” or “How many tokens am I burning on this specific BM25 index vs the vector store?” This telemetry is the feedback loop that allows the self-replicating software to optimize its own architecture. Ultimately, the ideal routing layer stops treating the agent as just a “user” of a database, and starts treating it like a first-class developer. It provides the heavy industrial machinery — the ETL, the syncs, the auth, the cache — so the agent can focus entirely on being the architect.

Conclusion: Letting Agents Build Their Own Mind

For the last few years, we’ve treated AI memory as a static, human-driven engineering problem. We built rigid RAG pipelines, argued over chunking strategies, and shoved everything into vector databases, hoping it would be enough to give our models context.

But as models have grown smarter, they’ve started to outgrow these rigid structures. The widespread reliance on grep and massive context windows in tools like Cursor isn’t a step backward; it’s a symptom of models demanding frictionless, dynamic access to exactly the data they need, exactly when they need it.

We are entering an era where the concept of a single “long-term memory” database for AI is obsolete. Instead, information retrieval is becoming a fluid, heterogeneous decision tree. And the most profound shift is who gets to navigate that tree.

If we truly believe in the promise of self-replicating software, where agents write code and build tools to solve their own bottlenecks, we have to accept that agents will soon become their own data architects. They will observe the queries they are tasked with answering, provision temporary sandboxes, route complex workflows into specialized heavy-lifters, and write their own migrations as the scope of their work expands.

Our job as human engineers is no longer to build the perfect index. Our job is to build the sandbox.

By focusing on the underlying control plane, the declarative, identity-aware routing layers that handle the mechanical drudgery of ETL and synchronization, we can step out of the critical path. If we provide agents with the right infrastructure building blocks and treat them as first-class developers, we won’t need to guess how to structure their data. They will build exactly the memory they need, query by query.

See Shaped in action

Talk to an engineer about your specific use case — search, recommendations, or feed ranking.

Book a demo →

Related Articles