GitOps for AI Agents: Versioning Your Retrieval Logic

You wouldn't deploy code without version control. Why deploy your agent's retrieval logic as a hardcoded prompt? Learn how to declare engines, queries, and ranking logic as YAML files that live in your repo and deploy via CI/CD.

Quick Answer: Your Agent’s Logic Should Live in Git

You have 47 versions of your application code in Git. You can diff, rollback, and deploy them through CI/CD. But your agent’s retrieval logic—the ranking expressions, filters, and signal definitions that determine what context it sees—lives in:

  • A hardcoded string in your application
  • A prompt template in production without version history
  • Manual configuration in a web UI with no audit trail
  • Environment variables scattered across deployment configs

When your agent breaks, you can’t diff what changed. When you need to A/B test retrieval strategies, you duplicate code. When you want to rollback a bad ranking model, you hope you remember the previous configuration.

This is infrastructure as chaos, not infrastructure as code.

Key Takeaways:

  • Retrieval logic is infrastructure — Engines, queries, and ranking models should be versioned like any other code
  • YAML > Hardcoded strings — Declarative configuration files enable diff, review, and rollback
  • Git is your source of truth — Changes to agent behavior go through PR review, not live edits in production
  • CI/CD deploys your agent — Merge to main triggers automated deployment with zero downtime
  • Shaped CLI enables GitOps — shaped create-engine --file engine.yaml deploys from your repo

Time to read: 20 minutes | Includes: 8 code examples, 2 architecture diagrams, 1 deployment workflow


The Version Control Gap

Imagine your product recommendation agent breaks. Conversion rate drops 40% overnight. You need to find what changed.

For your application code, this is trivial:

git log --oneline --since="1 week ago"
git diff HEAD~5 HEAD -- agent/retrieval.py

You see exactly what changed. You can rollback with git revert. You can compare behavior across branches.

But for your agent’s retrieval logic—the part that determines which products get ranked, which filters apply, which signals feed the model—you have:

Option 1: Hardcoded in application

# agent.py (committed to Git ✓)
def get_recommendations(user_id, query):
    results = vector_db.search(
        query=query,
        filter="category='electronics' AND price < 500",  # ← Hardcoded
        limit=100
    )
    # Rank by score formula (also hardcoded)
    scored = [(r, r['embedding_score'] * 0.6 + r['popularity'] * 0.4) for r in results]
    return sorted(scored, key=lambda x: x[1], reverse=True)[:10]

This is versioned in Git, but changing the filter requires a code deploy. Every ranking experiment needs a new application release.

Option 2: Environment variables

# .env (NOT in Git ✗)
RETRIEVAL_FILTER="category='electronics' AND price < 500"
RANKING_FORMULA="embedding_score * 0.6 + popularity * 0.4"

This decouples retrieval from code, but now your configuration has no version history. You don’t know when RANKING_FORMULA changed or why.
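Even consuming that env var safely is non-trivial. A restrained parser (a hypothetical sketch that avoids eval; the RANKING_FORMULA name comes from the snippet above) shows how much machinery the "simple" env-var approach hides:

```python
import os
import re

# Only allow formulas of the form "<name> * <weight> + <name> * <weight> ..."
_TERM = re.compile(r"^\s*(\w+)\s*\*\s*([0-9.]+)\s*$")

def parse_weights(formula: str) -> dict[str, float]:
    """Parse a weighted-sum formula string into {feature: weight}, without eval."""
    weights = {}
    for term in formula.split("+"):
        m = _TERM.match(term)
        if not m:
            raise ValueError(f"unsupported term: {term!r}")
        weights[m.group(1)] = float(m.group(2))
    return weights

formula = os.getenv("RANKING_FORMULA", "embedding_score * 0.6 + popularity * 0.4")
print(parse_weights(formula))  # {'embedding_score': 0.6, 'popularity': 0.4}
```

And this still gives you no history of when the formula changed, only a safer way to run whatever it currently says.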

Option 3: Database-stored config

# Stored in PostgreSQL config table
config = db.query("SELECT value FROM agent_config WHERE key = 'ranking_formula'").fetchone()
ranking_formula = eval(config['value'])  # ← Executing arbitrary strings from DB

Now you have an audit trail (if you log changes), but no diff, no PR review, and no way to test changes in staging before production.

Option 4: Manual UI configuration

You log into a vector DB console and change filters/ranking via web forms. Changes take effect immediately. No history, no rollback, no review.

The Problem

None of these approaches treat retrieval logic as infrastructure—versioned, reviewed, tested, deployed through CI/CD.

When your agent breaks:

  • You can’t diff what changed in retrieval logic vs application code
  • You can’t rollback just the ranking model without a full code deploy
  • You can’t A/B test retrieval strategies across branches
  • You can’t review changes to filters/signals/ranking in a PR
  • You can’t deploy staging → production with confidence

What GitOps for Agents Looks Like

GitOps for agents means: All agent configuration lives in Git as declarative YAML files.

The Repository Structure

my-agent/
├── src/
│   └── agent.py                    # Application code
├── infrastructure/
│   ├── engines/
│   │   ├── product_search.yaml     # Search engine config
│   │   ├── recommendations.yaml    # Recommendation engine
│   │   └── content_feed.yaml       # Feed ranking
│   ├── signals/
│   │   └── features.yaml           # Feature definitions
│   └── queries/
│       ├── personalized_search.sql # Saved query templates
│       └── trending_products.sql
├── .github/
│   └── workflows/
│       └── deploy.yml              # CI/CD pipeline
└── README.md

What’s in Git: Engine definitions (indexing, models, deployment config), signal definitions (features, aggregations, crosses), query templates (ranking expressions, filters), and deployment configuration (replicas, data tier, auto-scaling).

What’s NOT in Git: API keys (stored in GitHub Secrets), model weights (stored in Shaped, versioned automatically), and runtime data (logs, metrics).

The GitOps Workflow

  1. Developer edits engine.yaml
  2. Create a PR in GitHub
  3. CI runs validation:
       shaped validate --file engine.yaml
       shaped plan --file engine.yaml
  4. PR review + approval
  5. Merge to main
  6. CD deploys to staging:
       shaped create-engine --file engine.yaml --env staging
  7. Automated tests pass
  8. CD deploys to production:
       shaped create-engine --file engine.yaml --env production
  9. Zero-downtime rollout

Key benefits:

  • Every change goes through review — No one can change ranking logic without a PR
  • Diff-able — git diff shows exactly what changed in retrieval behavior
  • Rollback-able — git revert + redeploy restores the previous config
  • Testable — Deploy to staging first, run automated tests
  • Auditable — Full history of who changed what and when
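The diff and rollback benefits need nothing vendor-specific; they fall out of putting config files under Git. A self-contained demo session (using a throwaway repo and a simplified one-line config, purely for illustration):

```shell
# Demo: config files in Git are diff-able and revert-able like any code.
set -e
tmp=$(mktemp -d) && cd "$tmp"
git init -q demo && cd demo
git config user.email demo@example.com && git config user.name demo

mkdir -p infrastructure/engines
printf 'ranking: embedding * 0.6 + popularity * 0.4\n' > infrastructure/engines/product_search.yaml
git add -A && git commit -qm "Initial ranking weights"

printf 'ranking: embedding * 0.5 + popularity * 0.3 + recency * 0.2\n' > infrastructure/engines/product_search.yaml
git commit -qam "Experiment: add recency boost"

# Exactly what changed in retrieval behavior, straight from history:
git log --oneline -- infrastructure/engines/product_search.yaml
git diff HEAD~1 HEAD -- infrastructure/engines/product_search.yaml

# Roll the weights back without touching application code:
git revert --no-edit HEAD
cat infrastructure/engines/product_search.yaml
```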

Part 1: The Traditional Approach (Hardcoded Retrieval)

The standard approach is to hardcode retrieval logic in your application, with configuration split between code, environment variables, and manual UI settings.

Architecture

  • Agent Application (Python) — hardcoded filters, .env ranking, DB config
  • Fetch User State — Redis / PostgreSQL; a separate service call with no link to retrieval
  • Vector Search — Pinecone / Weaviate; hardcoded filter expressions, no version history
  • Manual Re-rank — custom Python scoring logic; the formula lives in .env and is changed manually
  • Results returned to the agent

Problems:

  • Configuration is fragmented (code + env + UI)
  • No single source of truth
  • Changes require code deploys OR manual UI updates
  • No version history for non-code config
  • Difficult to A/B test or stage changes

Implementation

Step 1: Hardcoded retrieval in application

# agent.py
import os
from pinecone import Pinecone

pc = Pinecone(api_key=os.getenv('PINECONE_API_KEY'))
index = pc.Index("product-index")

def search_products(user_id: str, query: str):
    """
    Search for products.
    All logic is hardcoded or in env vars.
    """
    from sentence_transformers import SentenceTransformer
    embedder = SentenceTransformer('all-MiniLM-L6-v2')
    query_vector = embedder.encode(query).tolist()

    results = index.query(
        vector=query_vector,
        filter={
            "category": {"$eq": "electronics"},
            "price": {"$lt": 500},
            "in_stock": {"$eq": True}
        },
        top_k=100,
        include_metadata=True
    )

    # The formula string lives in an env var but is never actually applied:
    # the weights below are duplicated inline, so the two can silently drift apart.
    ranking_formula = os.getenv(
        'RANKING_FORMULA',
        'embedding_score * 0.6 + popularity * 0.4'
    )

    scored_results = []
    for match in results['matches']:
        embedding_score = match['score']
        popularity = match['metadata'].get('view_count', 0) / 10000
        score = embedding_score * 0.6 + popularity * 0.4
        scored_results.append({
            'product_id': match['id'],
            'name': match['metadata']['name'],
            'score': score
        })

    scored_results.sort(key=lambda x: x['score'], reverse=True)
    return scored_results[:10]

Step 2: Environment configuration

# .env (not in version control)
PINECONE_API_KEY=your-key
RANKING_FORMULA=embedding_score * 0.6 + popularity * 0.4
TOP_K=100
PRICE_LIMIT=500

What You’re Operating

| Component | Config Method | Version Control | Rollback |
| --- | --- | --- | --- |
| Filter logic | Hardcoded Python | Yes (Git) | Code deploy |
| Ranking formula | Environment variable | No | Manual change + restart |
| Index settings | Vector DB UI | No | Remember + recreate |
| Embedding model | Hardcoded in code | Yes (Git) | Code deploy |
| Top-K / limits | Environment variable | No | Manual change + restart |

The problem: Changing retrieval behavior requires mixing code deploys, environment updates, and manual UI changes. There’s no single workflow, no unified version history.


Part 2: The Shaped Way — Declarative Engines

Shaped treats engines as declarative infrastructure. You define your entire retrieval pipeline in a YAML file that lives in Git.

Architecture

  • Git Repository — infrastructure/engines/product_search.yaml
  • CI/CD Pipeline — GitHub Actions runs shaped validate and shaped plan
  • Shaped CLI — shaped create-engine --file …
  • Shaped Control Plane — validates config, compiles it to an execution graph, deploys with zero downtime, versions model weights
  • Agent queries the Shaped API — all config from Git, nothing hardcoded

Everything is versioned, everything is code.

Implementation

Step 1: Define engine in YAML

# infrastructure/engines/product_search.yaml
version: v2
name: product_search
data:
  item_table:
    name: products
    type: table
  user_table:
    name: users
    type: table
  interaction_table:
    name: user_clicks
    type: table

index:
  - name: product_embedding
    encoder:
      name: text-embedding-3-small
      provider: openai
      columns:
        - name: product_name
          weight: 0.6
        - name: description
          weight: 0.4

training:
  models:
    - name: collaborative_ranker
      policy_type: elsa
      strategy: early_stopping

deployment:
  data_tier: fast_tier
  server:
    worker_count: 4
  autoscaling:
    enabled: true
    min_instances: 2
    max_instances: 10
    target_qps: 1000

Step 2: Define ranking logic in ShapedQL

-- infrastructure/queries/personalized_search.sql
SELECT 
  item_id,
  product_name,
  price,
  category,
  score(
    expression = '
      (1.0 / (1.0 + rank(embedding="product_embedding"))) * 0.6
      + (1.0 / (1.0 + item._derived_popular_rank)) * 0.3
      + user_category_affinity * 0.1
    '
  ) as final_score
FROM products
WHERE 
  category = 'electronics'
  AND price < 500
  AND in_stock = true
ORDER BY final_score DESC
LIMIT 10

Step 3: Deploy via CLI in CI/CD

# .github/workflows/deploy-agent.yml
name: Deploy Agent Infrastructure

on:
  push:
    branches: [main]
    paths:
      - 'infrastructure/engines/**'
      - 'infrastructure/queries/**'

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      
      - name: Install Shaped CLI
        run: pip install shaped-cli
      
      - name: Validate engine config
        run: shaped validate --file infrastructure/engines/product_search.yaml
        env:
          SHAPED_API_KEY: ${{ secrets.SHAPED_API_KEY }}
      
      - name: Deploy to staging
        run: |
          shaped create-engine \
            --file infrastructure/engines/product_search.yaml \
            --env staging
        env:
          SHAPED_API_KEY: ${{ secrets.SHAPED_API_KEY_STAGING }}
      
      - name: Run integration tests
        run: python tests/test_retrieval.py --env staging
      
      - name: Deploy to production
        run: |
          shaped create-engine \
            --file infrastructure/engines/product_search.yaml \
            --env production
        env:
          SHAPED_API_KEY: ${{ secrets.SHAPED_API_KEY_PROD }}

Step 4: Query from application (no config in code)

# agent.py — now just API calls, no retrieval config
import requests
import os

SHAPED_API_KEY = os.getenv('SHAPED_API_KEY')

def search_products(user_id: str, query: str):
    response = requests.post(
        "https://api.shaped.ai/v2/rank",
        headers={"x-api-key": SHAPED_API_KEY},
        json={
            "engine_name": "product_search",
            "user_id": user_id,
            "query": query,
            # Ranking logic lives in Git, not here
        }
    )
    return response.json()['results']

Now your application code is thin—it just calls the Shaped API. All retrieval logic lives in version-controlled YAML.
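In production that thin client still deserves a timeout, error handling, and a retry. A minimal sketch (the endpoint and payload shape come from the snippet above; the retry policy and the injectable post parameter are assumptions added for robustness and testability):

```python
import os
import time

import requests

SHAPED_API_KEY = os.getenv("SHAPED_API_KEY", "")
RANK_URL = "https://api.shaped.ai/v2/rank"

def search_products(user_id: str, query: str, retries: int = 2, post=requests.post):
    """Call the rank endpoint with a timeout and a simple retry.

    `post` is injectable so the wrapper can be unit-tested without a network.
    """
    payload = {"engine_name": "product_search", "user_id": user_id, "query": query}
    last_err = None
    for attempt in range(retries + 1):
        try:
            resp = post(
                RANK_URL,
                headers={"x-api-key": SHAPED_API_KEY},
                json=payload,
                timeout=5,  # fail fast instead of hanging the agent
            )
            resp.raise_for_status()
            return resp.json()["results"]
        except requests.RequestException as err:
            last_err = err
            time.sleep(0.2 * (attempt + 1))  # brief linear backoff before retrying
    raise RuntimeError(f"rank request failed after {retries + 1} attempts") from last_err
```

The retrieval logic itself still lives entirely in Git; this wrapper only hardens the transport.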


The GitOps Workflow in Practice

1. Make Changes via PR

A developer wants to change the ranking formula:

# Create branch
git checkout -b experiment/boost-recency

# Edit engine config
vim infrastructure/engines/product_search.yaml

# The resulting diff:
--- a/infrastructure/engines/product_search.yaml
+++ b/infrastructure/engines/product_search.yaml
@@ -23,7 +23,8 @@ training:
     - name: collaborative_ranker
       policy_type: elsa
       ranking_expression: |
-        (1.0 / (1.0 + rank(embedding="product_embedding"))) * 0.6
-        + (1.0 / (1.0 + item._derived_popular_rank)) * 0.4
+        (1.0 / (1.0 + rank(embedding="product_embedding"))) * 0.5
+        + (1.0 / (1.0 + item._derived_popular_rank)) * 0.3
+        + (1.0 / (1.0 + days_since_published)) * 0.2

Reviewers can see exactly what changed—recency boost added at 20% weight, embedding down from 60% to 50%, popularity down from 40% to 30%. They can comment on the weights, suggest alternatives, and approve once satisfied.

2. CI Validates

shaped validate --file infrastructure/engines/product_search.yaml
# ✓ YAML is valid
# ✓ All referenced columns exist in data tables
# ✓ Ranking expression syntax is correct
# ✓ No breaking changes detected

3. Deploy to Staging, Test, Promote

# tests/test_retrieval.py
def test_recency_boost():
    """Verify recent products rank higher after change"""
    results = search_products(user_id="test-user", query="laptop")
    recent_items = [r for r in results[:5] if r['days_since_published'] < 7]
    assert len(recent_items) >= 2, "Recent items should rank higher"

4. Rollback if Needed

git revert HEAD
git push origin main
# CI/CD automatically redeploys previous config

Or manually:

shaped rollback-engine --name product_search --to-version v23

Real-World Examples

Example 1: A/B Test Ranking Strategies

Deploy two engine variants simultaneously, split traffic in application code, measure, commit the winner to main.

# infrastructure/engines/product_search_popular.yaml
training:
  models:
    - name: popular_ranker
      ranking_expression: |
        embedding * 0.4 + popularity * 0.6
# infrastructure/engines/product_search_personalized.yaml
training:
  models:
    - name: personalized_ranker
      ranking_expression: |
        embedding * 0.3 + user_affinity * 0.7
# agent.py
import random

def search_products(user_id, query):
    engine = "product_search_popular" if random.random() < 0.5 else "product_search_personalized"
    return shaped.rank(engine_name=engine, user_id=user_id, query=query)
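One refinement worth considering: random.random() re-assigns the user on every call, so the same user can bounce between variants mid-session. A hash-based bucket keeps assignment stable (a sketch; the engine names come from the example above):

```python
import hashlib

def assign_engine(user_id: str, split: float = 0.5) -> str:
    """Deterministically bucket a user so they always see the same variant."""
    # Hash the user id to a stable number in [0, 1).
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "product_search_popular" if bucket < split else "product_search_personalized"

# Same user, same engine, every call:
print(assign_engine("user-42") == assign_engine("user-42"))  # True
```

Stable assignment also makes the measurement cleaner, since each user's metrics accrue to exactly one variant.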

Example 2: Multi-Environment Config

Staging runs an aggressive recency experiment while production stays conservative—two branches, two YAML files, same workflow.

# staging branch: product_search.yaml
ranking_expression: |
  embedding * 0.3 + recency * 0.5 + popularity * 0.2
# main branch: product_search.yaml
ranking_expression: |
  embedding * 0.5 + recency * 0.2 + popularity * 0.3

Comparison: Traditional vs. GitOps with Shaped

| Aspect | Traditional | GitOps w/ Shaped |
| --- | --- | --- |
| Version Control | Partial (code only) | Complete (config + code) |
| Rollback | Full app redeploy | Single git revert |
| A/B Testing | Manual code duplication | Branch-based experiments |
| Review Process | Code + manual UI changes | Single PR diff |
| Deployment | Tied to releases | Independent CI/CD |
| Audit Trail | Git log only | Git + Shaped versioning |
| Debugging | “Why did this change?” → ??? | “Why did this change?” → git log -p |

FAQ

Q: Do I have to use Shaped to do GitOps for agents?

Not necessarily. You can build similar workflows with any infrastructure that supports declarative configuration (Terraform, Ansible, custom tools). Shaped is purpose-built for retrieval, so the workflow is simpler.

Q: Won’t GitOps slow down my experimentation?

The opposite. PR review actually accelerates experiments because multiple people can spot issues before production, you can revert instantly if something breaks, staging deploys are automatic and fast, and you have a complete history of what worked.

Q: What if I need to change ranking in real-time (e.g., emergency flash sale)?

Shaped supports both static (versioned) and dynamic configuration. You can have base rankings in Git and override them via API for time-limited events.

Q: How do I handle secrets (API keys, model credentials)?

Keep secrets in GitHub Secrets or your CI/CD provider. The YAML references them by name but never contains actual values.

Q: Can I revert just the ranking formula without changing filters?

Yes. Each component (embeddings, models, filters, signals) is independently versioned. You can rollback any single piece.


Getting Started

To implement GitOps for your AI agents:

  1. Audit your current setup — Where does your retrieval logic live? (code, env, UI, DB?)
  2. Export to declarative format — Convert it to YAML. This is a one-time effort.
  3. Set up CI/CD — Add validation and deployment steps to your pipeline
  4. Build in stages — Start with staging, validate, then production
  5. Monitor and iterate — Track which experiments work, commit winners to main
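Step 3 can start small even before wiring up a vendor CLI. A minimal pre-merge sanity check over the parsed engine config (a sketch: the required keys mirror the example YAML earlier, and the config is assumed already parsed, e.g. with yaml.safe_load):

```python
def validate_engine_config(config: dict) -> list[str]:
    """Return a list of problems; an empty list means the config passes."""
    problems = []
    # Required top-level sections, per the example engine YAML.
    for key in ("name", "data", "deployment"):
        if key not in config:
            problems.append(f"missing required top-level key: {key}")
    # Cheap cross-field check: autoscaling bounds must be ordered.
    autoscaling = config.get("deployment", {}).get("autoscaling", {})
    if autoscaling.get("enabled"):
        lo = autoscaling.get("min_instances", 0)
        hi = autoscaling.get("max_instances", 0)
        if lo > hi:
            problems.append(f"min_instances ({lo}) exceeds max_instances ({hi})")
    return problems

config = {
    "name": "product_search",
    "data": {"item_table": {"name": "products"}},
    "deployment": {"autoscaling": {"enabled": True, "min_instances": 2, "max_instances": 10}},
}
print(validate_engine_config(config))  # []
```

Run it in CI against every changed YAML file and fail the build on a non-empty result; richer checks can layer on later.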

The investment pays for itself the first time you need to rollback, debug, or run an A/B test.

