Multimodal Agents: Extracting Agent Context from Images

E-commerce and social agents fail when visual data is locked in images. Learn how to use vision-language models in AI Views to automatically extract searchable text descriptions from product photos, user content, and visual data—so your agent can reason over what it sees.

Quick Answer: Why Your Agent Can’t See

An AI agent answers from the context it retrieves. If your product catalog has images but no textual descriptions, the agent can’t search them. A user asks “show me navy blue dresses with floral patterns” — your database has the images, but the agent retrieves nothing because there’s no text to match.

That’s not a model problem. That’s a visual blindness problem — your images contain semantic information, but your agent can only search text.

Key Takeaways:

  • Images are semantically rich but unsearchable — A product photo contains color, style, pattern, material, but agents can’t retrieve it without text
  • Vision-language models extract searchable descriptions — GPT-4V, Claude 3, Gemini Vision analyze images and generate text descriptions
  • AI Views materialize visual context at ingestion — Extract descriptions once when images are uploaded, not every time the agent queries
  • Works on any visual data — Product photos, user-generated content, social media images, document scans, receipts
  • Zero inference-time cost — The agent retrieves pre-extracted descriptions, no vision model call during search

Time to read: 20 minutes | Includes: 7 code examples, 2 architecture diagrams, 1 comparison table


Table of Contents

  1. The Visual Blindness Problem
  2. What Multimodal AI Views Do
  3. Part 1: The Traditional Approach
  4. Part 2: The Shaped Way — Vision AI Views
  5. Real-World Use Cases
  6. Comparison Table
  7. FAQ

The Visual Blindness Problem

Imagine an e-commerce fashion agent. A user searches:

User: “Show me navy blue dresses with floral patterns and long sleeves”

The product database has 50,000 dress images. 127 of them match this description. But here’s what the database actually stores:

{
  "product_id": "DRESS-4821",
  "name": "Summer Dress",
  "category": "Dresses",
  "price": 89.99,
  "image_url": "https://cdn.example.com/dresses/4821.jpg",
  "in_stock": true
}

The image_url points to a photo of a navy blue floral dress with long sleeves. But there’s no text field that says “navy blue,” “floral pattern,” or “long sleeves.” The agent can’t retrieve it.

What Happens in Production

Agent query: “navy blue dresses with floral patterns”

Vector search result: Returns nothing, because there’s no textual match for “navy blue” or “floral” in the indexed fields.

Agent response:

“I couldn’t find any navy blue floral dresses. Would you like to browse our dress collection?”

What the user should have seen:

“Here are 127 navy blue floral dresses. The top match is our Summer Maxi Dress with a delicate floral print and long sleeves, currently $89.99.”

The information exists — it’s in the image. But it’s locked away in pixels, invisible to text-based retrieval.

Why This Happens Everywhere

E-commerce: Product images uploaded by merchants often have minimal metadata. The image shows “burgundy velvet blazer with gold buttons” but the product name is just “Blazer - SKU 8472.”

Social media: User-generated content is entirely visual. A photo of a sunset over a beach has no caption. An agent can’t retrieve it for “beach sunset photos” because there’s no text to match.

Real estate: Listing photos show “modern kitchen with marble countertops and stainless steel appliances” but the database field just says kitchen_photo_1.jpg.

Document management: Scanned receipts, invoices, contracts contain critical text and visual layout, but they’re stored as image files with no searchable content.

The semantic gap:

Image contains → Navy blue dress, floral pattern, long sleeves, A-line cut
Database has  → "Summer Dress", category: "Dresses"
Agent retrieves → Nothing (no text match)

What Multimodal AI Views Do

A multimodal AI View uses a vision-language model (like GPT-4V, Claude 3 Sonnet, or Gemini Vision) to analyze images and generate textual descriptions. These descriptions are materialized as searchable fields in your index.

The Pipeline

Product upload (image: dress-4821.jpg)
    → AI View: vision enrichment (GPT-4V / Claude 3 vision LM)
    → Generated description: “Navy blue dress, floral print, long sleeves”
    → Shaped index: image_description field materialized (searchable)
    → Agent query “navy blue floral dresses” → retrieves exact match: FOUND

Key Properties

1. Write-time extraction: The vision model runs when the image is uploaded, not when the agent queries. Extraction cost is paid once per image, not on every search.

2. Materialized as text: The image description is stored as a searchable text field. The agent retrieves it using standard vector search — no vision model call during retrieval.

3. Automatic updates: When a product image changes, the AI View re-extracts the description. The index stays current with the visual content.

4. Source images stay untouched: The original image_url is preserved. The AI View adds a new image_description column — it doesn’t replace anything.
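A back-of-the-envelope sketch of why write-time extraction matters. The per-call price and traffic figures below are illustrative assumptions, not Shaped pricing:

```python
# Assumed figures for illustration only (in cents, to keep arithmetic exact).
VISION_CALL_CENTS = 3        # assumed 3 cents per vision-model call
NUM_IMAGES = 10_000          # catalog size
QUERIES_PER_DAY = 50_000     # agent search volume

# Write-time extraction: one vision call per image, paid once.
write_time_total = NUM_IMAGES * VISION_CALL_CENTS / 100      # dollars, one-time

# Query-time vision: one call per search, paid on every query, forever.
query_time_daily = QUERIES_PER_DAY * VISION_CALL_CENTS / 100  # dollars, per day

print(write_time_total)   # 300.0  (one-time)
print(query_time_daily)   # 1500.0 (every single day)
```

Under these assumptions, query-time vision calls cost more per day than enriching the entire catalog costs once.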


Part 1: The Traditional Approach (Manual Tagging)

The standard solution is to manually tag images with textual attributes. A human (or a team of humans) looks at each product image and fills in structured fields.

Architecture

Product upload with image
    → Merchant portal: a human reviews the image and tags it manually
    → Structured fields entered by hand: color, pattern, size, material
    → Database storage (PostgreSQL / search index): tags indexed
    → Agent retrieval: searches the structured fields

This works for small catalogs. It breaks at scale.

Implementation

Step 1: Tagging interface

# merchant_tagging_portal.py
from flask import Flask, render_template, request
import psycopg2

app = Flask(__name__)

@app.route('/tag-product/<product_id>')
def tag_product(product_id):
    """
    Show tagging form for a product image.
    """
    conn = psycopg2.connect("dbname=ecommerce user=admin")
    cursor = conn.cursor()
    
    cursor.execute("""
        SELECT product_id, product_name, image_url
        FROM products
        WHERE product_id = %s
    """, (product_id,))
    
    product = cursor.fetchone()
    conn.close()
    
    return render_template('tag_form.html', product=product)

@app.route('/save-tags/<product_id>', methods=['POST'])
def save_tags(product_id):
    """
    Save manually entered tags to database.
    """
    tags = {
        'color': request.form.get('color'),
        'pattern': request.form.get('pattern'),
        'sleeve_length': request.form.get('sleeve_length'),
        'neckline': request.form.get('neckline'),
        'material': request.form.get('material'),
        'style': request.form.get('style'),
        'length': request.form.get('length')
    }
    
    conn = psycopg2.connect("dbname=ecommerce user=admin")
    cursor = conn.cursor()
    
    cursor.execute("""
        UPDATE products
        SET color = %s, pattern = %s, sleeve_length = %s,
            neckline = %s, material = %s, style = %s, length = %s,
            tags_updated_at = NOW()
        WHERE product_id = %s
    """, (tags['color'], tags['pattern'], tags['sleeve_length'],
          tags['neckline'], tags['material'], tags['style'],
          tags['length'], product_id))
    
    conn.commit()
    conn.close()
    return "Tags saved"

Step 2: Tag schema

-- product_tags.sql
ALTER TABLE products ADD COLUMN color VARCHAR(100);
ALTER TABLE products ADD COLUMN pattern VARCHAR(100);
ALTER TABLE products ADD COLUMN sleeve_length VARCHAR(50);
ALTER TABLE products ADD COLUMN neckline VARCHAR(50);
ALTER TABLE products ADD COLUMN material VARCHAR(100);
ALTER TABLE products ADD COLUMN style VARCHAR(100);
ALTER TABLE products ADD COLUMN length VARCHAR(50);
ALTER TABLE products ADD COLUMN tags_updated_at TIMESTAMP;

CREATE INDEX idx_products_color ON products(color);
CREATE INDEX idx_products_pattern ON products(pattern);
CREATE INDEX idx_products_material ON products(material);

Step 3: Agent retrieval with structured tags

# agent_search.py
import psycopg2

def search_products(query: str):
    """
    Search products using manually tagged attributes.
    """
    # Extract attributes from natural language query
    # (This requires NLP parsing or keyword matching)
    attributes = extract_attributes(query)
    # Example: {"color": "navy blue", "pattern": "floral", "sleeve_length": "long"}
    
    conn = psycopg2.connect("dbname=ecommerce user=admin")
    cursor = conn.cursor()
    
    # Build SQL WHERE clause from extracted attributes
    where_clauses = []
    params = []
    
    if 'color' in attributes:
        where_clauses.append("color ILIKE %s")
        params.append(f"%{attributes['color']}%")
    
    if 'pattern' in attributes:
        where_clauses.append("pattern ILIKE %s")
        params.append(f"%{attributes['pattern']}%")
    
    if 'sleeve_length' in attributes:
        where_clauses.append("sleeve_length = %s")
        params.append(attributes['sleeve_length'])
    
    where_sql = " AND ".join(where_clauses) if where_clauses else "1=1"
    
    cursor.execute(f"""
        SELECT product_id, product_name, image_url, color, pattern, sleeve_length
        FROM products
        WHERE {where_sql}
        LIMIT 20
    """, params)
    
    results = cursor.fetchall()
    return results
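The extract_attributes helper referenced above is left undefined. A minimal sketch, assuming a naive keyword matcher over a fixed vocabulary (a real system would need an NLP parser or an LLM call, and this is exactly where the synonym problem bites):

```python
# Hypothetical fixed vocabularies -- every attribute value must be
# anticipated up front, which is the approach's core weakness.
COLOR_TERMS = ["navy blue", "navy", "red", "black", "white", "burgundy"]
PATTERN_TERMS = ["floral", "striped", "geometric", "solid"]
SLEEVE_TERMS = {"long sleeve": "long", "short sleeve": "short",
                "sleeveless": "sleeveless"}

def extract_attributes(query: str) -> dict:
    """Map a natural-language query to structured filter attributes."""
    q = query.lower()
    attrs = {}
    for color in COLOR_TERMS:          # first (most specific) match wins
        if color in q:
            attrs["color"] = color
            break
    for pattern in PATTERN_TERMS:
        if pattern in q:
            attrs["pattern"] = pattern
            break
    for phrase, value in SLEEVE_TERMS.items():
        if phrase in q:
            attrs["sleeve_length"] = value
            break
    return attrs

print(extract_attributes("navy blue dresses with floral patterns and long sleeves"))
# {'color': 'navy blue', 'pattern': 'floral', 'sleeve_length': 'long'}
```

Note that "full sleeves" or "cobalt" would silently extract nothing — the matcher only knows the phrases it was given.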

What You’re Operating

| Component | What It Is | Failure Mode |
|---|---|---|
| Tagging portal | Flask app for manual data entry | Slow, inconsistent, expensive |
| Human taggers | Team of people viewing images | High error rate, terminology inconsistency |
| Structured schema | Fixed columns for each attribute | Rigid, can’t handle new product types |
| Attribute extraction | NLP to parse queries into filters | Misses synonyms (“long sleeve” vs “full sleeve”) |
| Tag maintenance | Update tags when images change | Often forgotten, tags go stale |

The cost:

  • Tagging throughput: 1 human tagger can tag ~50-100 products/hour at $15-25/hour
  • For 10K products: 100-200 hours of tagging = $1,500-5,000 initial cost
  • Ongoing: Every new product upload requires manual tagging before it’s searchable
  • Quality: 15-25% error rate on subjective attributes like “style” or “pattern”
  • Scalability: Breaks completely for user-generated content (can’t manually tag millions of social media images)

Part 2: The Shaped Way — Vision AI Views

Shaped’s multimodal AI Views use vision-language models to automatically extract descriptions from images at ingestion time. You define a view that includes image_url in the source columns, write a prompt describing what to extract, and the vision model generates textual descriptions that get indexed.

Architecture

Product upload with image
    → Shaped ingestion detects image_url in source_columns
    → AI View: vision enrichment triggered at ingestion time (source: image_url)
    → Vision model (GPT-4V / Claude 3) fetches the image from its https:// URL and analyzes it
    → Generated description extracted (“Navy blue dress, floral pattern…”)
    → Shaped index: image_description field materialized, searchable and queryable
    → Agent retrieval: vector search on image_description → exact match FOUND

Implementation

Step 1: Connect your product table

# products_table.yaml
version: v2
name: products
schema_type: POSTGRES
host: postgres.example.com
port: 5432
database: ecommerce
table_name: products
schema:
  - name: product_id
    type: STRING
  - name: product_name
    type: STRING
  - name: category
    type: STRING
  - name: price
    type: FLOAT
  - name: image_url
    type: STRING  # ← Critical: vision model reads from this column
                  # Must contain publicly accessible URLs (https://...)
  - name: in_stock
    type: BOOLEAN
  - name: created_at
    type: TIMESTAMP

shaped create-table --file products_table.yaml

Step 2: Create the vision AI View

curl -X POST "https://api.shaped.ai/v2/views" \
  -H "Content-Type: application/json" \
  -H "x-api-key: $SHAPED_API_KEY" \
  -d '{
    "name": "products_with_image_descriptions",
    "view_type": "AI_ENRICHMENT",
    "source_table": "products",
    "source_columns": [
      "product_id",
      "product_name",
      "category",
      "image_url"
    ],
    "source_columns_in_output": [
      "product_id",
      "product_name",
      "category",
      "price",
      "image_url"
    ],
    "enriched_output_columns": [
      "image_description"
    ],
    "prompt": "Analyze the product image and describe the item'\''s visual characteristics in detail. Focus on:\n- Primary color and any accent colors\n- Pattern (solid, floral, striped, geometric, etc.)\n- Style and cut (A-line, fitted, oversized, etc.)\n- Sleeve length (sleeveless, short, 3/4, long)\n- Neckline type (crew, V-neck, scoop, etc.)\n- Material texture and finish (cotton, silk, leather, etc.)\n- Length (mini, knee-length, midi, maxi)\n- Any visible branding, text, or distinctive features\n\nBe factual and specific. Avoid marketing language."
  }'

What this does:

  • Reads image_url and other metadata from the products table
  • The image_url column must contain publicly accessible URLs (the vision model fetches images from these URLs)
  • Note: If your image column has a different name (e.g., photo_url, img_path), create an SQL view first to rename it to image_url
  • Passes the image to a vision-language model (GPT-4V or Claude 3 Sonnet)
  • Vision model analyzes the image and generates a description following the prompt instructions
  • Description is stored in the image_description column
  • This enrichment runs once when the product is uploaded or updated
  • The enriched view products_with_image_descriptions is now a queryable table

Output example:

Given this input row:

{
  "product_id": "DRESS-4821",
  "product_name": "Summer Dress",
  "category": "Dresses",
  "price": 89.99,
  "image_url": "https://cdn.example.com/dresses/4821.jpg"
}

The AI View generates:

{
  "product_id": "DRESS-4821",
  "product_name": "Summer Dress",
  "category": "Dresses",
  "price": 89.99,
  "image_url": "https://cdn.example.com/dresses/4821.jpg",
  "image_description": "Navy blue A-line dress with white and pink floral print throughout. Long sleeves with small button cuffs at the wrists. V-neckline with a subtle collar. Lightweight cotton blend fabric with a slight sheen. Knee-length hem. No visible branding or text."
}

Step 3: Index the enriched view in your engine

# fashion_agent_engine.yaml
version: v2
name: fashion_search_agent
data:
  item_table:
    name: products_with_image_descriptions  # ← Use the vision-enriched view
    type: table
encoder:
  name: text-embedding-3-small
  provider: openai
  columns:
    - name: product_name
      weight: 0.3
    - name: category
      weight: 0.2
    - name: image_description  # ← This is what gets embedded and searched
      weight: 1.0

shaped create-engine --file fashion_agent_engine.yaml

Step 4: Query from your agent

# fashion_agent.py
import requests

SHAPED_API_KEY = "your-api-key"

def search_products(user_query: str, limit: int = 20):
    """
    Search products using vision-extracted descriptions.
    """
    response = requests.post(
        "https://api.shaped.ai/v2/rank",
        headers={"x-api-key": SHAPED_API_KEY},
        json={
            "engine_name": "fashion_search_agent",
            "query": user_query,
            "candidates": {
                "table": "products_with_image_descriptions"
            },
            "limit": limit
        }
    )
    response.raise_for_status()  # surface API errors instead of parsing an error body
    
    results = response.json()
    
    products = []
    for result in results['results']:
        products.append({
            'product_id': result['product_id'],
            'product_name': result['product_name'],
            'price': result.get('price'),
            'image_url': result['image_url'],
            'description': result['image_description']
        })
    
    return products


# Usage
user_query = "navy blue dresses with floral patterns and long sleeves"
results = search_products(user_query, limit=10)

for product in results:
    print(f"{product['product_name']} - ${product['price']}")
    print(f"Description: {product['description']}")
    print(f"Image: {product['image_url']}\n")

What the agent gets back:

Summer Dress - $89.99
Description: Navy blue A-line dress with white and pink floral print throughout. Long sleeves with small button cuffs at the wrists. V-neckline with a subtle collar. Lightweight cotton blend fabric with a slight sheen. Knee-length hem.
Image: https://cdn.example.com/dresses/4821.jpg

Floral Maxi Dress - $124.99
Description: Deep navy maxi dress with large pink and white floral pattern. Long sleeves with elastic cuffs. Round neckline. Flowing A-line silhouette. Cotton-linen blend. Floor-length.
Image: https://cdn.example.com/dresses/5293.jpg

The agent now retrieves products based on visual content — color, pattern, sleeve length, neckline — all extracted automatically from the images.


Real-World Use Cases

Use Case 1: E-Commerce Fashion

Problem: 50,000 products uploaded by merchants with minimal metadata. Product name: “Dress” or “Top SKU 8472.” Images contain all the visual information.

AI View Solution:

{
  "name": "fashion_catalog_enriched",
  "view_type": "AI_ENRICHMENT",
  "source_table": "merchant_products",
  "source_columns": ["product_name", "image_url"],
  "enriched_output_columns": ["visual_description"],
  "prompt": "Describe this clothing item's color, pattern, style, cut, sleeve length, neckline, material texture, and length. Be specific and factual."
}

Result: Agents can search “red floral maxi dress” and retrieve exact matches, even when the product name is just “Dress #4821.”

Use Case 2: Social Media Content Discovery

Problem: Users upload millions of photos with no captions. An agent needs to surface “beach sunset photos” or “coffee shop interior shots” but there’s no text to search.

AI View Solution:

{
  "name": "user_photos_enriched",
  "view_type": "AI_ENRICHMENT",
  "source_table": "user_uploads",
  "source_columns": ["user_id", "upload_id", "image_url"],
  "enriched_output_columns": ["scene_description"],
  "prompt": "Describe the scene in this photo: location type (indoor/outdoor, beach, city, nature), time of day, weather, main subjects, colors, mood. Be concise."
}

Result: Agent can retrieve “sunset beach photos” by searching the scene_description field, which contains “Outdoor beach scene at sunset. Golden hour lighting. Ocean waves in background. Warm orange and pink sky.”

Use Case 3: Real Estate Listings

Problem: Property photos show “modern kitchen with marble countertops” but the database just stores kitchen_photo_1.jpg. Buyers search for “granite countertops” or “stainless appliances” and get no results.

AI View Solution:

{
  "name": "property_photos_enriched",
  "view_type": "AI_ENRICHMENT",
  "source_table": "property_listings",
  "source_columns": ["listing_id", "room_type", "image_url"],
  "enriched_output_columns": ["room_features"],
  "prompt": "Describe the room's key features: countertop material, appliance types, flooring, cabinetry, lighting fixtures, overall style (modern, traditional, rustic). Focus on buyer decision factors."
}

Result: Searches for “granite countertops” retrieve all kitchens with granite countertops, extracted from the photos automatically.

Use Case 4: Document Management

Problem: Scanned receipts, invoices, contracts stored as image files. No OCR, no searchable text. An agent can’t retrieve “invoices from Acme Corp” or “receipts over $500.”

AI View Solution:

{
  "name": "documents_ocr_enriched",
  "view_type": "AI_ENRICHMENT",
  "source_table": "scanned_documents",
  "source_columns": ["document_id", "image_url"],
  "enriched_output_columns": ["extracted_text", "document_summary"],
  "prompt": "Extract all visible text from this document. Then provide a 2-sentence summary of what type of document this is and the key information it contains (vendor name, amount, date, purpose)."
}

Result: Agent can search “invoices from Acme Corp over $500 in February 2026” and retrieve matching scanned documents based on extracted text.


Comparison Table

| Dimension | Traditional (Manual Tagging) | Vision AI Views |
|---|---|---|
| Tagging method | Human views image, enters tags | Auto-extracted by vision model |
| Throughput | 50-100 products/hour per person | 500-1,000 products/hour |
| Initial cost | $1,500-5,000 for 10K products | $100-500 for 10K products |
| Ongoing cost per image | $0.30-0.50 (labor) | $0.01-0.05 (API) |
| Error rate | 15-25% (subjective) | 5-10% (some hallucination) |
| Scalability at millions | Breaks (UGC impossible) | Handles millions (serverless) |
| Schema changes | New attributes = new columns | Prompt update, no schema change |
| Time to production | 2-4 weeks | 1-2 days |
| Lines of code | ~600 (portal + logic) | ~40 (YAML + prompt) |

FAQ

Q: Which vision-language models does Shaped support?
A: Shaped’s AI Views use GPT-4V (OpenAI), Claude 3 Sonnet (Anthropic), and Gemini Vision (Google) depending on availability and your account configuration. All models support image analysis and text extraction.

Q: What if the vision model hallucinates details that aren’t in the image?
A: Vision models can occasionally infer details that aren’t visible (e.g., claiming “cotton fabric” when texture isn’t clear). Mitigate this by being specific in your prompt: “Only describe visible, factual details. If uncertain about material or other attributes, say ‘material unclear’.” The original image_url is always preserved so you can verify descriptions.

Q: How much does vision enrichment cost?
A: Vision model API calls cost ~$0.01-0.05 per image depending on model and image resolution. For a 10K product catalog, initial enrichment costs $100-500. Ongoing cost for new uploads is negligible unless you’re processing millions of images.
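For concreteness, here is the arithmetic behind those numbers (per-image prices taken from the stated range, in cents to keep the arithmetic exact):

```python
# Assumed per-image enrichment price from the range above: $0.01-$0.05.
low_cents, high_cents = 1, 5
catalog_size = 10_000

low_total = catalog_size * low_cents / 100    # dollars
high_total = catalog_size * high_cents / 100  # dollars
print(low_total, high_total)  # 100.0 500.0
```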

Q: Can I extract specific structured attributes instead of free-form descriptions?
A: Yes. Use a structured prompt: “Return a JSON object with these keys: color (string), pattern (string), sleeve_length (string: ‘sleeveless’, ‘short’, ‘long’), neckline (string), material (string).” The vision model will return structured data you can parse and store in separate columns.
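A hedged sketch of parsing such a structured response. The raw_response string is a hypothetical model output; real responses may wrap the JSON in markdown fences or add surrounding text, so the parser strips fences and fails soft rather than crashing:

```python
import json

# Hypothetical vision-model output following the structured prompt above.
raw_response = '''{"color": "navy blue", "pattern": "floral",
                   "sleeve_length": "long", "neckline": "V-neck",
                   "material": "cotton blend"}'''

def parse_attributes(text: str) -> dict:
    """Parse a (possibly fenced) JSON attribute object from model output."""
    cleaned = text.strip().removeprefix("```json").removesuffix("```").strip()
    try:
        return json.loads(cleaned)
    except json.JSONDecodeError:
        return {}  # fall back to empty attributes rather than crashing

attrs = parse_attributes(raw_response)
print(attrs["color"])  # navy blue
```

The parsed dict can then be written to separate structured columns (color, pattern, etc.) instead of one free-form description field.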

Q: Does this work with low-quality images or partial views?
A: Vision models perform best with clear, well-lit product photos. Low-resolution images, extreme angles, or obstructed views reduce accuracy. For e-commerce, use the primary product image (front view, good lighting). For user-generated content, descriptions may be less detailed but still useful.

Q: Can I update descriptions when images change?
A: Yes. AI Views re-run enrichment when source rows update. If you replace a product image (new image_url), the vision model re-analyzes it and updates the image_description column within 30 seconds.

Q: What about images with text overlay (e.g., promotional banners)?
A: Vision models can read text in images. Update your prompt to handle this: “Describe the product visible in the image. If there is promotional text or price overlay, ignore it and focus only on the product itself.”

Q: How do I handle multi-image products (e.g., 5 photos per listing)?
A: Create multiple rows in your source table (one per image) or concatenate descriptions from multiple images. For example, analyze all 5 images and combine: “Front view: navy blue dress with floral print. Back view: zipper closure visible. Detail shot: cotton blend fabric tag.”
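The concatenation approach can be sketched in a few lines; the view labels and descriptions below are hypothetical per-image extraction results:

```python
# One vision-extracted description per (view, image) pair for a product.
descriptions = [
    ("front", "Navy blue dress with floral print."),
    ("back", "Zipper closure visible."),
    ("detail", "Cotton blend fabric tag."),
]

# Combine into a single searchable field, labeling each view.
combined = " ".join(
    f"{view.capitalize()} view: {text}" for view, text in descriptions
)
print(combined)
# Front view: Navy blue dress with floral print. Back view: Zipper closure visible. Detail view: Cotton blend fabric tag.
```

The combined string is then stored in one image_description column, so a query matching any view's details retrieves the product.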


Conclusion

The visual blindness problem is structural: your database stores images, your agent can only search text. The traditional fix — manual tagging by humans — works for small catalogs, but it’s slow, expensive, error-prone, and doesn’t scale to user-generated content.

Vision AI Views solve this at the ingestion layer. A vision-language model analyzes each image once, extracts a textual description, and materializes it as a searchable field. The agent retrieves visual context in a single query — no manual tagging, no rigid schemas, no vision model call at query time.
