Quick Answer: Why Your Agent Can’t See
An AI agent answers from the context it retrieves. If your product catalog has images but no textual descriptions, the agent can’t search them. A user asks “show me navy blue dresses with floral patterns” — your database has the images, but the agent retrieves nothing because there’s no text to match.
That’s not a model problem. That’s a visual blindness problem — your images contain semantic information, but your agent can only search text.
Key Takeaways:
- Images are semantically rich but unsearchable — A product photo contains color, style, pattern, material, but agents can’t retrieve it without text
- Vision-language models extract searchable descriptions — GPT-4V, Claude 3, Gemini Vision analyze images and generate text descriptions
- AI Views materialize visual context at ingestion — Extract descriptions once when images are uploaded, not every time the agent queries
- Works on any visual data — Product photos, user-generated content, social media images, document scans, receipts
- Zero inference-time cost — The agent retrieves pre-extracted descriptions, no vision model call during search
Time to read: 20 minutes | Includes: 7 code examples, 2 architecture diagrams, 1 comparison table
Table of Contents
- The Visual Blindness Problem
- What Multimodal AI Views Do
- Part 1: The Traditional Approach
- Part 2: The Shaped Way — Vision AI Views
- Real-World Use Cases
- Comparison Table
- FAQ
The Visual Blindness Problem
Imagine an e-commerce fashion agent. A user searches:
User: “Show me navy blue dresses with floral patterns and long sleeves”
The product database has 50,000 dress images. 127 of them match this description. But here’s what the database actually stores:
```json
{
  "product_id": "DRESS-4821",
  "name": "Summer Dress",
  "category": "Dresses",
  "price": 89.99,
  "image_url": "https://cdn.example.com/dresses/4821.jpg",
  "in_stock": true
}
```
The image_url points to a photo of a navy blue floral dress with long sleeves. But there’s no text field that says “navy blue,” “floral pattern,” or “long sleeves.” The agent can’t retrieve it.
What Happens in Production
Agent query: “navy blue dresses with floral patterns”
Vector search result: Returns nothing, because there’s no textual match for “navy blue” or “floral” in the indexed fields.
Agent response:
“I couldn’t find any navy blue floral dresses. Would you like to browse our dress collection?”
What the user should have seen:
“Here are 127 navy blue floral dresses. The top match is our Summer Maxi Dress with a delicate floral print and long sleeves, currently $89.99.”
The information exists — it’s in the image. But it’s locked away in pixels, invisible to text-based retrieval.
Why This Happens Everywhere
E-commerce: Product images uploaded by merchants often have minimal metadata. The image shows “burgundy velvet blazer with gold buttons” but the product name is just “Blazer - SKU 8472.”
Social media: User-generated content is entirely visual. A photo of a sunset over a beach has no caption. An agent can’t retrieve it for “beach sunset photos” because there’s no text to match.
Real estate: Listing photos show “modern kitchen with marble countertops and stainless steel appliances” but the database field just says kitchen_photo_1.jpg.
Document management: Scanned receipts, invoices, contracts contain critical text and visual layout, but they’re stored as image files with no searchable content.
The semantic gap:
```
Image contains  → navy blue dress, floral pattern, long sleeves, A-line cut
Database has    → "Summer Dress", category: "Dresses"
Agent retrieves → nothing (no text match)
```
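The gap is easy to demonstrate in a few lines of Python (a toy illustration, not production search logic): even generous keyword matching over the stored fields finds nothing, because the visual attributes exist only in the pixels.

```python
# Toy demonstration of the semantic gap: the stored record has no text
# that overlaps with the user's visual query, so nothing is retrieved.
record = {"name": "Summer Dress", "category": "Dresses"}
query_terms = ["navy", "blue", "floral", "long", "sleeves"]

# Flatten every stored field into one searchable string
searchable_text = " ".join(str(v).lower() for v in record.values())

# Check which query terms appear anywhere in the stored text
matches = [t for t in query_terms if t in searchable_text]
print(matches)  # → [] — no term matches, so the item is never retrieved
```

Embedding-based search softens exact-match requirements, but it cannot recover attributes that were never expressed as text in the first place.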
What Multimodal AI Views Do
A multimodal AI View uses a vision-language model (like GPT-4V, Claude 3 Sonnet, or Gemini Vision) to analyze images and generate textual descriptions. These descriptions are materialized as searchable fields in your index.
The Pipeline
```
dress-4821.jpg
      │
      ▼
Vision Enrichment (vision-language model analyzes the image)
      │
      ▼
image_description: "Navy blue dress, floral print, long sleeves"   ← searchable
      ▲
      │ vector search
Agent query: "navy blue floral dresses"
```
Key Properties
1. **Write-time extraction.** The vision model runs when the image is uploaded, not when the agent queries. Extraction cost is paid once per image, not on every search.
2. **Materialized as text.** The image description is stored as a searchable text field. The agent retrieves it using standard vector search, with no vision model call during retrieval.
3. **Automatic updates.** When a product image changes, the AI View re-extracts the description. The index stays current with the visual content.
4. **Source images stay untouched.** The original `image_url` is preserved. The AI View adds a new `image_description` column; it doesn't replace anything.
Part 1: The Traditional Approach (Manual Tagging)
The standard solution is to manually tag images with textual attributes. A human (or a team of humans) looks at each product image and fills in structured fields.
Architecture
Portal:
Tag Image
Fields:
color, pattern,
size, material
Storage:
Tags indexed
Retrieval:
This works for small catalogs. It breaks at scale.
Implementation
Step 1: Tagging interface
```python
# merchant_tagging_portal.py
from flask import Flask, render_template, request
import psycopg2

app = Flask(__name__)


@app.route('/tag-product/<product_id>')
def tag_product(product_id):
    """Show the tagging form for a product image."""
    conn = psycopg2.connect("dbname=ecommerce user=admin")
    cursor = conn.cursor()
    cursor.execute("""
        SELECT product_id, product_name, image_url
        FROM products
        WHERE product_id = %s
    """, (product_id,))
    product = cursor.fetchone()
    conn.close()
    return render_template('tag_form.html', product=product)


@app.route('/save-tags/<product_id>', methods=['POST'])
def save_tags(product_id):
    """Save manually entered tags to the database."""
    tags = {
        'color': request.form.get('color'),
        'pattern': request.form.get('pattern'),
        'sleeve_length': request.form.get('sleeve_length'),
        'neckline': request.form.get('neckline'),
        'material': request.form.get('material'),
        'style': request.form.get('style'),
        'length': request.form.get('length'),
    }
    conn = psycopg2.connect("dbname=ecommerce user=admin")
    cursor = conn.cursor()
    cursor.execute("""
        UPDATE products
        SET color = %s, pattern = %s, sleeve_length = %s,
            neckline = %s, material = %s, style = %s, length = %s,
            tags_updated_at = NOW()
        WHERE product_id = %s
    """, (tags['color'], tags['pattern'], tags['sleeve_length'],
          tags['neckline'], tags['material'], tags['style'],
          tags['length'], product_id))
    conn.commit()
    conn.close()
    return "Tags saved"
```
Step 2: Tag schema
```sql
-- product_tags.sql
ALTER TABLE products ADD COLUMN color VARCHAR(100);
ALTER TABLE products ADD COLUMN pattern VARCHAR(100);
ALTER TABLE products ADD COLUMN sleeve_length VARCHAR(50);
ALTER TABLE products ADD COLUMN neckline VARCHAR(50);
ALTER TABLE products ADD COLUMN material VARCHAR(100);
ALTER TABLE products ADD COLUMN style VARCHAR(100);
ALTER TABLE products ADD COLUMN length VARCHAR(50);
ALTER TABLE products ADD COLUMN tags_updated_at TIMESTAMP;

CREATE INDEX idx_products_color ON products(color);
CREATE INDEX idx_products_pattern ON products(pattern);
CREATE INDEX idx_products_material ON products(material);
```
Step 3: Agent retrieval with structured tags
```python
# agent_search.py
import psycopg2


def search_products(query: str):
    """Search products using manually tagged attributes."""
    # Extract attributes from the natural-language query.
    # (This requires NLP parsing or keyword matching -- a major weak point.)
    attributes = extract_attributes(query)
    # Example: {"color": "navy blue", "pattern": "floral", "sleeve_length": "long"}

    conn = psycopg2.connect("dbname=ecommerce user=admin")
    cursor = conn.cursor()

    # Build a SQL WHERE clause from the extracted attributes
    where_clauses = []
    params = []
    if 'color' in attributes:
        where_clauses.append("color ILIKE %s")
        params.append(f"%{attributes['color']}%")
    if 'pattern' in attributes:
        where_clauses.append("pattern ILIKE %s")
        params.append(f"%{attributes['pattern']}%")
    if 'sleeve_length' in attributes:
        where_clauses.append("sleeve_length = %s")
        params.append(attributes['sleeve_length'])

    where_sql = " AND ".join(where_clauses) if where_clauses else "TRUE"
    cursor.execute(f"""
        SELECT product_id, product_name, image_url, color, pattern, sleeve_length
        FROM products
        WHERE {where_sql}
        LIMIT 20
    """, params)
    results = cursor.fetchall()
    conn.close()
    return results
```
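The `extract_attributes` function above is assumed but never shown. A minimal sketch using naive keyword matching against hand-maintained vocabularies (the vocabularies here are illustrative assumptions, not a real taxonomy) makes the fragility concrete:

```python
# Hypothetical sketch of extract_attributes(): naive keyword matching
# against fixed vocabularies. Any phrasing outside the lists ("full
# sleeve", "dark blue") silently drops the filter -- the synonym problem.
COLORS = ["navy blue", "burgundy", "red", "black", "white"]
PATTERNS = ["floral", "striped", "geometric", "solid"]
SLEEVES = {"long sleeve": "long", "short sleeve": "short",
           "sleeveless": "sleeveless"}


def extract_attributes(query: str) -> dict:
    q = query.lower()
    attrs = {}
    for color in COLORS:
        if color in q:
            attrs["color"] = color
            break
    for pattern in PATTERNS:
        if pattern in q:
            attrs["pattern"] = pattern
            break
    for phrase, value in SLEEVES.items():
        if phrase in q:
            attrs["sleeve_length"] = value
            break
    return attrs


print(extract_attributes("navy blue dresses with floral patterns and long sleeves"))
# → {'color': 'navy blue', 'pattern': 'floral', 'sleeve_length': 'long'}
```

Every new attribute, synonym, or product type means more vocabulary maintenance, which is exactly the rigidity that breaks this approach at scale.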
What You’re Operating
| Component | What It Is | Failure Mode |
|---|---|---|
| Tagging portal | Flask app for manual data entry | Slow, inconsistent, expensive |
| Human taggers | Team of people viewing images | High error rate, terminology inconsistency |
| Structured schema | Fixed columns for each attribute | Rigid, can’t handle new product types |
| Attribute extraction | NLP to parse queries into filters | Misses synonyms (“long sleeve” vs “full sleeve”) |
| Tag maintenance | Update tags when images change | Often forgotten, tags go stale |
The cost:
- Tagging throughput: 1 human tagger can tag ~50-100 products/hour at $15-25/hour
- For 10K products: 100-200 hours of tagging = $1,500-5,000 initial cost
- Ongoing: Every new product upload requires manual tagging before it’s searchable
- Quality: 15-25% error rate on subjective attributes like “style” or “pattern”
- Scalability: Breaks completely for user-generated content (can’t manually tag millions of social media images)
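Plugging midpoints from the figures above into a back-of-envelope calculation gives a rough sense of the gap (illustrative arithmetic only; your rates will vary):

```python
# Back-of-envelope cost comparison using midpoints of the ranges above
# (assumptions: 10,000 products, $20/hour labor, $0.03/image API cost).
products = 10_000
manual_rate = 75               # products tagged per hour (midpoint of 50-100)
hourly_wage = 20               # dollars per hour (midpoint of $15-25)
manual_cost = products / manual_rate * hourly_wage

vision_cost_per_image = 0.03   # dollars (midpoint of $0.01-0.05)
vision_cost = products * vision_cost_per_image

print(f"Manual tagging: ${manual_cost:,.0f}")  # → Manual tagging: $2,667
print(f"Vision model:   ${vision_cost:,.0f}")  # → Vision model:   $300
```

The labor cost also recurs with every catalog refresh, while the vision model re-runs only on changed images.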
Part 2: The Shaped Way — Vision AI Views
Shaped’s multimodal AI Views use vision-language models to automatically extract descriptions from images at ingestion time. You define a view that includes image_url in the source columns, write a prompt describing what to extract, and the vision model generates textual descriptions that get indexed.
Architecture
```
Ingestion (source: image_url)
      │
      ▼
Vision Enrichment (GPT-4V / Claude 3 analyzes the image)
      │
      ▼
Description ("Navy blue dress, floral pattern…")
      │
      ▼
image_description field materialized
      │
      ▼
Vector search on image_description
```
Implementation
Step 1: Connect your product table
```yaml
# products_table.yaml
version: v2
name: products
schema_type: POSTGRES
host: postgres.example.com
port: 5432
database: ecommerce
table_name: products
schema:
  - name: product_id
    type: STRING
  - name: product_name
    type: STRING
  - name: category
    type: STRING
  - name: price
    type: FLOAT
  - name: image_url
    type: STRING   # ← Critical: the vision model reads from this column.
                   #   Must contain publicly accessible URLs (https://...)
  - name: in_stock
    type: BOOLEAN
  - name: created_at
    type: TIMESTAMP
```

```bash
shaped create-table --file products_table.yaml
```
Step 2: Create the vision AI View
```bash
curl -X POST "https://api.shaped.ai/v2/views" \
  -H "Content-Type: application/json" \
  -H "x-api-key: $SHAPED_API_KEY" \
  -d '{
    "name": "products_with_image_descriptions",
    "view_type": "AI_ENRICHMENT",
    "source_table": "products",
    "source_columns": [
      "product_id",
      "product_name",
      "category",
      "image_url"
    ],
    "source_columns_in_output": [
      "product_id",
      "product_name",
      "category",
      "price",
      "image_url"
    ],
    "enriched_output_columns": [
      "image_description"
    ],
    "prompt": "Analyze the product image and describe the item'\''s visual characteristics in detail. Focus on:\n- Primary color and any accent colors\n- Pattern (solid, floral, striped, geometric, etc.)\n- Style and cut (A-line, fitted, oversized, etc.)\n- Sleeve length (sleeveless, short, 3/4, long)\n- Neckline type (crew, V-neck, scoop, etc.)\n- Material texture and finish (cotton, silk, leather, etc.)\n- Length (mini, knee-length, midi, maxi)\n- Any visible branding, text, or distinctive features\n\nBe factual and specific. Avoid marketing language."
  }'
```
What this does:
- Reads `image_url` and other metadata from the `products` table
- The `image_url` column must contain publicly accessible URLs (the vision model fetches images from these URLs)
- Note: If your image column has a different name (e.g., `photo_url`, `img_path`), create a SQL view first to rename it to `image_url`
- Passes the image to a vision-language model (GPT-4V or Claude 3 Sonnet)
- The vision model analyzes the image and generates a description following the prompt instructions
- The description is stored in the `image_description` column
- The enrichment runs once, when the product is uploaded or updated
- The enriched view `products_with_image_descriptions` is now a queryable table
Output example:
Given this input row:
```json
{
  "product_id": "DRESS-4821",
  "product_name": "Summer Dress",
  "category": "Dresses",
  "price": 89.99,
  "image_url": "https://cdn.example.com/dresses/4821.jpg"
}
```
The AI View generates:
```json
{
  "product_id": "DRESS-4821",
  "product_name": "Summer Dress",
  "category": "Dresses",
  "price": 89.99,
  "image_url": "https://cdn.example.com/dresses/4821.jpg",
  "image_description": "Navy blue A-line dress with white and pink floral print throughout. Long sleeves with small button cuffs at the wrists. V-neckline with a subtle collar. Lightweight cotton blend fabric with a slight sheen. Knee-length hem. No visible branding or text."
}
```
Step 3: Index the enriched view in your engine
```yaml
# fashion_agent_engine.yaml
version: v2
name: fashion_search_agent
data:
  item_table:
    name: products_with_image_descriptions  # ← Use the vision-enriched view
    type: table
encoder:
  name: text-embedding-3-small
  provider: openai
  columns:
    - name: product_name
      weight: 0.3
    - name: category
      weight: 0.2
    - name: image_description  # ← This is what gets embedded and searched
      weight: 1.0
```

```bash
shaped create-engine --file fashion_agent_engine.yaml
```
Step 4: Query from your agent
```python
# fashion_agent.py
import requests

SHAPED_API_KEY = "your-api-key"


def search_products(user_query: str, limit: int = 20):
    """Search products using vision-extracted descriptions."""
    response = requests.post(
        "https://api.shaped.ai/v2/rank",
        headers={"x-api-key": SHAPED_API_KEY},
        json={
            "engine_name": "fashion_search_agent",
            "query": user_query,
            "candidates": {
                "table": "products_with_image_descriptions"
            },
            "limit": limit,
        },
    )
    response.raise_for_status()
    results = response.json()

    products = []
    for result in results['results']:
        products.append({
            'product_id': result['product_id'],
            'product_name': result['product_name'],
            'price': result.get('price'),
            'image_url': result['image_url'],
            'description': result['image_description'],
        })
    return products


# Usage
user_query = "navy blue dresses with floral patterns and long sleeves"
results = search_products(user_query, limit=10)

for product in results:
    print(f"{product['product_name']} - ${product['price']}")
    print(f"Description: {product['description']}")
    print(f"Image: {product['image_url']}\n")
```
What the agent gets back:
```
Summer Dress - $89.99
Description: Navy blue A-line dress with white and pink floral print throughout. Long sleeves with small button cuffs at the wrists. V-neckline with a subtle collar. Lightweight cotton blend fabric with a slight sheen. Knee-length hem.
Image: https://cdn.example.com/dresses/4821.jpg

Floral Maxi Dress - $124.99
Description: Deep navy maxi dress with large pink and white floral pattern. Long sleeves with elastic cuffs. Round neckline. Flowing A-line silhouette. Cotton-linen blend. Floor-length.
Image: https://cdn.example.com/dresses/5293.jpg
```
The agent now retrieves products based on visual content — color, pattern, sleeve length, neckline — all extracted automatically from the images.
Real-World Use Cases
Use Case 1: E-Commerce Fashion
Problem: 50,000 products uploaded by merchants with minimal metadata. Product name: “Dress” or “Top SKU 8472.” Images contain all the visual information.
AI View Solution:
```json
{
  "name": "fashion_catalog_enriched",
  "view_type": "AI_ENRICHMENT",
  "source_table": "merchant_products",
  "source_columns": ["product_name", "image_url"],
  "enriched_output_columns": ["visual_description"],
  "prompt": "Describe this clothing item's color, pattern, style, cut, sleeve length, neckline, material texture, and length. Be specific and factual."
}
```
Result: Agents can search “red floral maxi dress” and retrieve exact matches, even when the product name is just “Dress #4821.”
Use Case 2: Social Media Content Discovery
Problem: Users upload millions of photos with no captions. An agent needs to surface “beach sunset photos” or “coffee shop interior shots” but there’s no text to search.
AI View Solution:
```json
{
  "name": "user_photos_enriched",
  "view_type": "AI_ENRICHMENT",
  "source_table": "user_uploads",
  "source_columns": ["user_id", "upload_id", "image_url"],
  "enriched_output_columns": ["scene_description"],
  "prompt": "Describe the scene in this photo: location type (indoor/outdoor, beach, city, nature), time of day, weather, main subjects, colors, mood. Be concise."
}
```
Result: Agent can retrieve “sunset beach photos” by searching the scene_description field, which contains “Outdoor beach scene at sunset. Golden hour lighting. Ocean waves in background. Warm orange and pink sky.”
Use Case 3: Real Estate Listings
Problem: Property photos show “modern kitchen with marble countertops” but the database just stores kitchen_photo_1.jpg. Buyers search for “granite countertops” or “stainless appliances” and get no results.
AI View Solution:
```json
{
  "name": "property_photos_enriched",
  "view_type": "AI_ENRICHMENT",
  "source_table": "property_listings",
  "source_columns": ["listing_id", "room_type", "image_url"],
  "enriched_output_columns": ["room_features"],
  "prompt": "Describe the room's key features: countertop material, appliance types, flooring, cabinetry, lighting fixtures, overall style (modern, traditional, rustic). Focus on buyer decision factors."
}
```
Result: Searches for “granite countertops” retrieve all kitchens with granite countertops, extracted from the photos automatically.
Use Case 4: Document Management
Problem: Scanned receipts, invoices, contracts stored as image files. No OCR, no searchable text. An agent can’t retrieve “invoices from Acme Corp” or “receipts over $500.”
AI View Solution:
```json
{
  "name": "documents_ocr_enriched",
  "view_type": "AI_ENRICHMENT",
  "source_table": "scanned_documents",
  "source_columns": ["document_id", "image_url"],
  "enriched_output_columns": ["extracted_text", "document_summary"],
  "prompt": "Extract all visible text from this document. Then provide a 2-sentence summary of what type of document this is and the key information it contains (vendor name, amount, date, purpose)."
}
```
Result: Agent can search “invoices from Acme Corp over $500 in February 2026” and retrieve matching scanned documents based on extracted text.
Comparison Table
| Component | Traditional (Manual Tagging) | Vision AI Views |
|---|---|---|
| Tagging method | Human views image, enters tags | Auto-extracted by vision model |
| Throughput | 50-100 products/hour per person | 500-1000 products/hour |
| Initial cost | $1,500-5,000 for 10K products | $100-500 for 10K products |
| Ongoing cost per image | $0.30-0.50 (labor) | $0.01-0.05 (API) |
| Error rate | 15-25% (subjective) | 5-10% (some hallucination) |
| Scalability at millions | Breaks (UGC impossible) | Handles millions (serverless) |
| Schema changes | New attributes = new columns | Prompt update, no schema change |
| Time to production | 2-4 weeks | 1-2 days |
| Lines of code | ~600 (portal + logic) | ~40 (YAML + prompt) |
FAQ
Q: Which vision-language models does Shaped support?
A: Shaped’s AI Views use GPT-4V (OpenAI), Claude 3 Sonnet (Anthropic), and Gemini Vision (Google) depending on availability and your account configuration. All models support image analysis and text extraction.
Q: What if the vision model hallucinates details that aren’t in the image?
A: Vision models can occasionally infer details that aren’t visible (e.g., claiming “cotton fabric” when texture isn’t clear). Mitigate this by being specific in your prompt: “Only describe visible, factual details. If uncertain about material or other attributes, say ‘material unclear’.” The original image_url is always preserved so you can verify descriptions.
Q: How much does vision enrichment cost?
A: Vision model API calls cost ~$0.01-0.05 per image depending on model and image resolution. For a 10K product catalog, initial enrichment costs $100-500. Ongoing cost for new uploads is negligible unless you’re processing millions of images.
Q: Can I extract specific structured attributes instead of free-form descriptions?
A: Yes. Use a structured prompt: “Return a JSON object with these keys: color (string), pattern (string), sleeve_length (string: ‘sleeveless’, ‘short’, ‘long’), neckline (string), material (string).” The vision model will return structured data you can parse and store in separate columns.
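As a sketch, parsing such a structured response might look like this (the raw response string and the key set are illustrative assumptions, not a fixed Shaped schema):

```python
# Hypothetical sketch: validating and parsing a structured JSON
# description returned by a vision model prompted for fixed keys.
import json

# Illustrative model output -- real responses may wrap JSON in prose
# or markdown fences and need stripping first.
raw_response = '''{"color": "navy blue", "pattern": "floral",
                   "sleeve_length": "long", "neckline": "V-neck",
                   "material": "cotton blend"}'''

attrs = json.loads(raw_response)

# Validate against the expected keys before writing to columns
expected = {"color", "pattern", "sleeve_length", "neckline", "material"}
missing = expected - attrs.keys()
if missing:
    raise ValueError(f"vision model omitted keys: {missing}")

print(attrs["color"], attrs["sleeve_length"])  # → navy blue long
```

Validating before storage matters because vision models occasionally drop keys or return free text despite structured-output prompts.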
Q: Does this work with low-quality images or partial views?
A: Vision models perform best with clear, well-lit product photos. Low-resolution images, extreme angles, or obstructed views reduce accuracy. For e-commerce, use the primary product image (front view, good lighting). For user-generated content, descriptions may be less detailed but still useful.
Q: Can I update descriptions when images change?
A: Yes. AI Views re-run enrichment when source rows update. If you replace a product image (new image_url), the vision model re-analyzes it and updates the image_description column within 30 seconds.
Q: What about images with text overlay (e.g., promotional banners)?
A: Vision models can read text in images. Update your prompt to handle this: “Describe the product visible in the image. If there is promotional text or price overlay, ignore it and focus only on the product itself.”
Q: How do I handle multi-image products (e.g., 5 photos per listing)?
A: Create multiple rows in your source table (one per image) or concatenate descriptions from multiple images. For example, analyze all 5 images and combine: “Front view: navy blue dress with floral print. Back view: zipper closure visible. Detail shot: cotton blend fabric tag.”
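A minimal sketch of the concatenation approach (the view labels and descriptions are illustrative, taken from the example above):

```python
# Combine per-image descriptions for a multi-photo listing into one
# searchable field. View labels are illustrative, not a Shaped schema.
descriptions = {
    "front": "Navy blue dress with floral print.",
    "back": "Zipper closure visible.",
    "detail": "Cotton blend fabric tag.",
}

combined = " ".join(
    f"{view.capitalize()} view: {text}" for view, text in descriptions.items()
)
print(combined)
# → Front view: Navy blue dress with floral print. Back view: Zipper closure visible. Detail view: Cotton blend fabric tag.
```

Concatenation keeps one row per product, which is simpler for ranking; one row per image gives finer-grained retrieval but requires deduplicating products in results.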
Conclusion
The visual blindness problem is structural: your database stores images, your agent can only search text. The traditional fix — manual tagging by humans — works for small catalogs, but it’s slow, expensive, error-prone, and doesn’t scale to user-generated content.
Vision AI Views solve this at the ingestion layer. A vision-language model analyzes each image once, extracts a textual description, and materializes it as a searchable field. The agent retrieves visual context in a single query — no manual tagging, no rigid schemas, no vision model call at query time.