Quick Answer: Visual Embeddings + Collaborative Signals + AI-Extracted Semantics
A user saves a pin of a mid-century modern living room. Thirty seconds later, they’re scrolling through a feed of related pins: similar furniture styles, matching color palettes, rooms with the same cozy-but-minimal aesthetic. Some of these pins were uploaded this morning — zero saves, zero clicks, no engagement history at all.
How?
Pinterest’s Related Pins blends three types of intelligence:
- Visual similarity (CLIP embeddings): “This image looks like that image” — color, composition, style, objects
- Collaborative filtering (Two-Tower): “Users who saved this also saved…”
- Semantic understanding (AI-extracted features): Style labels, dominant colors, mood, room type — pulled directly from the image by a vision-language model
None of the three works alone:
- Visual similarity alone surfaces anything that looks similar — a red shoe and a red car share color features, but nobody wants a car when shopping for shoes
- Collaborative filtering alone can’t recommend new pins (zero interactions = zero signal)
- Text metadata alone is often empty on Pinterest — most pins have no description, just an image
Multimodal discovery fuses all three signals. The result: new content enters the discovery loop from the moment it’s uploaded, and established content benefits from behavioral intelligence that captures taste patterns no visual model can learn.
Key Takeaways:
- CLIP embeddings give every image a visual fingerprint instantly — no interaction history needed
- AI Views fill the metadata gap — vision-language models extract style, objects, mood, and color from images that have no title or description
- Two-Tower models learn what CLIP can’t — personalized taste patterns from save/click behavior
- Score ensembles in ShapedQL blend all three signals adaptively — more visual weight for new pins, more collaborative weight for established ones
- One engine, one config — no separate FAISS index, no ALS retraining pipeline, no TF-IDF
Time to read: 22 minutes | Includes: 9 code examples, 2 architecture diagrams, 1 comparison table
This is Part 2 of the “How to Build” series. Part 1 covers Spotify’s Discover Weekly with hybrid filtering. This article focuses on image-first platforms where visual similarity is the primary signal.
Table of Contents
- Why Visual Discovery Is a Different Problem
- Why Pure Visual Similarity Fails
- Why Pure Collaborative Filtering Fails
- Part 1: The Traditional Approach (and Why It Hurts)
- Part 2: The Shaped Way — CLIP + AI Views + Two-Tower
- Building the System End-to-End
- Score Ensemble Strategies
- Comparison: Traditional vs. Shaped
- FAQ
Why Visual Discovery Is a Different Problem
Most recommendation systems start with structured data: product catalogs with brand, category, price, and size. You can filter, facet, and match on those attributes.
Pinterest doesn’t have that luxury. The core unit of content is an image. And images don’t come with structured metadata. A pin of a beautiful kitchen doesn’t have a `style: scandinavian` field. It has pixels.
The Pinterest Discovery Loop
A user sees a pin. They save it. Pinterest must now answer the hardest question in visual recommendation:
What makes this pin interesting to this user?
Is it the furniture style? The color palette? The room layout? The specific lamp in the corner? The vibe?
The answer varies by user. Two people save the same photo of a rustic farmhouse kitchen:
- User A (renovating their house) wants to see similar kitchen layouts and cabinet styles
- User B (a food blogger) wants to see recipe pins shot in similar-looking kitchens
This is why text-based recommendations fail on visual platforms. The interesting features are latent — they live inside the pixels, and their relevance depends on who’s looking.
What Related Pins Actually Does
When you tap a pin on Pinterest, the Related Pins feed blends:
- Visual features — CLIP embeddings capture composition, color, objects, style
- Collaborative signals — Save/click co-occurrence patterns learned from user behavior
- Semantic signals — Pin descriptions, board names, and extracted text
- Engagement signals — Save rate, click-through rate, close-up rate
When a new pin is uploaded with zero saves, visual features and semantic signals carry the load. As the pin accumulates interactions, collaborative signals take over. The blend shifts automatically.
For a deep dive into how Pinterest’s production retrieval model actually works under the hood, see our PinRec teardown.
Why Pure Visual Similarity Fails
CLIP (Contrastive Language-Image Pre-training) encodes images and text into the same embedding space. Two images with similar CLIP embeddings look visually similar. This is powerful — but it has three failure modes that prevent it from working alone.
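To make "similar embeddings" concrete, here's a toy nearest-neighbor sketch (numpy; the vectors and pin names are invented stand-ins, and real CLIP embeddings are 512-dimensional, not 3):

```python
# Toy illustration (not Shaped's internals): nearest-neighbor retrieval in an
# embedding space. Real CLIP vectors are 512-d; these 3-d vectors are stand-ins.
import numpy as np

def top_k_similar(query: np.ndarray, catalog: dict, k: int = 3):
    """Rank catalog items by cosine similarity to the query embedding."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    scored = sorted(((cos(query, v), item) for item, v in catalog.items()), reverse=True)
    return [item for _, item in scored[:k]]

catalog = {
    "white_kitchen_marble":   np.array([0.9, 0.8, 0.1]),
    "white_kitchen_pendants": np.array([0.85, 0.75, 0.15]),
    "white_bathroom":         np.array([0.9, 0.2, 0.1]),   # same palette, wrong room
    "red_sports_car":         np.array([0.1, 0.1, 0.9]),
}
query = np.array([0.88, 0.79, 0.12])  # the saved white-kitchen pin

print(top_k_similar(query, catalog))
```

Note that the bathroom still makes the top three purely on palette, which is exactly the first failure mode below.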
Failure 1: Visual similarity does not equal intent similarity
A user saves a pin of a white minimalist kitchen. Pure CLIP similarity returns:
| Rank | Result | Relevant? |
|---|---|---|
| 1 | White minimalist kitchen with marble island | Yes |
| 2 | White minimalist kitchen with pendant lights | Yes |
| 3 | White minimalist bathroom (similar palette) | No — wrong room |
| 4 | White product photography studio (similar lighting) | No — wrong category |
| 5 | White minimalist kitchen from a real estate listing | Borderline |
CLIP captures what the image looks like, not what the user wants from it. Visually similar does not equal useful.
Failure 2: No personalization, no intent
CLIP embeddings describe the image, not the user. Everyone who taps the same pin gets identical results — there’s no “users like you tend to prefer…” signal.
Worse, CLIP can’t distinguish why someone saved a pin. A blue velvet sofa in a styled living room gets saved for the fabric, the layout, and the brand — by three different users with three different needs. CLIP returns the same visually similar images for all three.
| Limitation | Impact |
|---|---|
| Visual does not equal intent | Irrelevant results across categories (kitchens to bathrooms) |
| No personalization | Same results for every user, every time |
| Ambiguous intent | Can’t tell why a user saved a pin — fabric? layout? brand? |
Why Pure Collaborative Filtering Fails
Collaborative filtering learns from behavior: “Users who saved Pin A also saved Pin B.” It captures taste patterns that no visual model can learn — but it has a fatal weakness on visual platforms.
A new pin arrives: a stunning sunset over Santorini, uploaded 5 minutes ago. Collaborative filtering has zero interaction data. The pin never enters Related Pins. By the time enough users save it, it’s no longer new.
This cold-start problem is worse on Pinterest than on music or e-commerce platforms. Those have finite catalogs. Pinterest has an effectively infinite stream of user-generated images — and collaborative filtering is blind to all of them until humans interact.
The other limitations compound: popularity bias means the top 1% of pins dominate recommendations, sparse interactions (most pins have fewer than 5 saves) produce noisy signals, and collaborative filtering has no visual understanding — two co-saved pins might look completely different.
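The cold-start gap is easy to see in a toy item-to-item co-occurrence model (pure Python, invented save data):

```python
# Toy item-to-item collaborative filtering over save events (invented data).
# score(b | a) = number of users who saved both a and b.
from collections import Counter
from itertools import combinations

saves = {  # user -> set of saved pins
    "u1": {"pin_A", "pin_B"},
    "u2": {"pin_A", "pin_B", "pin_C"},
    "u3": {"pin_A", "pin_C"},
}

co_saves = Counter()
for pins in saves.values():
    for a, b in combinations(sorted(pins), 2):
        co_saves[(a, b)] += 1
        co_saves[(b, a)] += 1

def related(pin, k=5):
    scored = [(n, other) for (p, other), n in co_saves.items() if p == pin]
    return [other for n, other in sorted(scored, reverse=True)[:k]]

print(related("pin_A"))    # established pin: has neighbors
print(related("pin_new"))  # uploaded 5 minutes ago: [] -- invisible to CF
```

`pin_new` has features any human can see instantly, but until someone saves it, co-occurrence counting has literally nothing to return.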
If you’ve read our Discover Weekly hybrid filtering playbook, you know the cold-start story. The difference on visual platforms is that the solution requires a visual signal, not just content enrichment on text metadata.
Part 1: The Traditional Approach (and Why It Hurts)
The traditional approach runs three separate systems and merges their outputs in application code.
Architecture: CLIP + FAISS → visual scores; ALS → collaborative scores; TF-IDF → text scores → blend in application code → rank by final score → top 20 related pins
Here’s what that implementation looks like in practice — and why every step is a maintenance burden:
```python
# related_pins_traditional.py — The pain in 40 lines
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Assumed module-level state — each one is its own pipeline to maintain:
# clip_embeddings, faiss_index, als_model, text_vectors, normalize()

def get_related_pins(pin_id, n=20):
    # 1. Visual: query CLIP FAISS index (you manage GPU inference + index updates)
    clip_embedding = clip_embeddings[pin_id]
    visual_distances, visual_ids = faiss_index.search(clip_embedding.reshape(1, -1), 100)

    # 2. Collaborative: compute ALS dot products (you retrain weekly)
    collab_scores = {}
    pin_factor = als_model.item_factors[pin_id]
    for cid in visual_ids[0]:
        collab_scores[cid] = float(np.dot(pin_factor, als_model.item_factors[cid]))

    # 3. Text: TF-IDF cosine similarity (pin descriptions are often empty)
    text_scores = {}
    for cid in visual_ids[0]:
        text_scores[cid] = cosine_similarity(text_vectors[pin_id], text_vectors[cid])[0][0]

    # 4. Blend in application code (hardcoded weights, no adaptivity)
    blended = {}
    for cid in visual_ids[0]:
        blended[cid] = (
            0.5 * normalize(visual_distances[0][list(visual_ids[0]).index(cid)])
            + 0.3 * normalize(collab_scores.get(cid, 0))
            + 0.2 * normalize(text_scores.get(cid, 0))
        )
    return sorted(blended.items(), key=lambda x: x[1], reverse=True)[:n]
```
This is ~40 lines. But behind those 40 lines, you’re managing:
| What You Maintain | What It Costs |
|---|---|
| CLIP inference pipeline | GPU provisioning, model serving, batch processing for new pins |
| FAISS index | Incremental updates are hard; most teams rebuild nightly |
| ALS model | Weekly retraining, sparse matrix construction, model serialization |
| TF-IDF index | Useless when pin descriptions are empty (most of the time on Pinterest) |
| Blending logic | Hardcoded in application code; changing weights requires a deploy |
| Three separate data pipelines | ETL, monitoring, alerting — times three |
The real cost isn’t the code. It’s the infrastructure. Three models, three indexes, three update pipelines, one fragile blending function, and no adaptivity.
Part 2: The Shaped Way — CLIP + AI Views + Two-Tower
Shaped replaces all three systems with a single engine that computes CLIP embeddings, extracts visual features via AI Views, trains collaborative models, and blends everything in ShapedQL.
Architecture: pin image → CLIP embeddings (openai/clip-vit-base-patch32); pin image → AI View (VLM extracts style, colors, objects, mood) → text embeddings (sentence-transformers/modernbert); interactions + item features → Two-Tower model (auto-trained) → one ShapedQL blend
Three key differences from the traditional approach:
- AI Views fill the metadata gap. Most pins have no description. Shaped’s AI View runs a vision-language model over every pin image and extracts structured attributes — style, dominant colors, objects, room type, mood — automatically. This turns “untitled image with zero metadata” into a richly described item that text embeddings can reason about.
- CLIP embeddings are managed infrastructure. No FAISS. No GPU provisioning. Declare `openai/clip-vit-base-patch32` in your engine config and Shaped handles inference, indexing, and incremental updates.
- Score blending is a query, not a deploy. Changing from 50/30/20 to 60/20/20 is a one-line edit in ShapedQL. No model retraining, no redeployment.
Step 1: AI Views — Turning Images Into Searchable Features
This is the Shaped differentiator for visual platforms. Pinterest’s core problem is that images lack metadata. AI Views solve this by analyzing the actual pixels.
```yaml
# views/pin_image_enrichment.yaml
version: v2
name: pin_image_enrichment
view_type: AI_ENRICHMENT
source_table: pins
source_columns:
  - item_id
  - title
  - image_url
source_columns_in_output:
  - item_id
  - title
enriched_output_columns:
  - visual_description
prompt: |
  Analyze the pin image and extract visual attributes as a concise description.
  Include: primary design style (e.g., minimalist, bohemian, industrial, mid-century modern),
  dominant colors and color palette, key objects and furniture visible,
  room type or scene category, mood or aesthetic (e.g., cozy, dramatic, airy),
  and notable patterns, textures, or materials.
  Focus on attributes that help match this pin with visually similar content.
  Keep the description factual — no marketing language.
```
Shaped runs a vision-language model over every pin image and materializes the result. Here’s what the output looks like for real pins:
Pin A: No title, no description — just an image of a living room.
| Field | Value |
|---|---|
item_id | pin_8829 |
title | (empty) |
visual_description | Mid-century modern living room. Walnut credenza, mustard yellow accent chair, brass pendant lighting. White walls, warm wood floors. Geometric rug in cream and rust. Clean lines, organic shapes. Warm, curated, 1960s influence. |
Pin B: Title is just a plant emoji — not useful for matching.
| Field | Value |
|---|---|
item_id | pin_3341 |
title | (plant emoji) |
visual_description | Indoor plant arrangement on wooden floating shelf. Monstera deliciosa, trailing pothos, snake plant in terracotta pots and woven baskets. White wall backdrop, natural light from left. Bohemian, earthy, organic aesthetic. |
Pin C: Title says “inspo” — not useful for matching.
| Field | Value |
|---|---|
item_id | pin_7102 |
title | inspo |
visual_description | Scandinavian kitchen. White oak cabinets, terrazzo countertops, matte black faucet. Open shelving with ceramic dishes. Linen pendant lights. Sage green subway tile backsplash. Minimal, airy, natural materials. |
Why this matters: Without the AI View, pins A, B, and C have no usable text features. TF-IDF on an empty title returns nothing. But with AI-enriched descriptions, text embeddings can now compute meaningful semantic similarity: Pin A and Pin C both describe “minimal” aesthetics with “natural materials” — they’ll appear in each other’s related feeds even though one is a living room and the other is a kitchen, because the style matches.
For more on configuring AI Views — including multi-column enrichment and prompt best practices — see the AI enrichment documentation.
Step 2: Configure the Engine — CLIP + Text + Two-Tower
```yaml
# engines/related_pins.yaml
version: v2
name: related_pins

data:
  item_table:
    name: pins
    type: table
  user_table:
    name: users
    type: table
  interaction_table:
    name: pin_interactions
    type: table

schema_override:
  item:
    id: item_id
    features:
      - name: image_url
        type: Image
      - name: title
        type: Text
      - name: board_category
        type: TextCategory
    created_at: created_at
  interaction:
    id: interaction_id
    item_id: item_id
    user_id: user_id
    label: interaction_type
    created_at: created_at

index:
  embeddings:
    # 1. Visual: CLIP embeddings on pin images
    - name: clip_embedding
      encoder:
        type: hugging_face
        model_name: openai/clip-vit-base-patch32
        batch_size: 32
      item_fields:
        - image_url
    # 2. Semantic: Text embeddings on AI-enriched descriptions
    - name: content_embedding
      encoder:
        type: hugging_face
        model_name: sentence-transformers/modernbert
        batch_size: 256
      item_fields:
        - pin_image_enrichment.visual_description
    # 3. Collaborative: Trained Two-Tower embeddings
    - name: twotower_embedding
      encoder:
        type: trained_model
        model_ref: twotower_collab

training:
  models:
    - name: twotower_collab
      policy_type: two_tower
      strategy: early_stopping
```
When you run `shaped create-engine --file engines/related_pins.yaml`, Shaped:
- Computes CLIP embeddings for every pin image (batch_size: 32 — smaller batches for image processing)
- Generates sentence transformer embeddings on AI-enriched visual descriptions (batch_size: 256 for text)
- Trains a Two-Tower model on pin_interactions — learning user-item affinities that blend collaborative and content signals
- Builds ANN indexes for all three embedding spaces
- Auto-updates when new pins arrive — no manual re-indexing
Why Two-Tower instead of ELSA or ALS? Pure collaborative models like ELSA and ALS learn item relationships solely from co-occurrence patterns — they ignore item features entirely. Two-Tower is different: it separates user and item computation into two neural networks, and the item tower can incorporate rich features like image embeddings, board categories, and text descriptions alongside interaction data. On a visual platform where item features (the image itself) carry enormous signal, Two-Tower produces collaborative embeddings that are informed by visual content — not just by who saved what. This means even items with sparse interaction history get meaningful collaborative embeddings if their visual features are strong. For a deeper comparison of model policies, see the choosing a model policy guide.
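A minimal numpy sketch of the two-tower idea (invented weights and dimensions, one linear layer per tower; real towers are deeper networks trained on interactions):

```python
# Minimal two-tower sketch (numpy, invented random weights — not Shaped's model).
# The item tower maps item *features* to an embedding, so an item with zero
# interactions still lands somewhere meaningful in the shared space.
import numpy as np

rng = np.random.default_rng(0)
D_USER, D_ITEM_FEAT, D_EMB = 4, 6, 3

W_user = rng.normal(size=(D_EMB, D_USER))       # user tower (one linear layer)
W_item = rng.normal(size=(D_EMB, D_ITEM_FEAT))  # item tower (one linear layer)

def user_embedding(user_features):
    return W_user @ user_features

def item_embedding(item_features):
    # item_features might concatenate a CLIP vector, category one-hots, etc.
    return W_item @ item_features

def affinity(user_features, item_features):
    # Dot product in the shared embedding space = predicted user-item affinity
    return float(user_embedding(user_features) @ item_embedding(item_features))

user = rng.normal(size=D_USER)
established_pin = rng.normal(size=D_ITEM_FEAT)
brand_new_pin = rng.normal(size=D_ITEM_FEAT)  # zero saves, but it has features

# Both pins get a score: no cold-start gap at the item tower.
print(affinity(user, established_pin), affinity(user, brand_new_pin))
```

The design choice this illustrates: because the item tower consumes features rather than an item ID lookup, a brand-new pin with strong visual features still gets a usable collaborative-style embedding.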
Step 3: Query — The Related Pins API Call
```python
# app.py
import requests

SHAPED_API_KEY = "your-api-key"

def get_related_pins(pin_id: str, user_id: str = None, limit: int = 20):
    """
    Multimodal Related Pins: CLIP (visual) + Two-Tower (collab) + AI-enriched (semantic)
    """
    response = requests.post(
        "https://api.shaped.ai/v2/engines/related_pins/query",
        headers={"x-api-key": SHAPED_API_KEY},
        json={
            "query": """
                SELECT *
                FROM similarity(
                    embedding_ref='clip_embedding',
                    encoder='item_attribute_pooling',
                    input_item_id=$item_id,
                    limit=500
                )
                WHERE item_id != $item_id
                ORDER BY score(
                    expression='
                        0.5 / (1.0 + rank(embedding="clip_embedding"))
                        + 0.3 / (1.0 + rank(embedding="twotower_embedding"))
                        + 0.2 / (1.0 + rank(embedding="content_embedding"))
                    ',
                    input_user_id=$user_id
                )
                REORDER BY diversity(diversity_lookback_window=50)
                LIMIT $limit
            """,
            "parameters": {
                "item_id": pin_id,
                "user_id": user_id,
                "limit": limit
            },
            "return_metadata": True
        }
    )
    return response.json()['results']
```
What each stage does:
- Retrieve — `similarity(embedding_ref='clip_embedding', input_item_id=$item_id)` fetches the 500 most visually similar pins to the source pin using CLIP embeddings
- Filter — `WHERE item_id != $item_id` excludes the pin itself
- Score — Blends three signals: 50% visual similarity (CLIP), 30% collaborative affinity (Two-Tower), 20% semantic match (AI-enriched text). The `input_user_id` personalizes the Two-Tower score.
- Reorder — `diversity(diversity_lookback_window=50)` prevents the feed from showing 20 near-identical images
- Return — Top 20 related pins with metadata
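The score expression is a weighted reciprocal-rank fusion. Here's the same arithmetic in pure Python (item names and ranks invented; this sketch assumes 0-based ranks, which is an assumption about Shaped's `rank()` convention):

```python
# Weighted reciprocal-rank fusion: weight / (1 + rank), summed across signals.
def blend(ranks: dict, weights: dict) -> dict:
    """ranks: signal -> {item: rank}; items missing from a signal get no contribution."""
    scores = {}
    for signal, weight in weights.items():
        for item, rank in ranks[signal].items():
            scores[item] = scores.get(item, 0.0) + weight / (1.0 + rank)
    return scores

ranks = {  # invented toy ranks, 0 = best
    "clip":     {"pin_X": 0, "pin_Y": 5},
    "twotower": {"pin_X": 4, "pin_Y": 0},
    "content":  {"pin_X": 1, "pin_Y": 3},
}
weights = {"clip": 0.5, "twotower": 0.3, "content": 0.2}

scores = blend(ranks, weights)
print(sorted(scores, key=scores.get, reverse=True))  # pin_X wins on visual rank
```

Rank fusion is robust because it never compares raw scores across embedding spaces, only positions within each ranked list.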
Example response:
```json
{
  "results": [
    {
      "item_id": "pin_9921",
      "title": "Living Room Goals",
      "image_url": "https://cdn.example.com/pin_9921.jpg",
      "board_category": "Home Decor",
      "score": 0.847,
      "metadata": {
        "clip_rank": 3,
        "twotower_rank": 1,
        "content_rank": 7
      }
    },
    {
      "item_id": "pin_0445",
      "title": "",
      "image_url": "https://cdn.example.com/pin_0445.jpg",
      "board_category": "Home Decor",
      "score": 0.812,
      "metadata": {
        "clip_rank": 1,
        "twotower_rank": 12,
        "content_rank": 2
      }
    }
  ]
}
```
Notice `pin_0445` — it has no title. But it ranked #2 overall because it was the top CLIP match (#1 visual similarity) and the AI-enriched description made it #2 for content similarity. Without the AI View, this pin would have had zero text signal and ranked much lower.
The Difference in Action
This is where the three signals earn their keep. Let’s go back to the opening scenario: a user saves a pin of a mid-century modern living room — walnut furniture, mustard accents, brass lighting.
Here’s what each approach returns:
CLIP only (visual similarity):
| # | Result | Why it matched | Relevant? |
|---|---|---|---|
| 1 | Mid-century modern dining room, walnut table | Same wood tones, similar era | Yes |
| 2 | Mid-century modern living room, teak credenza | Nearly identical style | Yes |
| 3 | Minimalist bathroom, brass fixtures, warm wood | Same color palette, wrong room | No |
| 4 | Real estate listing photo, staged living room | Same composition, stock photography feel | Borderline |
| 5 | Mid-century office space, Herman Miller chairs | Same era, wrong context (commercial) | Borderline |
3 out of 5 are useful. The bathroom and office sneak in because they look similar.
Shaped multimodal (CLIP + Two-Tower + AI View):
| # | Result | Why it ranked | Relevant? |
|---|---|---|---|
| 1 | Mid-century modern living room, similar layout, different color scheme | CLIP: visual match. Two-Tower: users who saved the source pin also saved this. Content: AI View tagged both as “mid-century modern, warm tones, organic shapes” | Yes |
| 2 | Scandinavian-MCM fusion living room, low-profile sofa | CLIP: similar composition. Content: AI View found “clean lines, natural materials, 1960s influence” — shared style DNA | Yes |
| 3 | Mid-century modern bedroom, walnut headboard, mustard throw | CLIP: similar palette. Two-Tower: high co-save rate with living room pins. Content: “warm wood, mustard accent, mid-century” | Yes |
| 4 | Vintage furniture store pin showing a walnut credenza | Two-Tower: users who save MCM rooms also save furniture shopping pins. Content: “walnut, mid-century, credenza” | Yes |
| 5 | DIY guide: how to refinish mid-century furniture | Two-Tower: strong co-save signal. Content: “mid-century, restoration, walnut.” CLIP rank was #47 (visually different — it’s a tutorial, not a room photo) | Yes |
5 out of 5 are useful. The bathroom and office are gone. Instead, the system surfaced a furniture shopping pin (#4) and a DIY guide (#5) that CLIP alone would never have found — they don’t look like living rooms, but they match the user’s intent.
That’s the difference. CLIP finds images that look alike. The multimodal blend finds content that belongs together in a user’s discovery journey.
Handling Cold Start: New Pins with Zero Interactions
When a pin is uploaded with zero saves:
- CLIP embedding — computed immediately from the image
- Content embedding — AI View extracts description, text embedding generated
- Two-Tower embedding — no interaction data yet (skipped)
Shift the blend:
```python
# app.py
def get_related_to_new_pin(pin_id: str, limit: int = 20):
    """
    New pins: 60% visual + 40% semantic. No collaborative signal needed.
    """
    response = requests.post(
        "https://api.shaped.ai/v2/engines/related_pins/query",
        headers={"x-api-key": SHAPED_API_KEY},
        json={
            "query": """
                SELECT *
                FROM similarity(
                    embedding_ref='clip_embedding',
                    encoder='item_attribute_pooling',
                    input_item_id=$item_id,
                    limit=500
                )
                WHERE item_id != $item_id
                ORDER BY score(
                    expression='
                        0.6 / (1.0 + rank(embedding="clip_embedding"))
                        + 0.4 / (1.0 + rank(embedding="content_embedding"))
                    '
                )
                REORDER BY diversity(diversity_lookback_window=50)
                LIMIT $limit
            """,
            "parameters": {
                "item_id": pin_id,
                "limit": limit
            }
        }
    )
    return response.json()['results']
```
The pin enters the Related Pins feed immediately. No waiting for saves. No cold-start gap. As interactions accumulate, you transition to the full three-signal blend with the CASE WHEN pattern shown in the Score Ensemble Strategies section.
Building the System End-to-End
Full setup in four steps
1. Connect your data
```yaml
# tables/pins.yaml
version: v2
name: pins
connector:
  type: postgres
  connection_string: $DATABASE_URL
  table: pins
schema:
  - name: item_id
    type: STRING
  - name: title
    type: STRING
  - name: description
    type: STRING
  - name: image_url
    type: STRING
  - name: board_category
    type: STRING
  - name: created_at
    type: TIMESTAMP
```

```yaml
# tables/pin_interactions.yaml
version: v2
name: pin_interactions
connector:
  type: postgres
  connection_string: $DATABASE_URL
  table: pin_interactions
schema:
  - name: interaction_id
    type: STRING
  - name: user_id
    type: STRING
  - name: item_id
    type: STRING
  - name: interaction_type
    type: STRING
  - name: created_at
    type: TIMESTAMP
```
2. Create the AI View
```bash
shaped create-view --file views/pin_image_enrichment.yaml
```
Shaped processes every pin image through the vision-language model and materializes the descriptions. New pins are enriched automatically as they arrive in the pins table.
3. Create and train the engine
```bash
shaped create-engine --file engines/related_pins.yaml
```
4. Query
```python
related = get_related_pins(pin_id="pin_8829", user_id="user_445", limit=20)
```
That’s it. No FAISS to manage. No ALS retraining cron job. No TF-IDF pipeline for text features. One YAML config, one AI View, one query.
Score Ensemble Strategies
ShapedQL score expressions let you adapt the blend dynamically — without retraining or redeploying. Here are four strategies for different contexts.
Strategy 1: Adaptive blending by pin age
New pins rely on visual + semantic. Established pins lean on collaborative.
```sql
-- adaptive_by_age.sql
ORDER BY score(
  expression='
    CASE
      WHEN days_since_release < 3 THEN
        0.55 / (1.0 + rank(embedding="clip_embedding"))
        + 0.45 / (1.0 + rank(embedding="content_embedding"))
      WHEN days_since_release < 14 THEN
        0.4 / (1.0 + rank(embedding="clip_embedding"))
        + 0.3 / (1.0 + rank(embedding="twotower_embedding"))
        + 0.3 / (1.0 + rank(embedding="content_embedding"))
      ELSE
        0.3 / (1.0 + rank(embedding="clip_embedding"))
        + 0.5 / (1.0 + rank(embedding="twotower_embedding"))
        + 0.2 / (1.0 + rank(embedding="content_embedding"))
    END
  '
)
```
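The same CASE logic, mirrored in Python for readability (thresholds and weights taken from the SQL above):

```python
# Age-adaptive blend weights, mirroring the SQL CASE expression.
def blend_weights(days_since_release: float) -> dict:
    if days_since_release < 3:     # brand new: no collaborative signal yet
        return {"clip": 0.55, "twotower": 0.0, "content": 0.45}
    elif days_since_release < 14:  # warming up: collaborative enters the mix
        return {"clip": 0.4, "twotower": 0.3, "content": 0.3}
    else:                          # established: behavior dominates
        return {"clip": 0.3, "twotower": 0.5, "content": 0.2}

print(blend_weights(1), blend_weights(30))
```

The advantage of expressing this in ShapedQL rather than application code is that the thresholds live in the query, so tuning them needs no deploy.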
Strategy 2: Boost trending pins
Surface pins that are gaining traction by mixing in popularity:
```sql
-- trending_visual.sql
ORDER BY score(
  expression='
    0.4 / (1.0 + rank(embedding="clip_embedding"))
    + 0.25 / (1.0 + rank(embedding="twotower_embedding"))
    + 0.15 / (1.0 + rank(embedding="content_embedding"))
    + 0.2 / (1.0 + item._derived_popular_rank)
  '
)
```
Strategy 3: Category-constrained visual similarity
Keep results within the same board category to prevent cross-category bleed (no bathroom results for a kitchen pin):
```sql
-- same_category.sql
SELECT *
FROM similarity(
  embedding_ref='clip_embedding',
  encoder='item_attribute_pooling',
  input_item_id=$item_id,
  limit=500
)
WHERE board_category = $source_category
  AND item_id != $item_id
ORDER BY score(
  expression='
    0.5 / (1.0 + rank(embedding="clip_embedding"))
    + 0.3 / (1.0 + rank(embedding="twotower_embedding"))
    + 0.2 / (1.0 + rank(embedding="content_embedding"))
  '
)
REORDER BY diversity(diversity_lookback_window=50)
LIMIT 20
```
Strategy 4: Personalized visual home feed
The strategies above start from a single pin. But you can also retrieve pins similar to a user’s entire recent save history — turning the Related Pins engine into a personalized home feed. The key difference: `encoder='interaction_round_robin'` pools the user’s recent saves instead of anchoring on one pin.
```sql
-- visual_home_feed.sql
SELECT *
FROM similarity(
  embedding_ref='clip_embedding',
  encoder='interaction_round_robin',
  input_user_id=$user_id,
  limit=500
)
ORDER BY score(
  expression='
    0.4 / (1.0 + rank(embedding="clip_embedding"))
    + 0.4 / (1.0 + rank(embedding="twotower_embedding"))
    + 0.2 / (1.0 + rank(embedding="content_embedding"))
  ',
  input_user_id=$user_id
)
REORDER BY exploration(diversity_lookback_window=50)
LIMIT 20
```
`REORDER BY exploration()` injects variety from outside the candidate set — breaking filter bubbles by surfacing pins the user might like but wouldn’t discover through pure similarity. For more on exploration and diversity reordering, see the ranking architectures series.
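Conceptually, exploration reordering reserves a few feed slots for candidates outside the top-scored head. A deliberately simplified sketch (this is an illustration of the idea, not Shaped's algorithm):

```python
# Sketch: fill most slots from the top of the ranking, then sample a few
# exploratory items from deeper in the candidate list.
import random

def explore_reorder(ranked: list, slots: int = 20, explore_n: int = 3, seed: int = 0):
    head, tail = ranked[: slots - explore_n], ranked[slots - explore_n:]
    rng = random.Random(seed)  # seeded for reproducibility in this sketch
    return head + rng.sample(tail, k=min(explore_n, len(tail)))

feed = explore_reorder([f"pin_{i}" for i in range(100)])
print(len(feed))  # 20 items: 17 top-scored + 3 exploratory
```

Production systems typically sample the exploratory slots with probability weighted by score rather than uniformly, but the slot-reservation idea is the same.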
Comparison: Traditional vs. Shaped
~800 lines of infrastructure code → ~60 lines of YAML + query. Same multimodal architecture. Three pipelines collapsed into one engine. That’s the difference.
| Component | Traditional Multimodal | Shaped Multimodal |
|---|---|---|
| Visual embeddings | Self-hosted CLIP + FAISS (manage GPU, indexing, updates) | openai/clip-vit-base-patch32 in engine config (auto-computed, auto-indexed) |
| Image understanding | None — rely on pin descriptions (often empty) | AI Views: vision-language model extracts style, colors, objects, mood from every image |
| Collaborative model | ALS (separate system, weekly retraining) | Two-Tower (built-in, auto-trained, uses item features + interactions) |
| Text features | TF-IDF on descriptions (sparse, manual) | Sentence transformer embeddings on AI-enriched descriptions (dense, semantic) |
| Blending | Hardcoded in app code | ShapedQL score expressions (query-time, adaptive, no redeploy) |
| Cold start | Separate code path; new pins only get CLIP results | Adjust blend in query with CASE WHEN days_since_release < 3 |
| Index updates | Manual FAISS rebuild (often nightly) | Automatic — new pins indexed in real-time |
| Infrastructure | 3 models + 3 indexes + 3 pipelines + blending code | 1 engine, 1 YAML config |
| Lines of code | ~800 (CLIP pipeline + ALS + TF-IDF + FAISS + blending) | ~60 (YAML config + query) |
| Bottom line | Three teams maintaining three systems with brittle glue code | One config file, one query, ship in a day |
If you’re coming from a traditional recommendation stack and want to understand how Shaped’s query layer maps to the four-stage retrieval architecture, the Anatomy of Modern Ranking Architectures series covers this in depth.
FAQ
Q: Why use CLIP instead of training a custom vision model?
A: CLIP was trained on 400 million image-text pairs — it already understands visual concepts at a level that would take months to replicate from scratch. Shaped supports any model from HuggingFace’s CLIP model library in the `index.embeddings` config. Start with `openai/clip-vit-base-patch32` for speed, or swap in `openai/clip-vit-large-patch14` for higher quality. For domain-specific tuning, layer in a Two-Tower or BeeFormer model that incorporates image features alongside interaction data.
Q: What if my images don’t have public URLs?
A: The image_url field needs publicly accessible URLs for both CLIP embedding computation and AI View enrichment. If your images are in private storage (S3, GCS), use an SQL View to transform references into pre-signed URLs before they reach the engine.
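For illustration, here's the general shape of an expiring signed URL (a generic HMAC scheme with a hypothetical signing key and domain; in production you'd call your storage SDK's presign method, e.g. boto3's generate_presigned_url for S3, which implements the real SigV4 scheme):

```python
# Illustrative only: a generic HMAC-signed, expiring URL showing the shape of
# what the SQL View would emit in place of a private storage reference.
import hashlib
import hmac
import time
from urllib.parse import urlencode

SECRET = b"replace-me"  # hypothetical signing key

def signed_url(base: str, path: str, ttl_s: int = 3600, now: float = None) -> str:
    expires = int((now if now is not None else time.time()) + ttl_s)
    sig = hmac.new(SECRET, f"{path}:{expires}".encode(), hashlib.sha256).hexdigest()
    return f"{base}{path}?{urlencode({'expires': expires, 'sig': sig})}"

print(signed_url("https://img.example.com", "/pins/pin_8829.jpg"))
```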
Q: How does the AI View handle millions of images?
A: AI Views are materialized — the vision-language model processes each image once and stores the result. New images are enriched incrementally as they arrive. You never reprocess the catalog. See the multi-modal enrichment docs for configuration details.
Q: Why Two-Tower instead of ELSA for the collaborative signal?
A: ELSA and ALS are pure collaborative filtering models — they learn only from interaction co-occurrence. Two-Tower incorporates item features (including image type, board category, and text) into the item embedding, which means even pins with sparse interaction history get useful collaborative embeddings. If your platform has minimal item metadata and dense interactions, ELSA is simpler and faster. If you have rich item features (images, categories, descriptions), Two-Tower is the better choice. We compare these tradeoffs in our Discover Weekly article.
Q: Can I use this for e-commerce product images, not just Pinterest?
A: Yes — the architecture works for any image-heavy catalog: fashion (outfit similarity), real estate (property photos), interior design (room matching), food delivery (dish similarity), and marketplace listings. Replace pins with products, adjust the AI View prompt (e.g., “describe the garment’s cut, fabric, pattern, and occasion”), and the engine config stays the same.
Q: How do I tune the blend weights?
A: Start with 50/30/20 (visual/collaborative/content). Measure save rate, click-through rate, and session depth. Change weights in ShapedQL — no retraining, no redeployment. You can run multiple saved queries with different blends simultaneously to A/B test.
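One simple way to run that A/B test (a sketch; the variant names and 50/50 split are invented): hash each user_id into a stable bucket and route to the matching saved query.

```python
# Deterministic A/B assignment: same user always lands in the same bucket.
import hashlib

VARIANTS = {  # hypothetical blend variants to compare
    "control":     {"clip": 0.5, "twotower": 0.3, "content": 0.2},
    "more_visual": {"clip": 0.6, "twotower": 0.2, "content": 0.2},
}

def assign_variant(user_id: str) -> str:
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return "more_visual" if bucket < 50 else "control"

print(assign_variant("user_445"), assign_variant("user_445"))  # stable across sessions
```

Hashing (rather than random assignment) keeps each user's experience consistent across sessions without storing any assignment state.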
Q: Can users search “boho bedroom” and get visually coherent results?
A: Yes. Add `lexical_search` to the engine config for keyword matching, and use `text_search(mode='vector', text_embedding_ref='content_embedding')` for semantic search on AI-enriched descriptions. You can combine text retrieval with CLIP scoring in the same query — useful for search results that look visually coherent, not just keyword-matched. See the search guide for patterns.
Q: Can I blend more than 3 signals?
A: ShapedQL supports arbitrary scoring expressions. Add popularity, recency, user affinity, or any custom feature:
```sql
ORDER BY score(
  expression='
    0.35 / (1.0 + rank(embedding="clip_embedding"))
    + 0.25 / (1.0 + rank(embedding="twotower_embedding"))
    + 0.15 / (1.0 + rank(embedding="content_embedding"))
    + 0.1 / (1.0 + item._derived_popular_rank)
    + 0.1 / (1.0 + days_since_release)
    + 0.05 * user_category_affinity
  '
)
```
Q: What about Pinterest’s actual implementation?
A: Pinterest uses PinSage — a graph neural network trained on the pin-board bipartite graph — alongside visual embeddings, text signals, and real-time engagement features. More recently, they’ve moved toward generative retrieval models like PinRec (we wrote a full teardown of the architecture). Their system reflects a decade of infrastructure investment. The principles are identical: multimodal signals, blended scoring, adaptive cold-start handling. Shaped lets you build a production-grade version of this architecture without the multi-year infrastructure build.
Conclusion
Remember the mid-century modern living room from the top of this article? A user saves it. Thirty seconds later, they’re scrolling through related pins — walnut furniture, Scandinavian-MCM fusion spaces, a vintage furniture store, a DIY refinishing guide. Some of those pins were uploaded this morning with zero saves. One had no title at all.
That’s not magic. It’s three signals doing what none of them can do alone.
CLIP saw that the images share composition, color palette, and visual style — and gave every new pin a visual fingerprint the moment it was uploaded. The AI View looked at an untitled image and wrote “mid-century modern living room, walnut credenza, mustard yellow accent chair, brass pendant lighting, warm, curated, 1960s influence” — turning empty metadata into searchable features. Two-Tower learned that users who save MCM living rooms also save furniture shopping pins and restoration guides — intent signals that no visual model could capture.
ShapedQL blended them: 50% visual for the core aesthetic match, 30% collaborative to surface what users like this one actually want, 20% semantic to catch the style connections that bridge living rooms and bedrooms and furniture stores. For the pin uploaded this morning, the blend shifted automatically: 60% visual, 40% semantic, no collaborative signal needed.
The traditional approach to building this requires a self-hosted CLIP inference pipeline, a FAISS index you rebuild nightly, an ALS model you retrain weekly, a TF-IDF index that’s useless when descriptions are empty, and a blending function hardcoded in application code. Three teams maintaining three systems with brittle glue code.
Shaped collapses all of it: one engine, one YAML config, one query. CLIP embeddings managed for you. AI Views that turn every image into searchable features automatically. Two-Tower trained on your interaction data. ShapedQL to blend everything at query time, no redeployment needed.
Ready to build your own Related Pins? Sign up for Shaped and get $100 in free credits. Visit console.shaped.ai/register to get started.
Want us to walk you through it?
Book a 30-min session with an engineer who can apply this to your specific stack.