Quick Answer: Visual Embeddings + Collaborative Signals + AI-Extracted Semantics
A user saves a pin of a mid-century modern living room. Thirty seconds later, they’re scrolling through a feed of related pins: similar furniture styles, matching color palettes, rooms with the same cozy-but-minimal aesthetic. Some of these pins were uploaded this morning — zero saves, zero clicks, no engagement history at all.
How?
Pinterest’s Related Pins blends three types of intelligence:
- Visual similarity (CLIP embeddings): “This image looks like that image” — color, composition, style, objects
- Collaborative filtering (Two-Tower): “Users who saved this also saved…”
- Semantic understanding (AI-extracted features): Style labels, dominant colors, mood, room type — pulled directly from the image by a vision-language model
None of the three works alone:
- Visual similarity alone surfaces anything that looks similar — a red shoe and a red car share color features, but nobody wants a car when shopping for shoes
- Collaborative filtering alone can’t recommend new pins (zero interactions = zero signal)
- Text metadata alone is often empty on Pinterest — most pins have no description, just an image
Multimodal discovery fuses all three signals. The result: new content enters the discovery loop from the moment it’s uploaded, and established content benefits from behavioral intelligence that captures taste patterns no visual model can learn.
Key Takeaways:
- CLIP embeddings give every image a visual fingerprint instantly — no interaction history needed
- AI Views fill the metadata gap — vision-language models extract style, objects, mood, and color from images that have no title or description
- Two-Tower models learn what CLIP can’t — personalized taste patterns from save/click behavior
- Score ensembles in ShapedQL blend all three signals adaptively — more visual weight for new pins, more collaborative weight for established ones
- One engine, one config — no separate FAISS index, no ALS retraining pipeline, no TF-IDF
Time to read: 22 minutes | Includes: 9 code examples, 2 architecture diagrams, 1 comparison table
This is Part 2 of the “How to Build” series. Part 1 covers Spotify’s Discover Weekly with hybrid filtering. This article focuses on image-first platforms where visual similarity is the primary signal.
Table of Contents
- Why Visual Discovery Is a Different Problem
- Why Pure Visual Similarity Fails
- Why Pure Collaborative Filtering Fails
- Part 1: The Traditional Approach (and Why It Hurts)
- Part 2: The Shaped Way — CLIP + AI Views + Two-Tower
- Building the System End-to-End
- Score Ensemble Strategies
- Comparison: Traditional vs. Shaped
- FAQ
Why Visual Discovery Is a Different Problem
Most recommendation systems start with structured data: product catalogs with brand, category, price, and size. You can filter, facet, and match on those attributes.
Pinterest doesn’t have that luxury. The core unit of content is an image. And images don’t come with structured metadata. A pin of a beautiful kitchen doesn’t have a `style: scandinavian` field. It has pixels.
The Pinterest Discovery Loop
A user sees a pin. They save it. Pinterest must now answer the hardest question in visual recommendation:
What makes this pin interesting to this user?
Is it the furniture style? The color palette? The room layout? The specific lamp in the corner? The vibe?
The answer varies by user. Two people save the same photo of a rustic farmhouse kitchen:
- User A (renovating their house) wants to see similar kitchen layouts and cabinet styles
- User B (a food blogger) wants to see recipe pins shot in similar-looking kitchens
This is why text-based recommendations fail on visual platforms. The interesting features are latent — they live inside the pixels, and their relevance depends on who’s looking.
What Related Pins Actually Does
When you tap a pin on Pinterest, the Related Pins feed blends:
- Visual features — CLIP embeddings capture composition, color, objects, style
- Collaborative signals — Save/click co-occurrence patterns learned from user behavior
- Semantic signals — Pin descriptions, board names, and extracted text
- Engagement signals — Save rate, click-through rate, close-up rate
When a new pin is uploaded with zero saves, visual features and semantic signals carry the load. As the pin accumulates interactions, collaborative signals take over. The blend shifts automatically.
For a deep dive into how Pinterest’s production retrieval model actually works under the hood, see our PinRec teardown.
Why Pure Visual Similarity Fails
CLIP (Contrastive Language-Image Pre-training) encodes images and text into the same embedding space. Two images with similar CLIP embeddings look visually similar. This is powerful — but it has three failure modes that prevent it from working alone.
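To make "similar embeddings" concrete, here's a toy nearest-neighbor sketch (numpy; the vectors and pin names are invented stand-ins, and real CLIP embeddings are 512-dimensional, not 3):

```python
# Toy illustration (not Shaped's internals): nearest-neighbor retrieval in an
# embedding space. Real CLIP vectors are 512-d; these 3-d vectors are stand-ins.
import numpy as np

def top_k_similar(query: np.ndarray, catalog: dict, k: int = 3):
    """Rank catalog items by cosine similarity to the query embedding."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    scored = sorted(((cos(query, v), item) for item, v in catalog.items()), reverse=True)
    return [item for _, item in scored[:k]]

catalog = {
    "white_kitchen_marble":   np.array([0.9, 0.8, 0.1]),
    "white_kitchen_pendants": np.array([0.85, 0.75, 0.15]),
    "white_bathroom":         np.array([0.9, 0.2, 0.1]),   # same palette, wrong room
    "red_sports_car":         np.array([0.1, 0.1, 0.9]),
}
query = np.array([0.88, 0.79, 0.12])  # the saved white-kitchen pin

print(top_k_similar(query, catalog))
```

Note that the bathroom still makes the top three purely on palette, which is exactly the first failure mode below.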
Failure 1: Visual similarity does not equal intent similarity
A user saves a pin of a white minimalist kitchen. Pure CLIP similarity returns:
| Rank | Result | Relevant? |
|---|---|---|
| 1 | White minimalist kitchen with marble island | Yes |
| 2 | White minimalist kitchen with pendant lights | Yes |
| 3 | White minimalist bathroom (similar palette) | No — wrong room |
| 4 | White product photography studio (similar lighting) | No — wrong category |
| 5 | White minimalist kitchen from a real estate listing | Borderline |
CLIP captures what the image looks like, not what the user wants from it. Visually similar does not equal useful.
Failure 2: No personalization, no intent
CLIP embeddings describe the image, not the user. Everyone who taps the same pin gets identical results — there’s no “users like you tend to prefer…” signal.
Worse, CLIP can’t distinguish why someone saved a pin. A blue velvet sofa in a styled living room gets saved for the fabric, the layout, and the brand — by three different users with three different needs. CLIP returns the same visually similar images for all three.
| Limitation | Impact |
|---|---|
| Visual does not equal intent | Irrelevant results across categories (kitchens to bathrooms) |
| No personalization | Same results for every user, every time |
| Ambiguous intent | Can’t tell why a user saved a pin — fabric? layout? brand? |
Why Pure Collaborative Filtering Fails
Collaborative filtering learns from behavior: “Users who saved Pin A also saved Pin B.” It captures taste patterns that no visual model can learn — but it has a fatal weakness on visual platforms.
A new pin arrives: a stunning sunset over Santorini, uploaded 5 minutes ago. Collaborative filtering has zero interaction data. The pin never enters Related Pins. By the time enough users save it, it’s no longer new.
This cold-start problem is worse on Pinterest than on music or e-commerce platforms. Those have finite catalogs. Pinterest has an effectively infinite stream of user-generated images — and collaborative filtering is blind to all of them until humans interact.
The other limitations compound: popularity bias means the top 1% of pins dominate recommendations, sparse interactions (most pins have fewer than 5 saves) produce noisy signals, and collaborative filtering has no visual understanding — two co-saved pins might look completely different.
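The cold-start gap is easy to see in a toy item-to-item co-occurrence model (pure Python, invented save data):

```python
# Toy item-to-item collaborative filtering over save events (invented data).
# score(b | a) = number of users who saved both a and b.
from collections import Counter
from itertools import combinations

saves = {  # user -> set of saved pins
    "u1": {"pin_A", "pin_B"},
    "u2": {"pin_A", "pin_B", "pin_C"},
    "u3": {"pin_A", "pin_C"},
}

co_saves = Counter()
for pins in saves.values():
    for a, b in combinations(sorted(pins), 2):
        co_saves[(a, b)] += 1
        co_saves[(b, a)] += 1

def related(pin, k=5):
    scored = [(n, other) for (p, other), n in co_saves.items() if p == pin]
    return [other for n, other in sorted(scored, reverse=True)[:k]]

print(related("pin_A"))    # established pin: has neighbors
print(related("pin_new"))  # uploaded 5 minutes ago: [] -- invisible to CF
```

`pin_new` has features any human can see instantly, but until someone saves it, co-occurrence counting has literally nothing to return.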
If you’ve read our Discover Weekly hybrid filtering playbook, you know the cold-start story. The difference on visual platforms is that the solution requires a visual signal, not just content enrichment on text metadata.
Part 1: The Traditional Approach (and Why It Hurts)
The traditional approach runs three separate systems and merges their outputs in application code.
Architecture: CLIP + FAISS → visual scores; ALS → collaborative scores; TF-IDF → text scores → blend in application code → rank by final score → top 20 related pins
Here’s what that implementation looks like in practice — and why every step is a maintenance burden:
```python
# related_pins_traditional.py — The pain in 40 lines
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Assumed module-level state — each one is its own pipeline to maintain:
# clip_embeddings, faiss_index, als_model, text_vectors, normalize()

def get_related_pins(pin_id, n=20):
    # 1. Visual: query CLIP FAISS index (you manage GPU inference + index updates)
    clip_embedding = clip_embeddings[pin_id]
    visual_distances, visual_ids = faiss_index.search(clip_embedding.reshape(1, -1), 100)

    # 2. Collaborative: compute ALS dot products (you retrain weekly)
    collab_scores = {}
    pin_factor = als_model.item_factors[pin_id]
    for cid in visual_ids[0]:
        collab_scores[cid] = float(np.dot(pin_factor, als_model.item_factors[cid]))

    # 3. Text: TF-IDF cosine similarity (pin descriptions are often empty)
    text_scores = {}
    for cid in visual_ids[0]:
        text_scores[cid] = cosine_similarity(text_vectors[pin_id], text_vectors[cid])[0][0]

    # 4. Blend in application code (hardcoded weights, no adaptivity)
    blended = {}
    for cid in visual_ids[0]:
        blended[cid] = (
            0.5 * normalize(visual_distances[0][list(visual_ids[0]).index(cid)])
            + 0.3 * normalize(collab_scores.get(cid, 0))
            + 0.2 * normalize(text_scores.get(cid, 0))
        )
    return sorted(blended.items(), key=lambda x: x[1], reverse=True)[:n]
```
This is ~40 lines. But behind those 40 lines, you’re managing:
| What You Maintain | What It Costs |
|---|---|
| CLIP inference pipeline | GPU provisioning, model serving, batch processing for new pins |
| FAISS index | Incremental updates are hard; most teams rebuild nightly |
| ALS model | Weekly retraining, sparse matrix construction, model serialization |
| TF-IDF index | Useless when pin descriptions are empty (most of the time on Pinterest) |
| Blending logic | Hardcoded in application code; changing weights requires a deploy |
| Three separate data pipelines | ETL, monitoring, alerting — times three |
The real cost isn’t the code. It’s the infrastructure. Three models, three indexes, three update pipelines, one fragile blending function, and no adaptivity.
Part 2: The Shaped Way — CLIP + AI Views + Two-Tower
Shaped replaces all three systems with a single engine that computes CLIP embeddings, extracts visual features via AI Views, trains collaborative models, and blends everything in ShapedQL.
Architecture: pin image → CLIP embeddings (openai/clip-vit-base-patch32); pin image → AI View (VLM extracts style, colors, objects, mood) → text embeddings (sentence-transformers/modernbert); interactions + item features → Two-Tower model (auto-trained) → one ShapedQL blend
Three key differences from the traditional approach:
- AI Views fill the metadata gap. Most pins have no description. Shaped’s AI View runs a vision-language model over every pin image and extracts structured attributes — style, dominant colors, objects, room type, mood — automatically. This turns “untitled image with zero metadata” into a richly described item that text embeddings can reason about.
- CLIP embeddings are managed infrastructure. No FAISS. No GPU provisioning. Declare `openai/clip-vit-base-patch32` in your engine config and Shaped handles inference, indexing, and incremental updates.
- Score blending is a query, not a deploy. Changing from 50/30/20 to 60/20/20 is a one-line edit in ShapedQL. No model retraining, no redeployment.
Step 1: AI Views — Turning Images Into Searchable Features
This is the Shaped differentiator for visual platforms. Pinterest’s core problem is that images lack metadata. AI Views solve this by analyzing the actual pixels.
```yaml
# views/pin_image_enrichment.yaml
version: v2
name: pin_image_enrichment
view_type: AI_ENRICHMENT
source_table: pins
source_columns:
  - item_id
  - title
  - image_url
source_columns_in_output:
  - item_id
  - title
enriched_output_columns:
  - visual_description
prompt: |
  Analyze the pin image and extract visual attributes as a concise description.
  Include: primary design style (e.g., minimalist, bohemian, industrial, mid-century modern),
  dominant colors and color palette, key objects and furniture visible,
  room type or scene category, mood or aesthetic (e.g., cozy, dramatic, airy),
  and notable patterns, textures, or materials.
  Focus on attributes that help match this pin with visually similar content.
  Keep the description factual — no marketing language.
```
Shaped runs a vision-language model over every pin image and materializes the result. Here’s what the output looks like for real pins:
Pin A: No title, no description — just an image of a living room.
| Field | Value |
|---|---|
item_id | pin_8829 |
title | (empty) |
visual_description | Mid-century modern living room. Walnut credenza, mustard yellow accent chair, brass pendant lighting. White walls, warm wood floors. Geometric rug in cream and rust. Clean lines, organic shapes. Warm, curated, 1960s influence. |
Pin B: Title is just a plant emoji — not useful for matching.
| Field | Value |
|---|---|
item_id | pin_3341 |
title | (plant emoji) |
visual_description | Indoor plant arrangement on wooden floating shelf. Monstera deliciosa, trailing pothos, snake plant in terracotta pots and woven baskets. White wall backdrop, natural light from left. Bohemian, earthy, organic aesthetic. |
Pin C: Title says “inspo” — not useful for matching.
| Field | Value |
|---|---|
item_id | pin_7102 |
title | inspo |
visual_description | Scandinavian kitchen. White oak cabinets, terrazzo countertops, matte black faucet. Open shelving with ceramic dishes. Linen pendant lights. Sage green subway tile backsplash. Minimal, airy, natural materials. |
Why this matters: Without the AI View, pins A, B, and C have no usable text features. TF-IDF on an empty title returns nothing. But with AI-enriched descriptions, text embeddings can now compute meaningful semantic similarity: Pin A and Pin C both describe “minimal” aesthetics with “natural materials” — they’ll appear in each other’s related feeds even though one is a living room and the other is a kitchen, because the style matches.
For more on configuring AI Views — including multi-column enrichment and prompt best practices — see the AI enrichment documentation.
Step 2: Configure the Engine — CLIP + Text + Two-Tower
```yaml
# engines/related_pins.yaml
version: v2
name: related_pins

data:
  item_table:
    name: pins
    type: table
  user_table:
    name: users
    type: table
  interaction_table:
    name: pin_interactions
    type: table

schema_override:
  item:
    id: item_id
    features:
      - name: image_url
        type: Image
      - name: title
        type: Text
      - name: board_category
        type: TextCategory
    created_at: created_at
  interaction:
    id: interaction_id
    item_id: item_id
    user_id: user_id
    label: interaction_type
    created_at: created_at

index:
  embeddings:
    # 1. Visual: CLIP embeddings on pin images
    - name: clip_embedding
      encoder:
        type: hugging_face
        model_name: openai/clip-vit-base-patch32
        batch_size: 32
      item_fields:
        - image_url
    # 2. Semantic: Text embeddings on AI-enriched descriptions
    - name: content_embedding
      encoder:
        type: hugging_face
        model_name: sentence-transformers/modernbert
        batch_size: 256
      item_fields:
        - pin_image_enrichment.visual_description
    # 3. Collaborative: Trained Two-Tower embeddings
    - name: twotower_embedding
      encoder:
        type: trained_model
        model_ref: twotower_collab

training:
  models:
    - name: twotower_collab
      policy_type: two_tower
      strategy: early_stopping
```
When you run `shaped create-engine --file engines/related_pins.yaml`, Shaped:
- Computes CLIP embeddings for every pin image (batch_size: 32 — smaller batches for image processing)
- Generates sentence transformer embeddings on AI-enriched visual descriptions (batch_size: 256 for text)
- Trains a Two-Tower model on pin_interactions — learning user-item affinities that blend collaborative and content signals
- Builds ANN indexes for all three embedding spaces
- Auto-updates when new pins arrive — no manual re-indexing
Why Two-Tower instead of ELSA or ALS? Pure collaborative models like ELSA and ALS learn item relationships solely from co-occurrence patterns — they ignore item features entirely. Two-Tower is different: it separates user and item computation into two neural networks, and the item tower can incorporate rich features like image embeddings, board categories, and text descriptions alongside interaction data. On a visual platform where item features (the image itself) carry enormous signal, Two-Tower produces collaborative embeddings that are informed by visual content — not just by who saved what. This means even items with sparse interaction history get meaningful collaborative embeddings if their visual features are strong. For a deeper comparison of model policies, see the choosing a model policy guide.
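A minimal numpy sketch of the two-tower idea (invented weights and dimensions, one linear layer per tower; real towers are deeper networks trained on interactions):

```python
# Minimal two-tower sketch (numpy, invented random weights — not Shaped's model).
# The item tower maps item *features* to an embedding, so an item with zero
# interactions still lands somewhere meaningful in the shared space.
import numpy as np

rng = np.random.default_rng(0)
D_USER, D_ITEM_FEAT, D_EMB = 4, 6, 3

W_user = rng.normal(size=(D_EMB, D_USER))       # user tower (one linear layer)
W_item = rng.normal(size=(D_EMB, D_ITEM_FEAT))  # item tower (one linear layer)

def user_embedding(user_features):
    return W_user @ user_features

def item_embedding(item_features):
    # item_features might concatenate a CLIP vector, category one-hots, etc.
    return W_item @ item_features

def affinity(user_features, item_features):
    # Dot product in the shared embedding space = predicted user-item affinity
    return float(user_embedding(user_features) @ item_embedding(item_features))

user = rng.normal(size=D_USER)
established_pin = rng.normal(size=D_ITEM_FEAT)
brand_new_pin = rng.normal(size=D_ITEM_FEAT)  # zero saves, but it has features

# Both pins get a score: no cold-start gap at the item tower.
print(affinity(user, established_pin), affinity(user, brand_new_pin))
```

The design choice this illustrates: because the item tower consumes features rather than an item ID lookup, a brand-new pin with strong visual features still gets a usable collaborative-style embedding.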
Step 3: Query — The Related Pins API Call
```python
# app.py
import requests

SHAPED_API_KEY = "your-api-key"

def get_related_pins(pin_id: str, user_id: str = None, limit: int = 20):
    """
    Multimodal Related Pins: CLIP (visual) + Two-Tower (collab) + AI-enriched (semantic)
    """
    response = requests.post(
        "https://api.shaped.ai/v2/engines/related_pins/query",
        headers={"x-api-key": SHAPED_API_KEY},
        json={
            "query": """
                SELECT *
                FROM similarity(
                    embedding_ref='clip_embedding',
                    encoder='item_attribute_pooling',
                    input_item_id=$item_id,
                    limit=500
                )
                WHERE item_id != $item_id
                ORDER BY score(
                    expression='
                        0.5 / (1.0 + rank(embedding="clip_embedding"))
                        + 0.3 / (1.0 + rank(embedding="twotower_embedding"))
                        + 0.2 / (1.0 + rank(embedding="content_embedding"))
                    ',
                    input_user_id=$user_id
                )
                REORDER BY diversity(diversity_lookback_window=50)
                LIMIT $limit
            """,
            "parameters": {
                "item_id": pin_id,
                "user_id": user_id,
                "limit": limit
            },
            "return_metadata": True
        }
    )
    return response.json()['results']
```
What each stage does:
- Retrieve — `similarity(embedding_ref='clip_embedding', input_item_id=$item_id)` fetches the 500 most visually similar pins to the source pin using CLIP embeddings
- Filter — `WHERE item_id != $item_id` excludes the pin itself
- Score — Blends three signals: 50% visual similarity (CLIP), 30% collaborative affinity (Two-Tower), 20% semantic match (AI-enriched text). The `input_user_id` personalizes the Two-Tower score.
- Reorder — `diversity(diversity_lookback_window=50)` prevents the feed from showing 20 near-identical images
- Return — Top 20 related pins with metadata
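The score expression is a weighted reciprocal-rank fusion. Here's the same arithmetic in pure Python (item names and ranks invented; this sketch assumes 0-based ranks, which is an assumption about Shaped's `rank()` convention):

```python
# Weighted reciprocal-rank fusion: weight / (1 + rank), summed across signals.
def blend(ranks: dict, weights: dict) -> dict:
    """ranks: signal -> {item: rank}; items missing from a signal get no contribution."""
    scores = {}
    for signal, weight in weights.items():
        for item, rank in ranks[signal].items():
            scores[item] = scores.get(item, 0.0) + weight / (1.0 + rank)
    return scores

ranks = {  # invented toy ranks, 0 = best
    "clip":     {"pin_X": 0, "pin_Y": 5},
    "twotower": {"pin_X": 4, "pin_Y": 0},
    "content":  {"pin_X": 1, "pin_Y": 3},
}
weights = {"clip": 0.5, "twotower": 0.3, "content": 0.2}

scores = blend(ranks, weights)
print(sorted(scores, key=scores.get, reverse=True))  # pin_X wins on visual rank
```

Rank fusion is robust because it never compares raw scores across embedding spaces, only positions within each ranked list.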
Example response:
```json
{
  "results": [
    {
      "item_id": "pin_9921",
      "title": "Living Room Goals",
      "image_url": "https://cdn.example.com/pin_9921.jpg",
      "board_category": "Home Decor",
      "score": 0.847,
      "metadata": {
        "clip_rank": 3,
        "twotower_rank": 1,
        "content_rank": 7
      }
    },
    {
      "item_id": "pin_0445",
      "title": "",
      "image_url": "https://cdn.example.com/pin_0445.jpg",
      "board_category": "Home Decor",
      "score": 0.812,
      "metadata": {
        "clip_rank": 1,
        "twotower_rank": 12,
        "content_rank": 2
      }
    }
  ]
}
```
Notice `pin_0445` — it has no title. But it ranked #2 overall because it was the top CLIP match (#1 visual similarity) and the AI-enriched description made it #2 for content similarity. Without the AI View, this pin would have had zero text signal and ranked much lower.
The Difference in Action
This is where the three signals earn their keep. Let’s go back to the opening scenario: a user saves a pin of a mid-century modern living room — walnut furniture, mustard accents, brass lighting.
Here’s what each approach returns:
CLIP only (visual similarity):
| # | Result | Why it matched | Relevant? |
|---|---|---|---|
| 1 | Mid-century modern dining room, walnut table | Same wood tones, similar era | Yes |
| 2 | Mid-century modern living room, teak credenza | Nearly identical style | Yes |
| 3 | Minimalist bathroom, brass fixtures, warm wood | Same color palette, wrong room | No |
| 4 | Real estate listing photo, staged living room | Same composition, stock photography feel | Borderline |
| 5 | Mid-century office space, Herman Miller chairs | Same era, wrong context (commercial) | Borderline |
3 out of 5 are useful. The bathroom and office sneak in because they look similar.
Shaped multimodal (CLIP + Two-Tower + AI View):
| # | Result | Why it ranked | Relevant? |
|---|---|---|---|
| 1 | Mid-century modern living room, similar layout, different color scheme | CLIP: visual match. Two-Tower: users who saved the source pin also saved this. Content: AI View tagged both as “mid-century modern, warm tones, organic shapes” | Yes |
| 2 | Scandinavian-MCM fusion living room, low-profile sofa | CLIP: similar composition. Content: AI View found “clean lines, natural materials, 1960s influence” — shared style DNA | Yes |
| 3 | Mid-century modern bedroom, walnut headboard, mustard throw | CLIP: similar palette. Two-Tower: high co-save rate with living room pins. Content: “warm wood, mustard accent, mid-century” | Yes |
| 4 | Vintage furniture store pin showing a walnut credenza | Two-Tower: users who save MCM rooms also save furniture shopping pins. Content: “walnut, mid-century, credenza” | Yes |
| 5 | DIY guide: how to refinish mid-century furniture | Two-Tower: strong co-save signal. Content: “mid-century, restoration, walnut.” CLIP rank was #47 (visually different — it’s a tutorial, not a room photo) | Yes |
5 out of 5 are useful. The bathroom and office are gone. Instead, the system surfaced a furniture shopping pin (#4) and a DIY guide (#5) that CLIP alone would never have found — they don’t look like living rooms, but they match the user’s intent.
That’s the difference. CLIP finds images that look alike. The multimodal blend finds content that belongs together in a user’s discovery journey.
Handling Cold Start: New Pins with Zero Interactions
When a pin is uploaded with zero saves:
- CLIP embedding — computed immediately from the image
- Content embedding — AI View extracts description, text embedding generated
- Two-Tower embedding — no interaction data yet (skipped)
Shift the blend:
```python
# app.py
def get_related_to_new_pin(pin_id: str, limit: int = 20):
    """
    New pins: 60% visual + 40% semantic. No collaborative signal needed.
    """
    response = requests.post(
        "https://api.shaped.ai/v2/engines/related_pins/query",
        headers={"x-api-key": SHAPED_API_KEY},
        json={
            "query": """
                SELECT *
                FROM similarity(
                    embedding_ref='clip_embedding',
                    encoder='item_attribute_pooling',
                    input_item_id=$item_id,
                    limit=500
                )
                WHERE item_id != $item_id
                ORDER BY score(
                    expression='
                        0.6 / (1.0 + rank(embedding="clip_embedding"))
                        + 0.4 / (1.0 + rank(embedding="content_embedding"))
                    '
                )
                REORDER BY diversity(diversity_lookback_window=50)
                LIMIT $limit
            """,
            "parameters": {
                "item_id": pin_id,
                "limit": limit
            }
        }
    )
    return response.json()['results']
```
The pin enters the Related Pins feed immediately. No waiting for saves. No cold-start gap. As interactions accumulate, you transition to the full three-signal blend with the CASE WHEN pattern shown in the Score Ensemble Strategies section.
Building the System End-to-End
Full setup in four steps
1. Connect your data
```yaml
# tables/pins.yaml
version: v2
name: pins
connector:
  type: postgres
  connection_string: $DATABASE_URL
  table: pins
schema:
  - name: item_id
    type: STRING
  - name: title
    type: STRING
  - name: description
    type: STRING
  - name: image_url
    type: STRING
  - name: board_category
    type: STRING
  - name: created_at
    type: TIMESTAMP
```

```yaml
# tables/pin_interactions.yaml
version: v2
name: pin_interactions
connector:
  type: postgres
  connection_string: $DATABASE_URL
  table: pin_interactions
schema:
  - name: interaction_id
    type: STRING
  - name: user_id
    type: STRING
  - name: item_id
    type: STRING
  - name: interaction_type
    type: STRING
  - name: created_at
    type: TIMESTAMP
```
2. Create the AI View
```bash
shaped create-view --file views/pin_image_enrichment.yaml
```
Shaped processes every pin image through the vision-language model and materializes the descriptions. New pins are enriched automatically as they arrive in the pins table.
3. Create and train the engine
```bash
shaped create-engine --file engines/related_pins.yaml
```
4. Query
```python
related = get_related_pins(pin_id="pin_8829", user_id="user_445", limit=20)
```
That’s it. No FAISS to manage. No ALS retraining cron job. No TF-IDF pipeline for text features. One YAML config, one AI View, one query.
Score Ensemble Strategies
ShapedQL score expressions let you adapt the blend dynamically — without retraining or redeploying. Here are four strategies for different contexts.
Strategy 1: Adaptive blending by pin age
New pins rely on visual + semantic. Established pins lean on collaborative.
```sql
-- adaptive_by_age.sql
ORDER BY score(
  expression='
    CASE
      WHEN days_since_release < 3 THEN
        0.55 / (1.0 + rank(embedding="clip_embedding"))
        + 0.45 / (1.0 + rank(embedding="content_embedding"))
      WHEN days_since_release < 14 THEN
        0.4 / (1.0 + rank(embedding="clip_embedding"))
        + 0.3 / (1.0 + rank(embedding="twotower_embedding"))
        + 0.3 / (1.0 + rank(embedding="content_embedding"))
      ELSE
        0.3 / (1.0 + rank(embedding="clip_embedding"))
        + 0.5 / (1.0 + rank(embedding="twotower_embedding"))
        + 0.2 / (1.0 + rank(embedding="content_embedding"))
    END
  '
)
```
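The same CASE logic, mirrored in Python for readability (thresholds and weights taken from the SQL above):

```python
# Age-adaptive blend weights, mirroring the SQL CASE expression.
def blend_weights(days_since_release: float) -> dict:
    if days_since_release < 3:     # brand new: no collaborative signal yet
        return {"clip": 0.55, "twotower": 0.0, "content": 0.45}
    elif days_since_release < 14:  # warming up: collaborative enters the mix
        return {"clip": 0.4, "twotower": 0.3, "content": 0.3}
    else:                          # established: behavior dominates
        return {"clip": 0.3, "twotower": 0.5, "content": 0.2}

print(blend_weights(1), blend_weights(30))
```

The advantage of expressing this in ShapedQL rather than application code is that the thresholds live in the query, so tuning them needs no deploy.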
Strategy 2: Boost trending pins
Surface pins that are gaining traction by mixing in popularity:
```sql
-- trending_visual.sql
ORDER BY score(
  expression='
    0.4 / (1.0 + rank(embedding="clip_embedding"))
    + 0.25 / (1.0 + rank(embedding="twotower_embedding"))
    + 0.15 / (1.0 + rank(embedding="content_embedding"))
    + 0.2 / (1.0 + item._derived_popular_rank)
  '
)
```
Strategy 3: Category-constrained visual similarity
Keep results within the same board category to prevent cross-category bleed (no bathroom results for a kitchen pin):
```sql
-- same_category.sql
SELECT *
FROM similarity(
  embedding_ref='clip_embedding',
  encoder='item_attribute_pooling',
  input_item_id=$item_id,
  limit=500
)
WHERE board_category = $source_category
  AND item_id != $item_id
ORDER BY score(
  expression='
    0.5 / (1.0 + rank(embedding="clip_embedding"))
    + 0.3 / (1.0 + rank(embedding="twotower_embedding"))
    + 0.2 / (1.0 + rank(embedding="content_embedding"))
  '
)
REORDER BY diversity(diversity_lookback_window=50)
LIMIT 20
```
Strategy 4: Personalized visual home feed
The strategies above start from a single pin. But you can also retrieve pins similar to a user’s entire recent save history — turning the Related Pins engine into a personalized home feed. The key difference: `encoder='interaction_round_robin'` pools the user’s recent saves instead of anchoring on one pin.
```sql
-- visual_home_feed.sql
SELECT *
FROM similarity(
  embedding_ref='clip_embedding',
  encoder='interaction_round_robin',
  input_user_id=$user_id,
  limit=500
)
ORDER BY score(
  expression='
    0.4 / (1.0 + rank(embedding="clip_embedding"))
    + 0.4 / (1.0 + rank(embedding="twotower_embedding"))
    + 0.2 / (1.0 + rank(embedding="content_embedding"))
  ',
  input_user_id=$user_id
)
REORDER BY exploration(diversity_lookback_window=50)
LIMIT 20
```
`REORDER BY exploration()` injects variety from outside the candidate set — breaking filter bubbles by surfacing pins the user might like but wouldn’t discover through pure similarity. For more on exploration and diversity reordering, see the ranking architectures series.
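Conceptually, exploration reordering reserves a few feed slots for candidates outside the top-scored head. A deliberately simplified sketch (this is an illustration of the idea, not Shaped's algorithm):

```python
# Sketch: fill most slots from the top of the ranking, then sample a few
# exploratory items from deeper in the candidate list.
import random

def explore_reorder(ranked: list, slots: int = 20, explore_n: int = 3, seed: int = 0):
    head, tail = ranked[: slots - explore_n], ranked[slots - explore_n:]
    rng = random.Random(seed)  # seeded for reproducibility in this sketch
    return head + rng.sample(tail, k=min(explore_n, len(tail)))

feed = explore_reorder([f"pin_{i}" for i in range(100)])
print(len(feed))  # 20 items: 17 top-scored + 3 exploratory
```

Production systems typically sample the exploratory slots with probability weighted by score rather than uniformly, but the slot-reservation idea is the same.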
Comparison: Traditional vs. Shaped
~800 lines of infrastructure code → ~60 lines of YAML + query. Same multimodal architecture. Three pipelines collapsed into one engine. That’s the difference.
| Component | Traditional Multimodal | Shaped Multimodal |
|---|---|---|
| Visual embeddings | Self-hosted CLIP + FAISS (manage GPU, indexing, updates) | openai/clip-vit-base-patch32 in engine config (auto-computed, auto-indexed) |
| Image understanding | None — rely on pin descriptions (often empty) | AI Views: vision-language model extracts style, colors, objects, mood from every image |
| Collaborative model | ALS (separate system, weekly retraining) | Two-Tower (built-in, auto-trained, uses item features + interactions) |
| Text features | TF-IDF on descriptions (sparse, manual) | Sentence transformer embeddings on AI-enriched descriptions (dense, semantic) |
| Blending | Hardcoded in app code | ShapedQL score expressions (query-time, adaptive, no redeploy) |
| Cold start | Separate code path; new pins only get CLIP results | Adjust blend in query with CASE WHEN days_since_release < 3 |
| Index updates | Manual FAISS rebuild (often nightly) | Automatic — new pins indexed in real-time |
| Infrastructure | 3 models + 3 indexes + 3 pipelines + blending code | 1 engine, 1 YAML config |
| Lines of code | ~800 (CLIP pipeline + ALS + TF-IDF + FAISS + blending) | ~60 (YAML config + query) |
| Bottom line | Three teams maintaining three systems with brittle glue code | One config file, one query, ship in a day |
If you’re coming from a traditional recommendation stack and want to understand how Shaped’s query layer maps to the four-stage retrieval architecture, the Anatomy of Modern Ranking Architectures series covers this in depth.
FAQ
Q: Why use CLIP instead of training a custom vision model?
A: CLIP was trained on 400 million image-text pairs — it already understands visual concepts at a level that would take months to replicate from scratch. Shaped supports any model from HuggingFace’s CLIP model library in the `index.embeddings` config. Start with `openai/clip-vit-base-patch32` for speed, or swap in `openai/clip-vit-large-patch14` for higher quality. For domain-specific tuning, layer in a Two-Tower or BeeFormer model that incorporates image features alongside interaction data.
Q: What if my images don’t have public URLs?
A: The image_url field needs publicly accessible URLs for both CLIP embedding computation and AI View enrichment. If your images are in private storage (S3, GCS), use an SQL View to transform references into pre-signed URLs before they reach the engine.
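For illustration, here's the general shape of an expiring signed URL (a generic HMAC scheme with a hypothetical signing key and domain; in production you'd call your storage SDK's presign method, e.g. boto3's generate_presigned_url for S3, which implements the real SigV4 scheme):

```python
# Illustrative only: a generic HMAC-signed, expiring URL showing the shape of
# what the SQL View would emit in place of a private storage reference.
import hashlib
import hmac
import time
from urllib.parse import urlencode

SECRET = b"replace-me"  # hypothetical signing key

def signed_url(base: str, path: str, ttl_s: int = 3600, now: float = None) -> str:
    expires = int((now if now is not None else time.time()) + ttl_s)
    sig = hmac.new(SECRET, f"{path}:{expires}".encode(), hashlib.sha256).hexdigest()
    return f"{base}{path}?{urlencode({'expires': expires, 'sig': sig})}"

print(signed_url("https://img.example.com", "/pins/pin_8829.jpg"))
```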
Q: How does the AI View handle millions of images?
A: AI Views are materialized — the vision-language model processes each image once and stores the result. New images are enriched incrementally as they arrive. You never reprocess the catalog. See the multi-modal enrichment docs for configuration details.
Q: Why Two-Tower instead of ELSA for the collaborative signal?
A: ELSA and ALS are pure collaborative filtering models — they learn only from interaction co-occurrence. Two-Tower incorporates item features (including image type, board category, and text) into the item embedding, which means even pins with sparse interaction history get useful collaborative embeddings. If your platform has minimal item metadata and dense interactions, ELSA is simpler and faster. If you have rich item features (images, categories, descriptions), Two-Tower is the better choice. We compare these tradeoffs in our Discover Weekly article.
Q: Can I use this for e-commerce product images, not just Pinterest?
A: Yes — the architecture works for any image-heavy catalog: fashion (outfit similarity), real estate (property photos), interior design (room matching), food delivery (dish similarity), and marketplace listings. Replace pins with products, adjust the AI View prompt (e.g., “describe the garment’s cut, fabric, pattern, and occasion”), and the engine config stays the same.
Q: How do I tune the blend weights?
A: Start with 50/30/20 (visual/collaborative/content). Measure save rate, click-through rate, and session depth. Change weights in ShapedQL — no retraining, no redeployment. You can run multiple saved queries with different blends simultaneously to A/B test.
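One simple way to run that A/B test (a sketch; the variant names and 50/50 split are invented): hash each user_id into a stable bucket and route to the matching saved query.

```python
# Deterministic A/B assignment: same user always lands in the same bucket.
import hashlib

VARIANTS = {  # hypothetical blend variants to compare
    "control":     {"clip": 0.5, "twotower": 0.3, "content": 0.2},
    "more_visual": {"clip": 0.6, "twotower": 0.2, "content": 0.2},
}

def assign_variant(user_id: str) -> str:
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return "more_visual" if bucket < 50 else "control"

print(assign_variant("user_445"), assign_variant("user_445"))  # stable across sessions
```

Hashing (rather than random assignment) keeps each user's experience consistent across sessions without storing any assignment state.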
Q: Can users search “boho bedroom” and get visually coherent results?
A: Yes. Add `lexical_search` to the engine config for keyword matching, and use `text_search(mode='vector', text_embedding_ref='content_embedding')` for semantic search on AI-enriched descriptions. You can combine text retrieval with CLIP scoring in the same query — useful for search results that look visually coherent, not just keyword-matched. See the search guide for patterns.
Q: Can I blend more than 3 signals?
A: ShapedQL supports arbitrary scoring expressions. Add popularity, recency, user affinity, or any custom feature:
```sql
ORDER BY score(
  expression='
    0.35 / (1.0 + rank(embedding="clip_embedding"))
    + 0.25 / (1.0 + rank(embedding="twotower_embedding"))
    + 0.15 / (1.0 + rank(embedding="content_embedding"))
    + 0.1 / (1.0 + item._derived_popular_rank)
    + 0.1 / (1.0 + days_since_release)
    + 0.05 * user_category_affinity
  '
)
```
Q: What about Pinterest’s actual implementation?
A: Pinterest uses PinSage — a graph neural network trained on the pin-board bipartite graph — alongside visual embeddings, text signals, and real-time engagement features. More recently, they’ve moved toward generative retrieval models like PinRec (we wrote a full teardown of the architecture). Their system reflects a decade of infrastructure investment. The principles are identical: multimodal signals, blended scoring, adaptive cold-start handling. Shaped lets you build a production-grade version of this architecture without the multi-year infrastructure build.
Conclusion
Remember the mid-century modern living room from the top of this article? A user saves it. Thirty seconds later, they’re scrolling through related pins — walnut furniture, Scandinavian-MCM fusion spaces, a vintage furniture store, a DIY refinishing guide. Some of those pins were uploaded this morning with zero saves. One had no title at all.
That’s not magic. It’s three signals doing what none of them can do alone.
CLIP saw that the images share composition, color palette, and visual style — and gave every new pin a visual fingerprint the moment it was uploaded. The AI View looked at an untitled image and wrote “mid-century modern living room, walnut credenza, mustard yellow accent chair, brass pendant lighting, warm, curated, 1960s influence” — turning empty metadata into searchable features. Two-Tower learned that users who save MCM living rooms also save furniture shopping pins and restoration guides — intent signals that no visual model could capture.
ShapedQL blended them: 50% visual for the core aesthetic match, 30% collaborative to surface what users like this one actually want, 20% semantic to catch the style connections that bridge living rooms and bedrooms and furniture stores. For the pin uploaded this morning, the blend shifted automatically: 60% visual, 40% semantic, no collaborative signal needed.
The traditional approach to building this requires a self-hosted CLIP inference pipeline, a FAISS index you rebuild nightly, an ALS model you retrain weekly, a TF-IDF index that’s useless when descriptions are empty, and a blending function hardcoded in application code. Three teams maintaining three systems with brittle glue code.
Shaped collapses all of it: one engine, one YAML config, one query. CLIP embeddings managed for you. AI Views that turn every image into searchable features automatically. Two-Tower trained on your interaction data. ShapedQL to blend everything at query time, no redeployment needed.
Ready to build your own Related Pins? Sign up for Shaped and get $100 in free credits. Visit console.shaped.ai/register to get started.
Want us to walk you through it?
Book a 30-min session with an engineer who can apply this to your specific stack.