How to Clean Catalogs for Search and Recsys (Without an ETL Team)

Walkthrough: Postgres → ClickHouse SQL view → AI Enrichment view → Shaped engine config → ShapedQL. Written for someone who will actually wire it up.

Someone searches “minimalist white cotton tee, office-ready” and the engine returns rows that shouldn’t be there: out of stock, style still empty, a tee miscategorized so filters miss it, the same item twice under different supplier SKUs.

That’s a catalog problem, not an embedding problem.

Cleaning before ingest usually loses to the roadmap—“we’ll finish it in dbt” keeps slipping while search is already live. Here’s a concrete path in Shaped: SQL views on ClickHouse for dedupe, joins, casts, and normalization, then AI Enrichment views where you add columns in YAML and spell out what you want in the prompt. Both stay in sync as upstream rows change.

A Shaped engine reads the enriched table like anything else—lexical and vector retrieval, filters on those columns, and ORDER BY score(...).

Quick answer: two view types

  1. SQL view — ClickHouse against synced tables: dedupe, casts, joins, normalization.
  2. AI Enrichment view — One LLM call per source row (batched/cached by the platform). You list enriched_output_columns and describe each field in prompt; the model fills those fields from source_columns (e.g. name, description, category). There is no hidden ontology—you specify the schema and allowed values, same as you would in any structured-extraction task.

Both refresh when upstream rows change. The engine’s config points at the view name; it does not care whether a column came from SQL or an LLM.

Why catalog shape matters for retrieval

Retrieval is mostly “score these rows under these filters.” Duplicates waste top slots. Bad categories break facet filters. Strings in numeric columns break sort. Attributes buried only in prose never become WHERE clauses or typed features.
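The “strings break sort” failure is easy to demonstrate. A minimal Python sketch (illustrative values, not real catalog data) shows why a price column stored as text sorts in the wrong order until you cast it:

```python
# A price column stored as text sorts lexicographically: "$9.99" outranks
# "$120.00" in a descending "price" sort until the value is cast to a float.
raw_prices = ["$9.99", "$49.99", "$120.00"]

lexical = sorted(raw_prices, reverse=True)  # string comparison: wrong order
numeric = sorted(raw_prices, key=lambda p: float(p.lstrip("$")), reverse=True)

print(lexical)  # ['$9.99', '$49.99', '$120.00']
print(numeric)  # ['$120.00', '$49.99', '$9.99']
```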

Knock-on effect (one sentence): if filters are noisy, your logged training data for reranking is noisy too.

Cleaning belongs before the engine ingests the item table. The only design choice is where you run that transform.

How to build this yourself

Below is the same stack most teams end up with if they don’t buy a packaged transform layer—just laid out so you can estimate the work.

Path A — SQL transforms (e.g. dbt)

Build: Models that normalize keys, cast types, dedupe, join reference data.

Gets expensive when: rules explode (supplier-specific quirks), you need something that is not in structured columns (extract from free text), and tests pass but production still drifts when vendors change formats.

Path B — Nightly LLM worker

Build: Job that calls an API per row, writes enriched columns back to your warehouse, handles backlog and retries.

Gets expensive when: You are operating backoff, dead letters, cache invalidation on text changes, and cost/latency monitors—i.e. a small service around “what should have been a table column.”

Path C — Layered materializations

Build: Separate materializations per stage, each with its own job schedule.

Gets expensive when: Relevance regresses and you’re diffing four tables to see which layer went stale.

None of these are wrong—they’re what teams do when the item table has to be good enough for production retrieval.

How Shaped fits

Shaped runs the same two primitives against data you connect (Postgres, BigQuery, etc.): SQL views on ClickHouse, and AI Enrichment views with managed batching/retries. Schema changes trigger re-materialization; streaming connectors are on the order of tens of seconds for fresh rows.

What you still own: the SQL and the prompt. What you do not have to host: a separate enrichment worker and a second persistence layer unless you want one.

Diagram: Shaped data layer — connectors, tables, SQL and AI enrichment views, and engine

SQL Views: Joins, types, rules

SQL views run ClickHouse SQL against your tables. They support joins, aggregations, casts, regex, and the full ClickHouse function set. You get two flavors:

  1. MATERIALIZED_VIEW (default). Query runs and the result is stored as a physical table. Updates automatically when source tables change. Use this for anything an engine reads frequently.
  2. VIEW (on-demand). Query runs every time the view is read. Always fresh, no storage. Use for low-traffic or freshness-critical reads.

Set the flavor with sql_view_type on the API payload.

AI Enrichment Views: The LLM Half

AI Enrichment views run an LLM over each row of a source table and write the output to new columns. How does the model know style vs material? It does not infer your taxonomy by magic—you list names in enriched_output_columns and define meaning, allowed values, and fallbacks in prompt. The runtime parses the response into those columns (with retries/validation on Shaped’s side). If you want a cross-cutting facet like “ergonomic,” use a string column such as tags and ask for comma-separated labels instead of a one-off boolean that only makes sense for one vertical.

The contract:

# views/<name>.yaml — AI enrichment view shape (CLI / console upload)
name: <view_name>
transform_type: AI_ENRICHMENT
source_dataset: <table_or_view_name>
source_columns:
  - <columns_to_pass_to_the_llm>
source_columns_in_output:
  - <columns_to_copy_through_unchanged>
enriched_output_columns:
  - <new_columns_the_llm_will_fill>
prompt: |
  <instructions for the model>

Each row in the source dataset becomes one LLM call. The response is parsed into enriched_output_columns and written to a new table named after the view. Source rows that change get re-enriched automatically.

Cost is per row at ingest, not per retrieval call. Rough order of magnitude for a small chat model: single-digit dollars per million rows for a one-time backfill, then tiny incremental cost as new rows land—versus paying on every user query if the LLM sits in the request path.

When to Use SQL vs AI Enrichment

When in doubt, prefer SQL whenever the transformation is rule-based—joins, typing, lookups, branching on known values. Reach for AI Enrichment when the signal lives in unstructured text (or images) and belongs in typed columns at ingest instead of at query time.

| Problem | Use SQL view | Use AI Enrichment view |
|---|---|---|
| Normalize a price column (strings to floats, currency conversion) | CAST, multiIf, conversions between units/currencies | |
| Join inventory and reviews into a single item row | LEFT JOIN across tables | |
| Dedup by canonical key (e.g. SKU prefix) | GROUP BY + argMax(updated_at) | |
| Translate shipping_tier: 4 into a customer-facing summary | If the mapping is fixed and small: multiIf | If the mapping is large or you want it indexed as text |
| Extract style, material, fit from a free-text description | | Prompt the LLM with the description; emit as new columns |
| Cross-cutting facets (ergonomic, organic, …) that don’t deserve a dedicated boolean per vertical | | Emit a tags string (or normalize further in a downstream SQL view) |
| Generate semantic tags (mood, theme, vibe) for a movie or playlist | | Prompt with title + genres; emit a tag list |
| Caption a product image | | Multimodal prompt against an image_url column |
| Filter to “active” SKUs only | WHERE inventory_count > 0 | |

Building a Cleaned Catalog: 5 Steps

The scenario: a fashion marketplace where the catalog comes from suppliers. Inconsistent categories. Duplicate SKUs. Prices in mixed currencies. Free-text descriptions. The style field exists on maybe 40% of rows.

We’ll build, in order:

  1. The raw tables.
  2. A SQL view that normalizes and dedupes.
  3. An AI Enrichment view that fills in missing structured attributes.
  4. An engine that reads the enriched view.
  5. A ShapedQL query that exercises the cleaned data.
Diagram: raw tables → SQL view → AI enrichment view → Shaped engine

Step 1: Connect Raw Tables

Provision tables with POST /v2/tables (same shape as shaped create-table --file). See Data unification for how connectors fit into the pipeline. For Postgres the connector discovers schema from your database—optional columns narrows what gets synced (API reference).

# tables/products_raw.yaml
name: products_raw
schema_type: POSTGRES
host: your-db-host.example.com
port: 5432
user: your_user
password: your_password
database: your_database
table: products
replication_key: updated_at
columns:
  - product_id
  - sku_prefix
  - name
  - description
  - category
  - price # stored as text like "$49.99" upstream — normalize in the SQL view
  - currency
  - image_url
  - updated_at
shaped create-table --file tables/products_raw.yaml

Same idea for inventory (product_id, inventory_count, updated_at) and currency_rates (currency, usd_rate, updated_at): separate table configs with schema_type: POSTGRES (or another batch connector). Once connected, rows land within ~30 seconds for streaming connectors or ~15 minutes for batch ones.

Step 2: SQL View — Normalize, Dedup, Convert Currency

This is the SQL normalization pass: dedupe by SKU prefix, coerce category strings to snake_case, convert prices to USD, drop rows missing essential fields.

# views/products_clean.yaml
name: products_clean
view_type: SQL
sql_view_type: MATERIALIZED_VIEW
sql_query: |
  WITH latest AS (
    SELECT
      argMax(product_id,  updated_at) AS item_id,
      sku_prefix,
      argMax(name,        updated_at) AS name,
      argMax(description, updated_at) AS description,
      argMax(category,    updated_at) AS category_raw,
      argMax(price,       updated_at) AS price_raw,
      argMax(currency,    updated_at) AS currency,
      argMax(image_url,   updated_at) AS image_url,
      max(updated_at)                 AS updated_at
    FROM products_raw
    GROUP BY sku_prefix
  )
  SELECT
    l.item_id,
    l.sku_prefix,
    l.name,
    l.description,
    lowerUTF8(replaceRegexpAll(l.category_raw, '[^a-zA-Z0-9]+', '_')) AS category,
    toFloat64OrZero(replaceAll(l.price_raw, '$', ''))
      * coalesce(c.usd_rate, 1.0)                                     AS price_usd,
    l.image_url,
    coalesce(i.inventory_count, 0) AS inventory_count,
    l.updated_at
  FROM latest l
  LEFT JOIN inventory       i ON l.item_id = i.product_id
  LEFT JOIN currency_rates  c ON l.currency = c.currency
  WHERE l.name != '' AND l.description != ''
shaped create-view --file views/products_clean.yaml

What’s happening:

  • argMax(<col>, updated_at) per sku_prefix keeps the most recently updated value of each column—cheap dedup without hand-written merge rules per field.
  • replaceRegexpAll + lowerUTF8 collapse supplier category strings into one token bucket per leaf, e.g. "MENS TOPS" and "Mens Tops" → mens_tops (same idea as shoe categories collapsing to mens_shoes).
  • toFloat64OrZero coerces "$49.99" to a float, returning 0 on bad rows instead of failing the whole materialization.
  • The LEFT JOINs pull inventory and current FX rates without writing to upstream.
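If you want to preview how those two transforms behave before materializing the view, the same logic is easy to mirror in Python. This is a local approximation of the ClickHouse expressions above, not the functions themselves:

```python
import re

def normalize_category(raw: str) -> str:
    # Mirrors lowerUTF8(replaceRegexpAll(..., '[^a-zA-Z0-9]+', '_')):
    # collapse every run of non-alphanumeric characters into one underscore.
    return re.sub(r"[^a-zA-Z0-9]+", "_", raw).lower()

def to_float_or_zero(raw: str) -> float:
    # Mirrors toFloat64OrZero(replaceAll(..., '$', '')): bad rows become 0
    # instead of failing the whole materialization.
    try:
        return float(raw.replace("$", ""))
    except ValueError:
        return 0.0

print(normalize_category("MENS TOPS"))    # mens_tops
print(normalize_category("Mens   Tops")) # mens_tops
print(to_float_or_zero("$49.99"))         # 49.99
print(to_float_or_zero("N/A"))            # 0.0
```

Note that apostrophes also become underscores ("MEN'S TEES" → men_s_tees), so variants with and without punctuation land in different buckets; the sanity-check query in the next step is where you catch that.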

Sanity-check the view before training an engine—run dataset SQL against data.<table> with POST /v2/query (same SQL goes in shaped query / the console when supported). Step 5 uses ShapedQL (retrieve, text_search, …) against an engine; here we’re only inspecting the materialized view:

SELECT
  category,
  count() AS n
FROM data.products_clean
GROUP BY category
ORDER BY n DESC
LIMIT 20

If the categories aren’t collapsing the way you expect, catch it here—before those values flow into filters and training logs downstream.
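One way to automate that check: take the (category, count) rows the query returns and flag buckets that differ only in punctuation or underscores. The helper below is a hypothetical client-side sketch, not part of Shaped:

```python
from collections import defaultdict

def near_duplicate_buckets(rows):
    """Group category buckets that collapse to the same alphanumeric key —
    a cheap way to spot values the SQL normalization failed to merge."""
    groups = defaultdict(list)
    for category, n in rows:
        key = "".join(ch for ch in category if ch.isalnum())
        groups[key].append((category, n))
    return {k: v for k, v in groups.items() if len(v) > 1}

rows = [("mens_tops", 1200), ("mens__tops", 40), ("womens_dresses", 800)]
print(near_duplicate_buckets(rows))
# {'menstops': [('mens_tops', 1200), ('mens__tops', 40)]}
```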

Step 3: AI Enrichment View — Fill In the Structured Attributes

The cleaned table still has gaps. style, material, fit, use_case, season are inconsistent or missing. We’ll add a tags string for cross-cutting facets (e.g. ergonomic) so we are not hard-coding a boolean that only fits one product type. We also emit semantic_summary for embeddings.

Each name in enriched_output_columns must appear in the prompt with rules; the ingest path parses the model output into those columns—there is no side channel.

# views/products_enriched.yaml
name: products_enriched
transform_type: AI_ENRICHMENT
source_dataset: products_clean
source_columns:
  - name
  - description
  - category
source_columns_in_output:
  - item_id
  - sku_prefix
  - name
  - category
  - price_usd
  - image_url
  - inventory_count
enriched_output_columns:
  - style
  - material
  - fit
  - use_case
  - season
  - tags
  - semantic_summary
prompt: |
  You are a fashion catalog analyst. Using the product name, description, and normalized category, produce values for:

  - style: one of streetwear, classic, athleisure, formal, workwear, vintage, minimalist, y2k, boho. Use unknown if you can't tell.
  - material: primary material lowercase (cotton, denim, leather, polyester, wool). unknown if not mentioned.
  - fit: one of slim, regular, relaxed, oversized, tailored, unknown.
  - use_case: one of casual, office, formal, athletic, outdoor, loungewear.
  - season: one of spring, summer, fall, winter, all_season.
  - tags: comma-separated lowercase tokens for cross-cutting attributes that are explicitly supported by the text—e.g. ergonomic, organic, wrinkle_free. Use an empty string if none apply. Do not invent tags.
  - semantic_summary: 1-2 factual sentences describing what the product is and what makes it distinctive. No marketing language. No exclamation points.

  If a field cannot be inferred from the inputs, use unknown (or all_season for season) rather than guessing.
shaped create-view --file views/products_enriched.yaml

What we are doing with the prompt:

  • Fixed vocabularies for facet columns (style, use_case, …).
  • Explicit “don’t guess” fallbacks.
  • tags for awkward booleans—works the same for desk chairs (ergonomic) and apparel if the copy supports it.

After backfill, products_enriched has the pass-through columns plus seven LLM columns, all queryable on data.products_enriched like the SQL fields.
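Because the prompt fixes the vocabularies, you can spot-check enriched rows client-side: any value outside the allowed set signals prompt drift. The vocab sets below mirror the prompt above; the validation helper itself is ours, not a Shaped feature:

```python
# Allowed values copied from the prompt; extend with the other columns as needed.
ALLOWED = {
    "style": {"streetwear", "classic", "athleisure", "formal", "workwear",
              "vintage", "minimalist", "y2k", "boho", "unknown"},
    "fit": {"slim", "regular", "relaxed", "oversized", "tailored", "unknown"},
    "season": {"spring", "summer", "fall", "winter", "all_season"},
}

def out_of_vocab(row: dict) -> list[str]:
    """Return the enriched columns whose value falls outside the prompt's vocabulary."""
    return [col for col, vocab in ALLOWED.items() if row.get(col) not in vocab]

print(out_of_vocab({"style": "minimalist", "fit": "slim", "season": "summer"}))   # []
print(out_of_vocab({"style": "sporty-chic", "fit": "slim", "season": "summer"}))  # ['style']
```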

Step 4: Engine Config That Reads the Enriched View

EngineConfigV2 wires data (which tables/views the engine ingests) and index (lexical + vector encoders). Embeddings use hugging_face with model_name and item_fields. schema_override sets column types (Text, Numerical, TextCategory, …).

# engines/fashion_search.yaml
version: v2
name: fashion_search
description: Semantic search over enriched catalog rows

data:
  item_table:
    type: table
    name: products_enriched
  schedule: "@hourly"
  schema_override:
    item:
      id: item_id
      features:
        - name: sku_prefix
          type: Text
        - name: name
          type: Text
        - name: description
          type: Text
        - name: category
          type: TextCategory
        - name: price_usd
          type: Numerical
        - name: image_url
          type: Url
        - name: inventory_count
          type: Numerical
        - name: style
          type: TextCategory
        - name: material
          type: TextCategory
        - name: fit
          type: TextCategory
        - name: use_case
          type: TextCategory
        - name: season
          type: TextCategory
        - name: tags
          type: Text
        - name: semantic_summary
          type: Text

index:
  embeddings:
    - name: catalog_text_embedding
      encoder:
        type: hugging_face
        model_name: sentence-transformers/all-MiniLM-L6-v2
        item_fields:
          - name
          - semantic_summary
          - style
          - material
          - tags
  lexical_search:
    tokenizer:
      type: stemmer
    item_fields:
      - name
      - category
      - semantic_summary
      - tags
shaped create-engine --file engines/fashion_search.yaml

Shaped backfills the embedding index against the enriched view. When products_raw later changes (a supplier pushes new SKUs), the dedup re-runs in products_clean, the LLM enrichment re-runs in products_enriched, and the embedding index updates. You don’t touch the engine YAML for every upstream tweak.

Step 5: Query — Semantics From the Search Box, Rules From the App

text_search binds the query string to the embedding you configured (catalog_text_embedding here), i.e. it drives retrieval over name, semantic_summary, style, material, tags without parsing the query into facets at request time.

WHERE is for constraints your application already knows—inventory policy, a price cap from a slider, SKU allowlists, locale—not for guessing structured facets from free text (that would be a separate NLU layer outside Shaped, which we’re not covering here).

-- Natural language drives retrieval; max_price + in-stock come from UI / API params.
SELECT * FROM retrieve(
  text_search(
    text_embedding_ref = 'catalog_text_embedding',
    input_text_query   = $query_text,
    limit              = 200
  )
)
WHERE inventory_count > 0
  AND price_usd <= $max_price
ORDER BY score(expression = 'similarity_score') DESC
LIMIT 20

Why enrichment still matters for this query:

  1. semantic_summary (and related columns in the embedding) give the vector index something stable to latch onto—otherwise “office-ready tee” might weak-match messy supplier copy alone.
  2. inventory_count and price_usd are real columns from your SQL view—hard filters evaluate as plain predicates with no LLM in the query path.

If you do filter on style, use_case, or substring-match tags (e.g. ergonomic), treat those as explicit inputs from the UI—bind parameters; do not try to infer them from $query_text inside this query unless you own a separate NLU step.

From the application:

# rank.py
import os
import httpx

API_KEY = os.environ["SHAPED_API_KEY"]

def search(query_text: str, max_price: float = 80.0):
    response = httpx.post(
        "https://api.shaped.ai/v2/engines/fashion_search/query",
        headers={"x-api-key": API_KEY},
        json={
            "query": (
                "SELECT * FROM retrieve(text_search("
                "text_embedding_ref='catalog_text_embedding', "
                "input_text_query=$query_text, limit=200)) "
                "WHERE inventory_count > 0 "
                "AND price_usd <= $max_price "
                "ORDER BY score(expression='similarity_score') DESC "
                "LIMIT 20"
            ),
            "parameters": {"query_text": query_text, "max_price": max_price},
        },
        timeout=2.0,
    )
    response.raise_for_status()
    return response.json()["results"]

That’s the full pipeline. YAML for tables, views, and engine; one ShapedQL query; Python shows how to call the query API.

Beyond Fashion: Three More Use Cases

Same pattern, different verticals:

Use Case 1: Translating Status Codes for Agent Retrieval

A customer support agent reads from your orders table. The table stores shipping_tier: 4, order_status: 7, priority_level: 2. The agent retrieves the row and answers, “Your order has shipping tier 4 and status 7.” Useless.

The fix is an AI Enrichment view that materializes a customer-facing summary alongside the raw codes. The agent retrieves the summary; the codes stay queryable for joins. Full walkthrough is in The Status Code Problem, but the shape:

# views/orders_with_summary.yaml
name: orders_with_summary
transform_type: AI_ENRICHMENT
source_dataset: orders
source_columns:
  - shipping_tier
  - order_status
  - priority_level
  - created_at
source_columns_in_output:
  - order_id
  - user_id
  - shipping_tier
  - order_status
enriched_output_columns:
  - ai_order_summary
prompt: >
  Translate the shipping tier, order status, and priority level into a 2-3 sentence customer-facing summary.
  Shipping tiers: 1=Economy Ground, 2=Standard, 3=Expedited, 4=Express (2 days, FedEx), 5=Overnight, 6=Same-Day.
  Order statuses: 1=processing, 2=payment confirmed, 3=picking, 4=packing, 5=ready to ship, 6=label created, 7=in transit, 8=out for delivery, 9=delivered, 10=exception.
  Priority: 0=standard, 1=low, 2=medium, 3=high, 4=urgent, 5=critical.
  Write plain conversational prose suitable for a customer—not a bullet list or machine-readable wrapper.

The agent indexes ai_order_summary and answers in sentences instead of codes.
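Worth noting the alternative the comparison table recommends when a mapping is small and fixed: a deterministic lookup (multiIf in ClickHouse, a dict in application code) costs nothing and never hallucinates. A minimal sketch, using the tier and status tables from the prompt above:

```python
# Deterministic code → label translation; the LLM version earns its keep only
# when you want varied, conversational prose across many code combinations.
SHIPPING_TIERS = {1: "Economy Ground", 2: "Standard", 3: "Expedited",
                  4: "Express (2 days, FedEx)", 5: "Overnight", 6: "Same-Day"}
ORDER_STATUSES = {7: "in transit", 8: "out for delivery", 9: "delivered"}

def describe(order: dict) -> str:
    tier = SHIPPING_TIERS.get(order["shipping_tier"], "unknown tier")
    status = ORDER_STATUSES.get(order["order_status"], "being processed")
    return f"Your order is {status} via {tier} shipping."

print(describe({"shipping_tier": 4, "order_status": 7}))
# Your order is in transit via Express (2 days, FedEx) shipping.
```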

Use Case 2: Tag-shaped facets instead of one-off booleans

Someone filters for “ergonomic” across office chairs, standing desks, and a few apparel SKUs that call it out in copy. A dedicated is_ergonomic column does not generalize; a tags string (or JSON array if you prefer) does.

Materialize it once:

# views/products_attributes.yaml
name: products_attributes
transform_type: AI_ENRICHMENT
source_dataset: products_clean
source_columns:
  - name
  - description
  - category
source_columns_in_output:
  - item_id
  - name
  - category
  - price_usd
enriched_output_columns:
  - material
  - primary_color
  - tags
  - use_case
  - style_tag
prompt: >
  From the product name, description, and category, infer: material (lowercase primary material, or unknown),
  primary_color (dominant color lowercase, or unknown), tags (comma-separated lowercase tokens explicitly supported
  by the text—e.g. ergonomic, adjustable, compact; empty string if none), use_case (one of office, dining, living_room,
  outdoor, bedroom, commercial), and style_tag (one of modern, traditional, industrial, mid_century, rustic, minimalist).

Then retrieval can use WHERE tags ILIKE '%ergonomic%' (or a token index if you normalize further in SQL) as a hard filter before an agent sees the shortlist. Our post on building reliable agents expands on pushing structure upstream instead of patching it solely in prompts.
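If you do take the "normalize further" route, the win over substring matching is exactness: ILIKE '%ergonomic%' also matches a hypothetical non-ergonomic tag, while a tokenized comparison does not. A sketch of the application-side equivalent:

```python
# Token-exact tag matching: splits the comma-separated tags column the
# enrichment view emits, so 'ergonomic' never matches 'non-ergonomic'.
def tag_tokens(tags: str) -> set[str]:
    return {t.strip().lower() for t in tags.split(",") if t.strip()}

row_tags = "Ergonomic, adjustable , compact"
print("ergonomic" in tag_tokens(row_tags))      # True
print("non-ergonomic" in tag_tokens(row_tags))  # False
```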

Use Case 3: Captioning Product Images

If your catalog has rich images and thin text, multimodal enrichment fills in the gap. Pass an image_url column in source_columns and the view routes through a multimodal model:

# views/products_image_captions.yaml
name: products_image_captions
transform_type: AI_ENRICHMENT
source_dataset: products_clean
source_columns:
  - name
  - image_url
source_columns_in_output:
  - item_id
  - name
enriched_output_columns:
  - image_caption
  - visual_style
prompt: >
  Analyze the product image together with the product name. Produce image_caption: 1-2 factual sentences covering color,
  materials, and key visual features a shopper would notice. Produce visual_style as one of modern, traditional, industrial,
  mid_century, rustic, minimalist. Emit only those two fields.

If your image column is named something other than image_url, drop a five-line SQL view in front of it that renames the column. That’s the multimodal contract.

Comparison: Traditional ETL vs Shaped Views

| Component | Traditional stack | Shaped views |
|---|---|---|
| SQL transforms | dbt project + Airflow | SQL view (single YAML) |
| LLM enrichment | Custom Python service + Redis cache | AI Enrichment view (single YAML) |
| Source-to-search lag | Hours (nightly job) | Seconds (streaming) |
| Materialization | Manual writes back to source DB | Automatic, materialized in Shaped |
| Schema changes | Update dbt + redeploy service | Update view definition; re-backfill automatically |
| Failure modes | Drift between layers, silent staleness | Single layer; lineage from source → views → engine |
| Code to maintain | ~1,000 lines (dbt + Python) | ~50 lines (YAML) |
| Operational cost | dbt Cloud + worker + cache (varies) | Bundled with the Shaped plan you run |
| Time to first working path | Often multi-sprint | Depends on connectors; often days vs. months |

Closing thoughts

If the item table is wrong, every retrieval stack you put on top will look worse than it should. SQL views and AI Enrichment views are one way to keep transforms next to the data the engine ingests—SQL-first for joins and typing, LLM enrichment for what doesn’t compress into rules—without standing up a separate enrichment service.

Try it: console.shaped.ai/register — the walkthrough above is sized to run on sandbox credit.

See our Docs

Want us to walk you through it?

Book a 30-minute slot with an engineer if you want a second pair of eyes on your schema.

Book a demo →
