Beyond Keywords: The Power of Understanding Language for Relevance
In modern search and recommendation systems, simply matching keywords or relying on interaction history isn't enough. The rich, unstructured language embedded in your platform – product titles, detailed descriptions, article content, user reviews, search queries – holds the key to deeper relevance. Understanding this language allows systems to grasp:
- True Content Meaning: What is this product really about, beyond its category tags?
- Semantic Similarity: Are these two items conceptually related, even if described differently or lacking shared interaction history?
- User Intent: What does a user actually mean when they type a complex search query?
- Latent Preferences: Can we infer user interests from the language they use or consume?
Transforming this raw text into meaningful signals, or features, that machine learning models can utilize is a critical, yet challenging, aspect of feature engineering. Get it right, and relevance skyrockets. Neglect it, and you miss crucial context. The standard path to engineering language features involves diving deep into the complex and resource-intensive world of Natural Language Processing (NLP).
The Standard Approach: Building Your Own Language Understanding Pipeline
Leveraging language requires turning unstructured text into structured numerical representations (embeddings) that capture semantic meaning. Doing this yourself typically involves a multi-stage, expert-driven process:
Step 1: Gathering and Preprocessing Text Data
- Collection: Aggregate text from diverse sources – product catalogs, content management systems, user-generated content databases, search logs.
- Cleaning: This is often 80% of the work. Handle messy HTML, remove special characters, standardize encoding, potentially translate content across languages, and deal with inconsistent formatting across sources (short titles vs. long articles vs. JSON blobs); a minimal cleaning sketch follows this step.
- Normalization: Tokenize text (breaking into words/sub-words), handle casing, potentially apply stemming or lemmatization (though less critical for modern transformer models).
- Pipelines: Build and maintain robust data pipelines to automate this ingestion and cleaning process reliably.
The Challenge: Text data is inherently noisy and varied. Building robust cleaning and preprocessing pipelines requires significant data engineering effort and domain knowledge.
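To make the cleaning work concrete, here is a minimal sketch in Python of the kind of normalization function such a pipeline might apply; the input example and rules are illustrative, and a real pipeline would add language detection, deduplication, and source-specific handling:

```python
import html
import re
import unicodedata

def clean_text(raw: str) -> str:
    """Normalize one raw text field before embedding."""
    text = html.unescape(raw)                   # decode HTML entities (&amp; -> &)
    text = re.sub(r"<[^>]+>", " ", text)        # strip HTML tags
    text = unicodedata.normalize("NFKC", text)  # standardize Unicode encoding
    return re.sub(r"\s+", " ", text).strip()    # collapse whitespace

# Illustrative usage on a messy product description:
raw = "<p>Wireless&nbsp;Headphones &amp; Case</p>\n  <b>50% off!</b>"
print(clean_text(raw))  # -> Wireless Headphones & Case 50% off!
```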
Step 2: Choosing the Right Language Model Architecture
Selecting the appropriate NLP model to generate embeddings is crucial and requires navigating a vast, fast-moving landscape.
- The Ecosystem (Hugging Face Hub): Hugging Face offers thousands of pre-trained models, serving as a common starting point. The choice depends heavily on the specific task and data.
- Sentence Transformers (e.g., SBERT): Optimized for generating sentence/paragraph embeddings where semantic similarity (typically measured via cosine similarity) is key. Great for finding similar descriptions or documents (a short encoding sketch follows this step). Examples: all-MiniLM-L6-v2, distiluse-base-multilingual-cased-v2 (for multilingual needs).
- Full Transformer Models (BERT Variants): Deeper contextual understanding (e.g., RoBERTa, DeBERTa). Often require more compute but offer high performance, especially after fine-tuning.
- Search-Specific Models (Asymmetric): Models like DPR or ColBERT are designed for search where short queries need to match long documents, often outperforming standard symmetric embedding models.
- Multimodal Models (e.g., CLIP): Models like openai/clip-vit-base-patch32 or Jina AI variants can embed both text and images into a shared space, enabling cross-modal search (text-to-image, image-to-text).
- Large Language Models (LLMs): While incredibly powerful, using massive LLMs to generate embeddings for every item in a real-time relevance system can be computationally prohibitive. Today their role is usually focused on complex query understanding, data generation, or zero-shot tasks.
The Challenge: Requires deep NLP expertise to select the appropriate architecture and pre-trained checkpoint based on data modality (text, image, both), language, task (similarity vs. search), and computational budget.
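As a concrete starting point, the sketch below uses the sentence-transformers library with the all-MiniLM-L6-v2 checkpoint mentioned above; the item texts are invented for illustration:

```python
from sentence_transformers import SentenceTransformer, util

# A symmetric sentence-embedding model; swap in a multilingual,
# asymmetric, or multimodal checkpoint depending on the task.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

descriptions = [
    "Noise-cancelling over-ear wireless headphones",
    "Bluetooth headset with active noise reduction",
    "Stainless steel kitchen knife set",
]
embeddings = model.encode(descriptions, normalize_embeddings=True)

# Conceptually related items score high even with few shared keywords.
print(util.cos_sim(embeddings[0], embeddings[1:]))
```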
Step 3: Fine-tuning Models for Your Task and Data
Pre-trained models rarely achieve peak performance out-of-the-box. Fine-tuning adapts them to your specific data and business objectives.
- Domain Adaptation: Further pre-train a model on your own large text corpus (e.g., all product descriptions) to help it learn your specific vocabulary and style.
- Ranking Fine-tuning (Search/Rec): Train the model using labeled data (e.g., query-document pairs with relevance scores) to directly optimize ranking metrics like NDCG. This is complex, requiring specialized loss functions and training setups (a minimal example follows this step).
- Personalization Fine-tuning: Train models (e.g., Two-Tower architectures) where one tower processes user features/history and the other processes item text features, optimizing the embeddings such that their similarity predicts user engagement (clicks, purchases). Requires pairing interaction data with text data during training.
The Challenge: Fine-tuning is resource-intensive (multi-GPU setups are often needed) and demands significant ML expertise, access to labeled data, and rigorous experimentation.
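For instance, a minimal ranking fine-tune with sentence-transformers might use in-batch negatives via MultipleNegativesRankingLoss; the training pairs here are invented, and a production setup would add hard negatives, evaluation, and multi-GPU training:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Hypothetical labeled data: (search query, relevant/clicked document) pairs.
train_examples = [
    InputExample(texts=["wireless headphones", "Noise-cancelling over-ear wireless headphones"]),
    InputExample(texts=["chef knife", "Stainless steel kitchen knife set"]),
]
loader = DataLoader(train_examples, shuffle=True, batch_size=2)

# In-batch negatives: other documents in the batch serve as non-relevant examples.
loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=10)
```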
Step 4: Generating and Storing Embeddings
Once a model is ready, run inference on your text data to get the embedding vectors.
- Inference at Scale: Set up batch pipelines (often GPU-accelerated) to generate embeddings for potentially millions of items.
- Vector Storage: Store these high-dimensional vectors. Traditional databases struggle with them. Vector Databases (Pinecone, Weaviate, Milvus, etc.) are essential for efficient storage and, critically, for the fast Approximate Nearest Neighbor (ANN) search required for similarity lookups (both stages are sketched after this step).
The Challenge: Large-scale inference is computationally expensive. Deploying, managing, scaling, and securing a Vector Database adds significant operational complexity and cost.
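The sketch below pairs batch encoding with an in-process faiss index; an exact flat index stands in for a managed vector database's ANN index, and the catalog is illustrative:

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Batch-encode the catalog (in production: chunked, GPU-accelerated, checkpointed).
item_texts = ["Noise-cancelling wireless headphones", "Stainless steel knife set"]
vectors = model.encode(item_texts, normalize_embeddings=True).astype(np.float32)

# With normalized vectors, inner product equals cosine similarity.
index = faiss.IndexFlatIP(vectors.shape[1])
index.add(vectors)

query = model.encode(["bluetooth headset"], normalize_embeddings=True).astype(np.float32)
scores, ids = index.search(query, 2)
print(ids[0], scores[0])  # the headphone item should rank first
```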
Step 5: Integrating Embeddings into Applications
Use the generated embeddings in your live system.
- Similarity Search: Build services that query the Vector Database in real time to find similar items or users (the serving flow is sketched after this step).
- Feature Input: Fetch embeddings (from the Vector DB or a feature store) in real-time to feed as input features into a final ranking model (e.g., an LTR model).
The Challenge: Requires building low-latency microservices for querying/fetching embeddings. Ensuring data consistency and low latency across multiple systems (application DB, Vector DB, ranker) is hard.
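Here is a sketch of that serving path under simplified assumptions: the ranker object and its predict method are hypothetical stand-ins for a trained LTR model, and the index is the faiss index from the previous step:

```python
import numpy as np

def recommend(query_text: str, model, index, item_ids, ranker, k: int = 10):
    """Hypothetical serving path: ANN retrieval, then re-ranking."""
    # 1. Embed the incoming query (latency-critical: keep the model warm).
    q = model.encode([query_text], normalize_embeddings=True).astype(np.float32)

    # 2. Retrieve candidates from the vector index.
    k = min(k, index.ntotal)
    scores, idx = index.search(q, k)
    candidates = [item_ids[i] for i in idx[0]]

    # 3. Feed retrieval scores (plus other features in practice) to the ranker.
    features = np.column_stack([scores[0]])   # placeholder feature matrix
    final_scores = ranker.predict(features)   # hypothetical LTR model
    return [item for _, item in sorted(zip(final_scores, candidates), reverse=True)]
```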
Step 6: Handling Maintenance and Edge Cases
- Nulls/Missing Text: Define strategies for items lacking text, such as zero vectors or default embeddings (one option is sketched after this list).
- Model Retraining & Updates: Periodically retrain models, regenerate all embeddings, and update the Vector DB, ideally without downtime.
- Cost Management: GPUs and specialized databases contribute significantly to infrastructure costs.
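As one example of the null-handling strategies above, a zero-vector fallback keeps downstream math well-defined; the dimension matches all-MiniLM-L6-v2 and is illustrative:

```python
import numpy as np

EMBEDDING_DIM = 384  # all-MiniLM-L6-v2's output size; adjust for your model

def embed_or_default(text, model):
    """Return a real embedding, or a zero vector for items with no usable text."""
    if text is None or not text.strip():
        # A zero vector's inner product with any vector is 0,
        # so the item never surfaces as spuriously "similar".
        return np.zeros(EMBEDDING_DIM, dtype=np.float32)
    return model.encode(text, normalize_embeddings=True).astype(np.float32)
```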
The Shaped Approach: Automated & Flexible Language Feature Engineering
The DIY path for language features is a major engineering undertaking. Shaped integrates state-of-the-art language understanding directly into its platform, offering both automated simplicity and expert-level flexibility.
How Shaped Streamlines Language Feature Engineering:
- Automated Processing (Default): Simply include raw text columns (title, description, etc.) in your fetch.items query. Shaped automatically preprocesses this text and uses its built-in transformer-based language models to generate internal representations (embeddings).
- Native Integration: These language-derived features are natively combined with collaborative signals (user interactions) and other metadata within Shaped's unified ranking models. For standard ranking and relevance tasks, you typically don't need to manage embedding generation, Vector Databases, or feature joining manually.
- Implicit Fine-tuning: Shaped's training process automatically optimizes the use of language features alongside behavioral signals to improve relevance for your specific objectives (clicks, conversions, etc.).
- Flexibility via Hugging Face Integration: For users needing specific capabilities or more control, Shaped allows you to override the default language model. By setting the language_model_name parameter in your model YAML, you can specify any compatible model URI from Hugging Face (or supported custom providers like Jina AI, Nomic AI).
- Use Cases: Select specific Sentence Transformers for similarity tasks (sentence-transformers/all-MiniLM-L6-v2), choose multilingual models (sentence-transformers/distiluse-base-multilingual-cased-v2), or leverage multimodal CLIP models (openai/clip-vit-base-patch32) to embed both text and images for cross-modal search.
- How it Works: Shaped downloads the specified model and uses it to generate the internal embeddings for text (and optionally image) fields you provide. These embeddings are then seamlessly used by downstream ranking policies within Shaped.
- Managed Infrastructure & Scale: Shaped transparently handles the underlying compute (including GPUs needed for transformer models), storage, and serving infrastructure for both the default and user-specified Hugging Face models.
- Graceful Handling of Missing Data: Designed to handle missing text fields without requiring manual imputation.
Leveraging Language Features with Shaped
Let's see how easy it is to incorporate language features, both automatically and with specific model selection.
Goal 1: Automatically use product descriptions to improve recommendations.
Goal 2: Explicitly use a specific multilingual Sentence Transformer model.
1. Ensure Data is Connected: Assume item_metadata (with description_en, description_fr) and user_interactions are connected.
2. Define Shaped Models (YAML):
- Example 1: Automatic Language Handling
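The definition below is a sketch based on the datasets named in step 1: the overall model/connectors/fetch layout and the event columns (created_at, label) are assumptions about a typical Shaped model definition, so consult the current Shaped docs for the exact schema. The key point is that the items query simply includes the raw text column:

```yaml
model:
  name: product_recs_auto_language
connectors:
  - type: Dataset
    id: user_interactions
    name: user_interactions
  - type: Dataset
    id: item_metadata
    name: item_metadata
fetch:
  events: |
    SELECT user_id, item_id, created_at, label
    FROM user_interactions
  items: |
    -- Including the raw text column is all that's needed: Shaped
    -- embeds it automatically with its default language model.
    SELECT item_id, description_en
    FROM item_metadata
```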
- Example 2: Specifying a Hugging Face Model (Multilingual)
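This variant adds the language_model_name override described earlier; the parameter name comes from Shaped's configuration, but its exact placement in the YAML hierarchy is an assumption here, so verify against the docs:

```yaml
model:
  name: product_recs_multilingual
  language_model_name: sentence-transformers/distiluse-base-multilingual-cased-v2
connectors:
  - type: Dataset
    id: user_interactions
    name: user_interactions
  - type: Dataset
    id: item_metadata
    name: item_metadata
fetch:
  events: |
    SELECT user_id, item_id, created_at, label
    FROM user_interactions
  items: |
    -- Both language columns feed the multilingual encoder.
    SELECT item_id, description_en, description_fr
    FROM item_metadata
```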
3. Create the Models & Monitor Training:
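Assuming the Shaped CLI, model creation and monitoring might look like the following; command names and flags should be checked against the current CLI reference:

```bash
shaped create-model --file product_recs_auto_language.yaml
shaped create-model --file product_recs_multilingual.yaml

# Poll until the model finishes training and becomes active.
shaped view-model --model-name product_recs_multilingual
```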
4. Use Standard Shaped APIs: Call rank, similar_items, etc., using the appropriate model name. The API call remains simple, but the underlying model's relevance calculations are now powered by sophisticated language understanding (either Shaped's default or your specified Hugging Face model).
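For illustration, a raw HTTP call to the rank endpoint might look like the sketch below; the URL, header, and payload fields are assumptions, so follow the Shaped API reference for the exact contract:

```python
import requests

API_KEY = "YOUR_API_KEY"
MODEL = "product_recs_multilingual"  # or product_recs_auto_language

# Illustrative endpoint shape; see the Shaped API reference.
response = requests.post(
    f"https://api.shaped.ai/v1/models/{MODEL}/rank",
    headers={"x-api-key": API_KEY},
    json={"user_id": "user_123", "limit": 10},
)
print(response.json())  # ranked items, now informed by language understanding
```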
Conclusion: Harness Language Power, Minimize NLP Pain
Language data is a treasure trove for relevance, but extracting its value traditionally requires deep NLP expertise, complex pipelines, costly infrastructure (GPUs, Vector DBs), and constant maintenance.
Shaped revolutionizes language feature engineering. Its automated approach allows you to benefit from advanced language understanding simply by including text fields in your data. For those needing more control, the seamless Hugging Face integration provides access to a vast library of state-of-the-art models with minimal configuration. In both scenarios, Shaped manages the underlying complexity, allowing you to focus on your data and business logic, not on building and maintaining intricate NLP pipelines.
Ready to unlock the power of your text data for superior search and recommendations?
Request a demo of Shaped today to see how easily you can leverage language features. Or, start exploring immediately with our free trial sandbox.