Semantic Tokenization for Generative Retrieval: Introducing GenRet

Generative retrieval is emerging as a transformative approach to document retrieval, leveraging generative language models (LMs) to directly produce ranked lists of document identifiers (docids) for user queries. The paper "Learning to Tokenize for Generative Retrieval" introduces GenRet, a novel framework that addresses critical limitations in existing generative retrieval methods. GenRet employs a discrete auto-encoding approach to learn semantic docids, enabling more effective end-to-end modeling of document retrieval than the conventional index-then-retrieve paradigm.

Challenges in Generative Retrieval

Generative retrieval relies heavily on the concept of document tokenization, where each document is assigned a unique identifier (docid). Existing methods often use rule-based tokenization strategies, such as titles, URLs, or clustering results from embeddings. However, these approaches suffer from:

  • Ad-hoc nature: Rule-based methods fail to generalize well across unseen documents.
  • Semantic inadequacy: Tokenization often does not capture the full semantic richness of documents.
  • Poor generalization: Models trained on labeled datasets struggle with retrieval tasks involving unlabeled or unseen documents.

Discrete Auto-Encoding: The Technical Heart of GenRet

GenRet proposes a learning-based document tokenization method that encodes complete document semantics into discrete docids using a discrete auto-encoding approach. At each timestep t, the tokenization model encodes the document and any previously generated tokens to produce a latent representation:

d_t = Decoder(Encoder(d), z_{<t}) ∈ ℝ^D

The model then uses a codebook E_t ∈ ℝ^{K×D}, which contains K embedding vectors of dimension D, to map this representation to a discrete token. This process compresses document information into a sequence of discrete tokens that serves as the document identifier (a minimal sketch of this quantization step appears after the component list below). The framework consists of three key components:

1. Document Tokenization Model: Converts documents into semantic docids, which are then used by the reconstruction model to rebuild the original document.

2. Generative Retrieval Model: Generates relevant docids for a query.

3. Reconstruction Model: Reconstructs original documents from docids to ensure semantic fidelity.
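
To make the codebook step concrete, here is a minimal sketch of how a continuous step representation can be mapped to a discrete docid token. It assumes nearest-code selection by dot-product score; the function and variable names are illustrative rather than the paper's implementation.

```python
import torch

def quantize(d_t: torch.Tensor, codebook: torch.Tensor):
    """Map a continuous latent d_t of shape (D,) to a discrete token using a
    codebook of shape (K, D): pick the code with the highest dot-product score."""
    scores = codebook @ d_t            # (K,) similarity of d_t to each code
    token = int(scores.argmax())       # discrete docid token z_t
    return token, codebook[token]      # token plus its embedding for the next step

# Toy usage with D=8 and K=4 codes; in GenRet, d_t would come from the
# tokenization model at step t.
E_t = torch.randn(4, 8)
d_t = torch.randn(8)
z_t, e_t = quantize(d_t, E_t)
```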

Key Innovations

Progressive Training Scheme | Image Source: "Learning to Tokenize for Generative Retrieval"

Progressive Training Scheme: Stabilizes training by incrementally optimizing docid prefixes while keeping earlier prefixes fixed.
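
To illustrate the prefix-by-prefix idea, the toy sketch below gives each docid position its own classification head over K codes and freezes the heads of earlier positions while the current one is optimized. Everything here (the linear encoder, the random placeholder targets, the schedule) is a simplification for illustration, not the paper's training code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

D, K, M = 32, 16, 3                           # latent dim, codebook size, docid length
encoder = nn.Linear(64, D)                    # stand-in for the document encoder
heads = nn.ModuleList([nn.Linear(D, K) for _ in range(M)])  # one head per docid position

docs = torch.randn(8, 64)                     # toy batch of document features
targets = torch.randint(0, K, (M, 8))         # placeholder docid tokens per position

for t in range(M):                            # stage t optimizes docid position t
    for head in heads[:t]:                    # earlier prefix positions stay fixed
        for p in head.parameters():
            p.requires_grad_(False)
    params = list(encoder.parameters()) + list(heads[t].parameters())
    opt = torch.optim.Adam(params, lr=1e-3)
    for _ in range(10):                       # a few optimization steps per stage
        logits = heads[t](encoder(docs))      # scores over the K codes at position t
        loss = F.cross_entropy(logits, targets[t])
        opt.zero_grad()
        loss.backward()
        opt.step()
```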

Embedding visualization: t-SNE visualization of the codebook embeddings and document embeddings on the NQ320K dataset; the codebook embeddings are uniformly scattered across the document representation space. | Image Source: "Learning to Tokenize for Generative Retrieval"

Diverse Clustering Techniques:

  • Codebook Initialization: Ensures balanced segmentation of the semantic space using constrained clustering algorithms.
  • Docid Reassignment: Prevents duplicate assignments and enhances diversity through Sinkhorn-Knopp normalization (a sketch of this balancing step follows this list).
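
Below is a minimal sketch of Sinkhorn-Knopp-style balancing applied to a document-to-code score matrix, in the spirit of the reassignment step above: rows (documents) are normalized to sum to one while columns (codes) are pushed toward uniform usage, which discourages many documents from collapsing onto the same docid. GenRet's exact objective and schedule may differ; treat this as an illustration of the normalization itself.

```python
import torch

def sinkhorn_knopp(scores: torch.Tensor, n_iters: int = 3, eps: float = 0.3) -> torch.Tensor:
    """Balance an (N_docs, K) score matrix so that rows sum to 1 and columns
    are used roughly uniformly (soft assignments of documents to codes)."""
    Q = torch.exp((scores - scores.max()) / eps)   # subtract max for numerical stability
    Q = Q / Q.sum()
    N, K = Q.shape
    for _ in range(n_iters):
        Q = Q / (Q.sum(dim=0, keepdim=True) * K)   # columns: push toward mass 1/K each
        Q = Q / (Q.sum(dim=1, keepdim=True) * N)   # rows: push toward mass 1/N each
    return Q * N                                    # rescale so each row sums to 1

# Usage: balanced hard assignment of 8 toy documents to 4 codes.
scores = torch.randn(8, 4)
codes = sinkhorn_knopp(scores).argmax(dim=1)
```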

Semantic Docids for Generative Retrieval

GenRet's approach to learning semantic docids addresses a fundamental challenge in generative retrieval: capturing the complete semantic information of a document in a compact, discrete representation. Unlike rule-based methods that rely on fixed tokenization schemes, GenRet learns to encode document semantics into docids through an end-to-end training process.

The semantic docids generated by GenRet offer several advantages:

  • Improved retrieval performance: By encoding rich semantic information, these docids enable more accurate matching between queries and relevant documents.
  • Generalizability: The learned tokenization method demonstrates better performance on unseen documents compared to fixed rule-based approaches.
  • Efficient representation: GenRet compresses document semantics into short discrete representations, facilitating faster retrieval and reduced storage requirements.
  • End-to-end optimization: The semantic docids are learned jointly with the retrieval model, allowing for a more cohesive and optimized retrieval system.

Progressive Training for Autoregressive Models

GenRet's prefix-by-prefix scheme is one instance of a broader idea: progressive training for autoregressive and other generative models, which improves output quality while keeping learning stable. The general approach gradually increases model complexity by adding layers and expanding the input/output dimensions over time. In image generation, for example, training begins with low-resolution images (e.g., 4x4 pixels) and progressively scales up to larger sizes, allowing the model to learn coarse-to-fine details efficiently.

Key aspects of progressive training for autoregressive models include:

  • Phased layer addition: New layers are smoothly integrated using skip connections and weighted contributions, controlled by an alpha parameter that increases from 0 to 1 over training iterations (see the sketch after this list).
  • Continuous trainability: All layers, including existing ones, remain trainable throughout the process, ensuring adaptability as the model grows.
  • Time efficiency: By combining techniques like depthwise separable convolutions and super-resolution GANs, training time can be significantly reduced, especially for later, higher-resolution stages.
  • Improved stability: The gradual increase in model complexity helps avoid sudden shocks to well-trained lower-resolution layers, leading to more stable convergence.
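
As a concrete illustration of the phased layer addition above, here is a minimal sketch of alpha-blended fade-in: a new block's output is mixed with the existing path, and alpha is ramped from 0 to 1 by the training loop. The module and the linear ramp are illustrative choices, not tied to GenRet's implementation.

```python
import torch
import torch.nn as nn

class FadeIn(nn.Module):
    """Blend a newly added block into an already-trained path.

    `alpha` is increased from 0 to 1 by the training loop, so the new block's
    contribution grows gradually instead of shocking the trained layers."""
    def __init__(self, old_path: nn.Module, new_path: nn.Module):
        super().__init__()
        self.old_path = old_path
        self.new_path = new_path
        self.alpha = 0.0

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return (1.0 - self.alpha) * self.old_path(x) + self.alpha * self.new_path(x)

# Usage: wrap the existing path and the new block, then anneal alpha per step.
fade = FadeIn(nn.Identity(), nn.Linear(16, 16))
x = torch.randn(2, 16)
for step in range(100):
    fade.alpha = min(1.0, step / 50)   # linear ramp over the first 50 steps
    y = fade(x)
```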

Experimental Results: Outperforming the Competition

GenRet was evaluated on three benchmark datasets:

  • NQ320K: A Wikipedia-based factoid QA dataset.
  • MS MARCO: A web search dataset with diverse queries and documents.
  • BEIR: A heterogeneous benchmark covering six distinct retrieval tasks.

Dataset Details | Table Source: "Learning to Tokenize for Generative Retrieval"

Results

1. Performance on Seen vs Unseen Data (NQ320K):

  • GenRet achieved state-of-the-art results, significantly outperforming dense and generative baselines on both seen and unseen test sets.
  • It demonstrated robust generalization capabilities, bridging the gap between dense retrieval's precision and generative retrieval's flexibility.

2. Cross-Dataset Generalization (BEIR):

  • GenRet outperformed sparse and dense baselines like BM25 and Sentence-T5 on diverse downstream tasks.
  • It showed resilience in adapting to datasets with poorly defined metadata (e.g., BEIR-Covid).

3. Efficiency Analysis (MS MARCO):

  • GenRet exhibited lower memory requirements compared to dense retrieval methods due to its reliance on model parameters rather than large embeddings.
  • Online latency was reduced by generating shorter docids, making it suitable for real-time applications (see the sketch below).
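
One way to read the latency result is that generative retrievers typically decode docids under a constraint that only valid prefixes can be produced, so a shorter docid means fewer constrained decoding steps per query. The sketch below builds such a prefix-to-allowed-tokens map for a toy set of docids; it is illustrative and not taken from the paper.

```python
from collections import defaultdict

def build_prefix_map(docids):
    """Map every observed docid prefix to the set of tokens allowed next."""
    allowed = defaultdict(set)
    for docid in docids:
        for t in range(len(docid)):
            allowed[docid[:t]].add(docid[t])
    return allowed

corpus_docids = [(3, 1, 4), (3, 1, 5), (2, 7, 1)]   # toy three-token docids
allowed = build_prefix_map(corpus_docids)

print(allowed[()])       # tokens allowed at the first step: {2, 3}
print(allowed[(3, 1)])   # tokens allowed after prefix (3, 1): {4, 5}
```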

Analytical Experiments

Left: Docid distribution on NQ320K; docids are sorted by assignment frequency. Right: Ablation study on NQ320K. | Image Source: "Learning to Tokenize for Generative Retrieval"

Docid Diversity: GenRet achieved superior diversity in docid assignments compared to baseline methods. This diversity ensures better semantic segmentation and reduces conflicts during retrieval.
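
One simple way to quantify this kind of diversity is the entropy of the docid assignment frequencies: the more uniformly docids are used, the higher the entropy. The snippet below is an illustrative measure, not necessarily the statistic reported in the paper.

```python
import math
from collections import Counter

def assignment_entropy(docids) -> float:
    """Shannon entropy (in bits) of how often each docid is assigned."""
    counts = Counter(docids)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Toy comparison: a balanced assignment scores higher than a collapsed one.
print(assignment_entropy(["a", "b", "c", "d"]))   # 2.0 bits
print(assignment_entropy(["a", "a", "a", "b"]))   # ~0.81 bits
```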

Ablation Studies: Removing components like progressive training or diverse clustering resulted in significant performance drops, especially on unseen data. This highlights the importance of GenRet's integrated design.

Qualitative Analysis

Left: Document titles along with their corresponding docids; documents with similar docids tend to have more relevant content. Right: Word clouds of documents grouped by docid prefix, illustrating that different docid positions correspond to different levels of information and that semantics within each cluster are closely related. | Image Source: "Learning to Tokenize for Generative Retrieval"

The hierarchical structure within learned docids was evident:

  • Documents sharing prefixes exhibited semantic similarity at varying granularities.
  • Word clouds grouped by docid prefixes visually demonstrated how GenRet captures nuanced relationships between documents (a toy illustration follows).
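
The prefix analysis can be reproduced in a few lines: group documents by the first token(s) of their docids and inspect each group. The titles and docids below are invented purely for illustration.

```python
from collections import defaultdict

docid_of = {                               # toy titles with invented docids
    "Transformer (machine learning)": (17, 3, 42),
    "BERT (language model)":          (17, 3, 7),
    "Eiffel Tower":                   (5, 21, 9),
}

clusters = defaultdict(list)
for title, docid in docid_of.items():
    clusters[docid[:1]].append(title)      # group by the first docid token

for prefix, titles in sorted(clusters.items()):
    print(prefix, titles)
```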

Implications for Recommendation Systems

Recommendation systems often face challenges similar to those in document retrieval—matching user queries with relevant items across diverse datasets. GenRet's ability to learn semantic representations and generalize across unseen data offers several advantages:

1. Improved Cold Start Handling: Semantic docids can enhance recommendations for new items without extensive retraining.

2. Scalability: The compact representation of items as docids reduces computational overhead, enabling real-time recommendations.

3. Cross-Domain Adaptability: GenRet's success across heterogeneous datasets suggests potential for multi-domain recommendation systems.

Implications for the Future of Search

While GenRet establishes a strong foundation for generative retrieval, several avenues remain unexplored:

  • Scaling to larger datasets with billions of documents.
  • Dynamic adaptation of docid prefixes for evolving document collections.
  • Integration with pre-trained large-scale language models for enhanced tokenization.

Advancing Generative IR Paradigms

GenRet represents a significant advancement in generative retrieval, addressing the critical challenge of document tokenization through a novel learning approach. By encoding complete document semantics into discrete docids, GenRet outperforms existing rule-based methods, particularly on unseen documents. The discrete auto-encoding framework, comprising tokenization, reconstruction, and retrieval models, enables end-to-end optimization of the retrieval process.

GenRet's success on benchmark datasets like NQ320K, MS MARCO, and BEIR demonstrates its potential to revolutionize information retrieval systems. As the field progresses, future research may focus on integrating GenRet with multi-modal data, scaling to larger document collections, and exploring zero-shot capabilities. These advancements could lead to more efficient, accurate, and versatile retrieval systems, bridging the gap between traditional index-based methods and emerging generative AI technologies in information retrieval.

Scaling Generative Retrieval Frontiers

GenRet's success opens up several promising research directions for advancing generative retrieval. Scaling to billion-document datasets presents a key challenge, requiring efficient indexing and retrieval strategies like those proposed for ColPali. One approach could leverage Vespa's phased retrieval pipeline with binary quantization to enable MaxSim computations over large-scale collections. Dynamic adaptation of docid prefixes is another critical area, potentially employing techniques like progressive training schemes to handle evolving document sets. Integration with pre-trained large language models (LLMs) could enhance tokenization by leveraging their semantic understanding. This could be achieved through methods like task-adaptive tokenization, which optimizes sampling probabilities based on task-specific data. Future work should also explore generalization to diverse document types and domains, perhaps utilizing multi-grained tokenization approaches like AMBERT to capture both fine and coarse-grained semantic information.

