SOAR: Orthogonality-Amplified ANN Indexing

SOAR (Spilling with Orthogonality-Amplified Residuals), introduced by Google researchers in the NeurIPS 2023 paper "SOAR: Improved Indexing for Approximate Nearest Neighbor Search," is a data indexing technique for approximate nearest neighbor (ANN) search. It extends earlier redundancy-based approaches such as spill trees, but optimizes the multiple redundant representations jointly rather than independently, significantly improving overall index quality.

Orthogonality-Amplified Residuals Explained

SOAR's key innovation lies in its orthogonality-amplified residual loss, which optimizes each representation to compensate for cases where others perform poorly. 

This approach addresses a weakness of naive redundancy in vector quantization (VQ) assignments: when each representation is optimized independently, the residuals tend to point in similar directions, so the score errors they induce are correlated, and the queries on which the primary assignment performs worst are often the very queries on which the backup assignments also perform poorly. SOAR therefore assigns the secondary representation using a weighted loss of the form:

ℓ_SOAR(r′) = E_q[ w(⟨q, r⟩) · ⟨q, r′⟩² ]

where r′ is the new (secondary) residual, r is the original (primary) residual, and w is a weight function emphasizing queries closely aligned with r, i.e., exactly those for which the primary assignment's quantized score is least accurate. This loss encourages orthogonality between residuals, as captured by the relationship:

‖proj_r r′‖ = ‖r′‖ · ρ(⟨q, r⟩, ⟨q, r′⟩)

where ρ(·, ·) is the Pearson correlation coefficient between the two score errors ⟨q, r⟩ and ⟨q, r′⟩ over the query distribution. Shrinking the projection of r′ onto r therefore directly decorrelates the secondary error from the primary one. By minimizing correlated errors in quantized scores, SOAR achieves state-of-the-art ANN benchmark performance while maintaining fast indexing times and low memory consumption.
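
To see the decorrelation effect concretely, here is a small NumPy sketch (purely illustrative; the isotropic query sampling and the toy residuals are our own assumptions, not the paper's experimental setup). It checks that the Pearson correlation between the score errors ⟨q, r⟩ and ⟨q, r′⟩ matches the cosine between the residuals, so a near-orthogonal r′ yields nearly uncorrelated errors:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64

# Primary residual r and two candidate secondary residuals:
# one nearly parallel to r, one orthogonalized against it.
r = rng.normal(size=d)
r_parallel = r + 0.1 * rng.normal(size=d)
r_orth = rng.normal(size=d)
r_orth -= (r_orth @ r) / (r @ r) * r  # remove the component along r

# Isotropic unit-norm queries, a simplifying assumption for this demo.
q = rng.normal(size=(100_000, d))
q /= np.linalg.norm(q, axis=1, keepdims=True)

def score_error_corr(r_prime):
    # Pearson correlation between the score errors <q, r> and <q, r'>.
    return np.corrcoef(q @ r, q @ r_prime)[0, 1]

def cosine(a, b):
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

for name, r_prime in [("parallel", r_parallel), ("orthogonal", r_orth)]:
    print(f"{name:>10}: corr = {score_error_corr(r_prime):+.3f}, "
          f"cos(r, r') = {cosine(r, r_prime):+.3f}")
# The two values agree: correlated errors <=> non-orthogonal residuals.
```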

Comparison with Traditional Spill Trees

SOAR improves upon traditional spill trees in several key aspects. Spill trees perform multiple assignments at every level of the tree, so the number of stored copies of a point can grow exponentially with tree depth; SOAR instead performs its redundant assignments at a single level of the partitioning, keeping storage overhead constant with respect to depth, typically a manageable 10-20% increase. SOAR's custom, spilling-aware assignment loss also enables intelligent decisions about which vector quantization (VQ) partitions to spill to, in contrast to the binary spill-or-not decision of traditional spill trees.

Additionally, SOAR's orthogonality-amplified residual loss optimizes each representation to compensate for weaknesses in the others, addressing the correlated score errors that limit the effectiveness of independently optimized redundant representations.

  • Lower storage overhead: Constant with tree depth vs. potentially exponential in spill trees
  • Intelligent spilling: Custom loss function for choosing which VQ partitions to spill to (see the sketch after this list)
  • Improved representation quality: Orthogonality-amplified residuals reduce correlated errors
  • State-of-the-art performance: Outperforms standard VQ indices and other ANN search approaches in benchmarks
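
To illustrate how a spilling-aware assignment can work in practice, here is a minimal sketch under simplifying assumptions: a single-level partitioning with random stand-in centroids, and a closed-form proxy for the SOAR loss that charges ‖r′‖² plus λ times the squared component of r′ along the primary residual direction (the paper's exact weighting may differ). Each point picks the secondary partition that minimizes this loss rather than simply its second-nearest centroid:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 1000, 32, 16
X = rng.normal(size=(n, d))
centroids = rng.normal(size=(k, d))  # stand-in for trained k-means centers

SOAR_LAMBDA = 1.5  # spilling weight; hypothetical value, tune per dataset

# Primary assignment: plain nearest centroid.
dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
primary = dists.argmin(axis=1)

# Secondary assignment: minimize a SOAR-style loss instead of plain distance.
secondary = np.empty(n, dtype=int)
for i in range(n):
    r = X[i] - centroids[primary[i]]      # primary residual
    r_hat = r / np.linalg.norm(r)
    best_c, best_loss = -1, np.inf
    for c in range(k):
        if c == primary[i]:
            continue
        r2 = X[i] - centroids[c]          # candidate secondary residual
        # Residual norm plus an extra penalty on the component of r2
        # parallel to r, i.e., where the primary estimate is worst.
        loss = r2 @ r2 + SOAR_LAMBDA * (r_hat @ r2) ** 2
        if loss < best_loss:
            best_c, best_loss = c, loss
    secondary[i] = best_c
# Each point is then indexed under both its primary and secondary partition.
```

Compared with spilling to the second-nearest centroid, this choice steers points away from partitions whose residuals are nearly parallel to the primary one, which is the "intelligent spilling" behavior described above.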

State-of-the-Art ANN Benchmarks

SOAR's performance has been rigorously evaluated using established ANN benchmarks, demonstrating significant improvements over existing methods. On the ann-benchmarks.com platform, which provides standardized tests for various ANNS algorithms, SOAR consistently outperformed other approaches across multiple datasets and metrics. 

Notably, SOAR achieved state-of-the-art results on the glove-100 dataset, surpassing competitors in the critical trade-off between recall and queries per second.
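
For readers reproducing such comparisons, the recall-versus-QPS measurement itself is straightforward; below is a generic sketch (our own toy harness with brute-force ground truth, not the ann-benchmarks code):

```python
import time
import numpy as np

def recall_at_k(approx_ids, true_ids, k):
    # Fraction of the true top-k neighbors recovered by the approximate search.
    hits = sum(len(set(a[:k]) & set(t[:k])) for a, t in zip(approx_ids, true_ids))
    return hits / (len(true_ids) * k)

def measure_qps(search_fn, queries):
    start = time.perf_counter()
    results = search_fn(queries)
    return results, len(queries) / (time.perf_counter() - start)

rng = np.random.default_rng(0)
corpus = rng.normal(size=(10_000, 100)).astype(np.float32)
queries = rng.normal(size=(100, 100)).astype(np.float32)
k = 10

# Brute-force ground truth under the dot-product metric.
true_ids = np.argsort(-(queries @ corpus.T), axis=1)[:, :k]

def exact_search(qs):
    # Trivial baseline; substitute any ANN searcher's batched query here.
    return np.argsort(-(qs @ corpus.T), axis=1)[:, :k]

results, qps = measure_qps(exact_search, queries)
print(f"recall@{k} = {recall_at_k(results, true_ids, k):.3f}, QPS = {qps:.0f}")
```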

The Big ANN Challenge at NeurIPS 2023 further validated SOAR's capabilities: it excelled in tracks focused on filtered search, out-of-distribution data, and sparse variants of ANNS. SOAR's implementation in ScaNN (Scalable Nearest Neighbors) maintained low memory consumption and fast indexing speeds while significantly improving query throughput compared to libraries with similar indexing times, demonstrating its effectiveness in real-world ANN search scenarios that demand high accuracy, low latency, and efficient resource utilization.

Benchmark Results

SOAR demonstrates exceptional performance in standardized benchmarks:

  • Outperforms existing state-of-the-art algorithms
  • Achieves up to 4.32x improvement in search efficiency on billion-scale datasets
  • Maintains superior cost-effectiveness in both hardware and cloud deployment scenarios

SOAR: Advancing Vector Search

SOAR represents a significant advancement in vector search technology, providing ScaNN with a robust "backup" route for identifying nearest neighbors when traditional clustering-based approaches falter. This enhancement allows ScaNN to perform even faster vector searches while maintaining low index size and indexing time, resulting in an optimal balance of performance metrics.

ScaNN, with SOAR, is now available as an open-source project on GitHub and can be easily installed via pip. Furthermore, Google Cloud has incorporated ScaNN's vector search technology into its product ecosystem. Vertex AI Vector Search utilizes ScaNN to offer a fully managed, high-scale, low-latency vector similarity matching service. Additionally, AlloyDB has launched the ScaNN for AlloyDB index, providing a vector database solution built on the popular PostgreSQL-compatible database. These integrations demonstrate the practical applicability of SOAR and ScaNN in real-world, large-scale systems, paving the way for more efficient vector search to enable the next generation of machine learning applications.
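
As a quick start, a basic ScaNN index build looks like the following (the builder chain mirrors the example in the ScaNN README; the partitioning and quantization parameters here are illustrative rather than tuned, and the SOAR-specific option named in the comment is an assumption to verify against your installed version's documentation):

```python
# pip install scann  (prebuilt wheels target x86_64 Linux; see the GitHub page)
import numpy as np
import scann

rng = np.random.default_rng(0)
dataset = rng.normal(size=(100_000, 128)).astype(np.float32)
queries = rng.normal(size=(10, 128)).astype(np.float32)

# Tree-AH index in the style of the ScaNN README. Partition counts and the
# training sample size are illustrative, not tuned. Releases that include
# SOAR expose its options through the tree() stage (e.g., a spilling weight
# such as soar_lambda -- parameter name per release notes; verify against
# the documentation of your installed version).
searcher = (
    scann.scann_ops_pybind.builder(dataset, 10, "dot_product")
    .tree(num_leaves=1000, num_leaves_to_search=100, training_sample_size=50_000)
    .score_ah(2, anisotropic_quantization_threshold=0.2)
    .reorder(100)
    .build()
)

neighbors, distances = searcher.search_batched(queries)
print(neighbors.shape)  # (10, 10): top-10 neighbor ids per query
```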

For researchers and practitioners working on large-scale information retrieval, recommendation systems, or any application requiring efficient similarity search, SOAR offers a promising new direction that merits serious consideration and further exploration.
