Static search trees: 40x faster than binary search

Ragnar Groot Koerkamp · 2024-12-18

Performance Gains:

  • Achieved a 40x speedup over a standard binary search
  • Single-threaded throughput: ~2.5 ns per query, down from ~100 ns for binary search
  • Multi-threaded throughput: ~0.4 ns per query using 8 threads

Key Optimizations:

  • Eytzinger layout significantly improves cache efficiency (see the sketch after this list)
    • Places the root at index 1
    • Children of node i sit at positions 2i and 2i+1
    • The fixed index arithmetic makes descendants' cache lines predictable, so they can be prefetched ahead of the search
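
A minimal sketch of such a lower bound in Rust, assuming `a[0]` is an unused dummy slot and `a[1..]` holds the values in Eytzinger order (illustrative code, not the post's exact implementation):

```rust
/// Branchless lower bound over an Eytzinger-ordered array.
/// Returns the index of the first element >= q, or 0 if all elements are smaller.
fn eytzinger_lower_bound(a: &[u32], q: u32) -> usize {
    let mut i = 1;
    while i < a.len() {
        #[cfg(target_arch = "x86_64")]
        unsafe {
            use std::arch::x86_64::{_mm_prefetch, _MM_HINT_T0};
            // 16 u32 values fill one 64-byte cache line, so prefetching near
            // index 16*i pulls in descendants four levels ahead of the search.
            if 16 * i < a.len() {
                _mm_prefetch::<_MM_HINT_T0>(a.as_ptr().add(16 * i) as *const i8);
            }
        }
        // Branchless descent: left child 2i if q <= a[i], else right child 2i+1.
        i = 2 * i + (a[i] < q) as usize;
    }
    // Cancel the trailing right-turns of the descent to recover the answer.
    i >> (i.trailing_ones() + 1)
}
```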

SIMD Improvements:

  • Manual SIMD implementation outperforms auto-vectorization
  • AVX2 instructions compare eight 32-bit keys at once within a node
  • Counting the trailing zeros of the comparison mask with TZCNT cheaply yields the child to descend into (sketched below)
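
A hedged sketch of this idea with AVX2 intrinsics, assuming a node of eight u32 keys (not the post's exact code). AVX2 only offers signed compares, so the sign bit is flipped to get an unsigned ordering:

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// Rank of q within a node of 8 keys: the index of the first key > q.
/// Safety: `keys` must point to 8 readable u32s; requires AVX2 at runtime.
#[target_feature(enable = "avx2")]
unsafe fn node_rank(keys: *const u32, q: u32) -> usize {
    let flip = _mm256_set1_epi32(i32::MIN);
    // XOR with the sign bit turns unsigned order into signed order.
    let k = _mm256_xor_si256(_mm256_loadu_si256(keys as *const __m256i), flip);
    let qv = _mm256_xor_si256(_mm256_set1_epi32(q as i32), flip);
    // One all-ones lane per key that is greater than q.
    let gt = _mm256_cmpgt_epi32(k, qv);
    let mask = _mm256_movemask_ps(_mm256_castsi256_ps(gt)) as u32;
    // trailing_zeros compiles to TZCNT; the 1 << 8 guard yields 8
    // when every key is <= q.
    (mask | 1 << 8).trailing_zeros() as usize
}
```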

Memory Access Patterns:

  • Cache line size (64 bytes) heavily influences performance
  • Hugepages significantly reduce TLB misses (see the sketch after this list)
  • Prefetching helps hide memory latency
  • Node size B=15 provides optimal balance for modern architectures
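
One common way to get hugepages on Linux is to request transparent hugepages via `madvise`; a minimal sketch assuming the `libc` crate (the post's allocation strategy may differ):

```rust
/// Ask the kernel to back this buffer with transparent hugepages (Linux).
/// Best-effort only: the kernel may ignore the hint, and the region should
/// ideally be page-aligned for it to take effect.
fn advise_hugepages(buf: &mut [u32]) {
    #[cfg(target_os = "linux")]
    unsafe {
        libc::madvise(
            buf.as_mut_ptr().cast::<libc::c_void>(),
            std::mem::size_of_val(buf),
            libc::MADV_HUGEPAGE,
        );
    }
    #[cfg(not(target_os = "linux"))]
    let _ = buf;
}
```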

Batching Strategy:

  • Processing multiple queries simultaneously improves throughput
  • Interleaving queries level by level keeps several memory loads in flight, reducing stalls (sketched after this list)
  • Optimal batch size depends on input size and hardware characteristics
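
A sketch of interleaved batching; `Tree`, `find_child`, and `child_index` are hypothetical stand-ins for the per-node SIMD rank and the implicit-layout index computation. All P queries advance one level per pass, so their cache misses overlap instead of serializing:

```rust
/// Placeholder tree: `height` levels, nodes addressed by implicit indices.
struct Tree {
    height: usize,
    // ... node storage elided ...
}

impl Tree {
    /// Stand-in for the per-node SIMD rank (which child to descend into).
    fn find_child(&self, _node: usize, _q: u32) -> usize {
        0 // elided
    }
    /// Stand-in for the implicit-layout child index computation.
    fn child_index(&self, node: usize, rank: usize) -> usize {
        node * 16 + 1 + rank
    }
}

/// Advance P independent queries one level per pass so their memory
/// accesses overlap instead of serializing.
fn batch_lookup<const P: usize>(tree: &Tree, queries: &[u32; P]) -> [usize; P] {
    let mut nodes = [0usize; P]; // every query starts at the root
    for _ in 0..tree.height {
        for i in 0..P {
            let rank = tree.find_child(nodes[i], queries[i]);
            nodes[i] = tree.child_index(nodes[i], rank);
        }
    }
    nodes // positions in the bottom layer
}
```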

Tree Layout Innovations:

  • S+ trees store all values in the bottom layer, with internal nodes serving only as an index (see the layout sketch after this list)
  • Prefix partitioning on the high bits can skip the top tree levels, assuming a roughly uniform distribution
  • Compact subtrees reduce memory footprint
  • Left-tree layout improves locality for certain access patterns
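
For intuition, the implicit layout generalizes Eytzinger's 2i/2i+1 rule to a higher fan-out; a sketch with B = 15 keys per node (16-way branching), where repeatedly following `child` eventually lands in the bottom, data-holding layer of an S+ tree:

```rust
const B: usize = 15; // keys per node, giving (B + 1) = 16-way branching

/// Index of the i-th child of `node` in an implicit (B + 1)-ary layout,
/// the high-fan-out analogue of Eytzinger's 2i / 2i + 1 rule.
fn child(node: usize, i: usize) -> usize {
    node * (B + 1) + 1 + i
}
```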

Implementation Details:

  • Uses 32-bit unsigned integers for values
  • Branchless implementation reduces pipeline stalls (see the sketch after this list)
  • Pointer arithmetic optimizations reduce instruction count
  • Memory alignment considerations affect performance
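
For comparison, a branchless binary search over a plain sorted slice: the comparison result feeds an add instead of a jump, which typically compiles to a conditional move:

```rust
/// Branchless lower bound: index of the first element >= q.
/// Caveat: returns len - 1 rather than len when q exceeds every element.
fn branchless_lower_bound(a: &[u32], q: u32) -> usize {
    let mut base = 0;
    let mut len = a.len();
    while len > 1 {
        let half = len / 2;
        // The comparison result (0 or 1) scales the step: no branch needed.
        base += (a[base + half - 1] < q) as usize * half;
        len -= half;
    }
    base
}
```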

Limitations and Trade-offs:

  • Performance becomes memory-bound with multiple threads
  • Prefix partitioning benefits diminish with skewed input
  • Additional memory requirements can be significant for larger inputs
  • Static nature means no dynamic updates supported

Future Optimization Possibilities:

  • Interpolation search could reduce iterations from $O(\log n)$ to $O(\log \log n)$ (one proportional-guess step is sketched below)
  • Packing the data into 16-bit values could increase the branching factor, since twice as many keys fit per cache line
  • Query sorting could improve cache utilization
  • Range queries could be optimized for specific use cases
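
A hedged sketch of the interpolation idea, assuming roughly uniformly distributed keys: instead of halving the range, guess the position proportionally, which is where the $O(\log \log n)$ behavior comes from:

```rust
/// One interpolation step: estimate the position of q in a non-empty sorted
/// slice, assuming keys are spread roughly uniformly between a[0] and a[len-1].
fn interpolate(a: &[u32], q: u32) -> usize {
    let lo = a[0] as u64;
    let hi = *a.last().unwrap() as u64;
    if hi == lo {
        return 0;
    }
    // Clamp q into [lo, hi] so the estimate stays within bounds.
    let q = (q as u64).clamp(lo, hi);
    ((q - lo) * (a.len() as u64 - 1) / (hi - lo)) as usize
}
```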