Static search trees: 40x faster than binary search

Ragnar Groot Koerkamp · 2024-12-18

Performance Gains:

  • Achieved a 40x speedup over a standard binary search
  • Single-threaded throughput: ~2.5 ns per query, down from ~100 ns for binary search
  • Multi-threaded throughput: ~0.4 ns per query using 8 threads

Key Optimizations:

  • Eytzinger layout significantly improves cache efficiency (see the sketch after this list)
    • Places the root at index 1
    • Children of node i sit at positions 2i and 2i+1
    • The fixed index arithmetic makes descendants' cache lines predictable, so they can be prefetched ahead of the search
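
A minimal sketch of such a lower bound in Rust, assuming `a[0]` is an unused dummy slot and `a[1..]` holds the values in Eytzinger order (illustrative code, not the post's exact implementation):

```rust
/// Branchless lower bound over an Eytzinger-ordered array.
/// Returns the index of the first element >= q, or 0 if all elements are smaller.
fn eytzinger_lower_bound(a: &[u32], q: u32) -> usize {
    let mut i = 1;
    while i < a.len() {
        #[cfg(target_arch = "x86_64")]
        unsafe {
            use std::arch::x86_64::{_mm_prefetch, _MM_HINT_T0};
            // 16 u32 values fill one 64-byte cache line, so prefetching near
            // index 16*i pulls in descendants four levels ahead of the search.
            if 16 * i < a.len() {
                _mm_prefetch::<_MM_HINT_T0>(a.as_ptr().add(16 * i) as *const i8);
            }
        }
        // Branchless descent: left child 2i if q <= a[i], else right child 2i+1.
        i = 2 * i + (a[i] < q) as usize;
    }
    // Cancel the trailing right-turns of the descent to recover the answer.
    i >> (i.trailing_ones() + 1)
}
```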

SIMD Improvements:

  • Manual SIMD implementation outperforms auto-vectorization
  • AVX2 instructions compare eight 32-bit keys at once within a node
  • Counting the trailing zeros of the comparison mask with TZCNT cheaply yields the child to descend into (sketched below)
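
A hedged sketch of this idea with AVX2 intrinsics, assuming a node of eight u32 keys (not the post's exact code). AVX2 only offers signed compares, so the sign bit is flipped to get an unsigned ordering:

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// Rank of q within a node of 8 keys: the index of the first key > q.
/// Safety: `keys` must point to 8 readable u32s; requires AVX2 at runtime.
#[target_feature(enable = "avx2")]
unsafe fn node_rank(keys: *const u32, q: u32) -> usize {
    let flip = _mm256_set1_epi32(i32::MIN);
    // XOR with the sign bit turns unsigned order into signed order.
    let k = _mm256_xor_si256(_mm256_loadu_si256(keys as *const __m256i), flip);
    let qv = _mm256_xor_si256(_mm256_set1_epi32(q as i32), flip);
    // One all-ones lane per key that is greater than q.
    let gt = _mm256_cmpgt_epi32(k, qv);
    let mask = _mm256_movemask_ps(_mm256_castsi256_ps(gt)) as u32;
    // trailing_zeros compiles to TZCNT; the 1 << 8 guard yields 8
    // when every key is <= q.
    (mask | 1 << 8).trailing_zeros() as usize
}
```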

Memory Access Patterns:

  • Cache line size (64 bytes) heavily influences performance
  • Hugepages significantly reduce TLB misses (see the sketch after this list)
  • Prefetching helps hide memory latency
  • Node size B=15 provides optimal balance for modern architectures
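
One common way to get hugepages on Linux is to request transparent hugepages via `madvise`; a minimal sketch assuming the `libc` crate (the post's allocation strategy may differ):

```rust
/// Ask the kernel to back this buffer with transparent hugepages (Linux).
/// Best-effort only: the kernel may ignore the hint, and the region should
/// ideally be page-aligned for it to take effect.
fn advise_hugepages(buf: &mut [u32]) {
    #[cfg(target_os = "linux")]
    unsafe {
        libc::madvise(
            buf.as_mut_ptr().cast::<libc::c_void>(),
            std::mem::size_of_val(buf),
            libc::MADV_HUGEPAGE,
        );
    }
    #[cfg(not(target_os = "linux"))]
    let _ = buf;
}
```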

Batching Strategy:

  • Processing multiple queries simultaneously improves throughput
  • Interleaving queries level by level keeps several memory loads in flight, reducing stalls (sketched after this list)
  • Optimal batch size depends on input size and hardware characteristics
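
A sketch of interleaved batching; `Tree`, `find_child`, and `child_index` are hypothetical stand-ins for the per-node SIMD rank and the implicit-layout index computation. All P queries advance one level per pass, so their cache misses overlap instead of serializing:

```rust
/// Placeholder tree: `height` levels, nodes addressed by implicit indices.
struct Tree {
    height: usize,
    // ... node storage elided ...
}

impl Tree {
    /// Stand-in for the per-node SIMD rank (which child to descend into).
    fn find_child(&self, _node: usize, _q: u32) -> usize {
        0 // elided
    }
    /// Stand-in for the implicit-layout child index computation.
    fn child_index(&self, node: usize, rank: usize) -> usize {
        node * 16 + 1 + rank
    }
}

/// Advance P independent queries one level per pass so their memory
/// accesses overlap instead of serializing.
fn batch_lookup<const P: usize>(tree: &Tree, queries: &[u32; P]) -> [usize; P] {
    let mut nodes = [0usize; P]; // every query starts at the root
    for _ in 0..tree.height {
        for i in 0..P {
            let rank = tree.find_child(nodes[i], queries[i]);
            nodes[i] = tree.child_index(nodes[i], rank);
        }
    }
    nodes // positions in the bottom layer
}
```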

Tree Layout Innovations:

  • S+ trees store all values in the bottom layer, with internal nodes serving only as an index (see the layout sketch after this list)
  • Prefix partitioning on the high bits can skip the top tree levels, assuming a roughly uniform distribution
  • Compact subtrees reduce memory footprint
  • Left-tree layout improves locality for certain access patterns
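
For intuition, the implicit layout generalizes Eytzinger's 2i/2i+1 rule to a higher fan-out; a sketch with B = 15 keys per node (16-way branching), where repeatedly following `child` eventually lands in the bottom, data-holding layer of an S+ tree:

```rust
const B: usize = 15; // keys per node, giving (B + 1) = 16-way branching

/// Index of the i-th child of `node` in an implicit (B + 1)-ary layout,
/// the high-fan-out analogue of Eytzinger's 2i / 2i + 1 rule.
fn child(node: usize, i: usize) -> usize {
    node * (B + 1) + 1 + i
}
```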

Implementation Details:

  • Uses 32-bit unsigned integers for values
  • Branchless implementation reduces pipeline stalls (see the sketch after this list)
  • Pointer arithmetic optimizations reduce instruction count
  • Memory alignment considerations affect performance
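
For comparison, a branchless binary search over a plain sorted slice: the comparison result feeds an add instead of a jump, which typically compiles to a conditional move:

```rust
/// Branchless lower bound: index of the first element >= q.
/// Caveat: returns len - 1 rather than len when q exceeds every element.
fn branchless_lower_bound(a: &[u32], q: u32) -> usize {
    let mut base = 0;
    let mut len = a.len();
    while len > 1 {
        let half = len / 2;
        // The comparison result (0 or 1) scales the step: no branch needed.
        base += (a[base + half - 1] < q) as usize * half;
        len -= half;
    }
    base
}
```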

Limitations and Trade-offs:

  • Performance becomes memory-bound with multiple threads
  • Prefix partitioning benefits diminish with skewed input
  • Additional memory requirements can be significant for larger inputs
  • Static nature means no dynamic updates supported

Future Optimization Possibilities:

  • Interpolation search could reduce iterations from $O(\log n)$ to $O(\log \log n)$ (one proportional-guess step is sketched below)
  • Packing the data into 16-bit values could increase the branching factor, since twice as many keys fit per cache line
  • Query sorting could improve cache utilization
  • Range queries could be optimized for specific use cases
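
A hedged sketch of the interpolation idea, assuming roughly uniformly distributed keys: instead of halving the range, guess the position proportionally, which is where the $O(\log \log n)$ behavior comes from:

```rust
/// One interpolation step: estimate the position of q in a non-empty sorted
/// slice, assuming keys are spread roughly uniformly between a[0] and a[len-1].
fn interpolate(a: &[u32], q: u32) -> usize {
    let lo = a[0] as u64;
    let hi = *a.last().unwrap() as u64;
    if hi == lo {
        return 0;
    }
    // Clamp q into [lo, hi] so the estimate stays within bounds.
    let q = (q as u64).clamp(lo, hi);
    ((q - lo) * (a.len() as u64 - 1) / (hi - lo)) as usize
}
```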