Static search trees: 40x faster than binary search
Performance Gains:
- Achieves a 40x speedup over a standard binary search implementation
- Single-threaded throughput: ~2.5ns per query (from ~100ns)
- Multi-threaded throughput: ~0.4ns per query using 8 threads
Key Optimizations:
- Eytzinger layout significantly improves cache efficiency
- Places root at index 1
- Children of node i at positions 2i and 2i+1
- Contiguous children enable effective prefetching of descendant nodes
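The layout above can be sketched in a few lines of C++; this is a minimal, hedged version (struct and function names are illustrative, not from the original post). The build step fills the array by in-order traversal; the search descends branchlessly and uses a GCC/Clang builtin to recover the last left turn:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Minimal Eytzinger-layout search tree (names illustrative).
// t[1] is the root; children of node i live at 2i and 2i+1.
struct Eytzinger {
    std::vector<uint32_t> t;

    void build(const std::vector<uint32_t>& sorted) {
        t.assign(sorted.size() + 1, 0);  // t[0] doubles as a "not found" sentinel
        size_t j = 0;
        fill(sorted, 1, j);
    }
    void fill(const std::vector<uint32_t>& s, size_t i, size_t& j) {
        if (i >= t.size()) return;
        fill(s, 2 * i, j);               // in-order traversal emits sorted order
        t[i] = s[j++];
        fill(s, 2 * i + 1, j);
    }
    // Smallest element >= q; returns the t[0] sentinel (0) if none exists.
    uint32_t search(uint32_t q) const {
        size_t i = 1;
        while (i < t.size())
            i = 2 * i + (t[i] < q);      // branchless descent
        i >>= __builtin_ffsll(~i);       // cancel trailing right turns
        return t[i];
    }
};
```

For the input `{1, 3, 5, 7, 9, 11, 13, 15}`, `search(6)` returns 7, the lower bound of 6.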
SIMD Improvements:
- Manual SIMD implementation outperforms auto-vectorization
- AVX2 instructions compare eight 32-bit keys in a single operation
- Counting trailing zeros of the comparison mask with the TZCNT instruction efficiently yields the branch index
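The per-node SIMD step can be emulated in portable scalar code. This sketch (the name `node_rank` is illustrative; node width `b` is assumed to be at most 31) mirrors what a single AVX2 compare, a movemask, and TZCNT do together: find the first key in the node that is >= the query.

```cpp
#include <cassert>
#include <cstdint>

// Scalar emulation of the SIMD step. An AVX2 version would run one
// _mm256_cmpgt_epi32 over 8 keys, extract a bitmask with a movemask,
// and apply TZCNT; this loop mirrors that logic lane by lane.
inline unsigned node_rank(const uint32_t* node, unsigned b, uint32_t q) {
    unsigned mask = 0;
    for (unsigned i = 0; i < b; ++i)
        mask |= (unsigned)(node[i] >= q) << i;  // one mask bit per lane
    mask |= 1u << b;                  // sentinel bit: q exceeds all keys
    return __builtin_ctz(mask);       // index of first key >= q (TZCNT)
}
```

The sentinel bit makes the "query larger than every key in the node" case fall out of the same trailing-zero count instead of needing a branch.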
Memory Access Patterns:
- Cache line size (64 bytes) heavily influences performance
- Hugepages reduced TLB misses significantly
- Prefetching helps hide memory latency
- Node size B=15 provides optimal balance for modern architectures
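A hedged sketch of prefetch-assisted search over an array in the Eytzinger layout described above: prefetching `t[4 * i]` (the grandchildren) while examining node `i` is an illustrative distance only; the right distance depends on how many nodes share a cache line. `__builtin_prefetch` and `__builtin_ffsll` are GCC/Clang builtins.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Lower bound over an Eytzinger-layout array t (t[0] is a sentinel).
// The prefetch pulls in a descendant's cache line while the current
// node's comparison is still in flight, hiding part of the latency.
uint32_t search_prefetch(const std::vector<uint32_t>& t, uint32_t q) {
    size_t i = 1;
    while (i < t.size()) {
        __builtin_prefetch(&t[std::min(4 * i, t.size() - 1)]);
        i = 2 * i + (t[i] < q);      // branchless descent
    }
    i >>= __builtin_ffsll(~i);       // cancel trailing right turns
    return t[i];
}
```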
Batching Strategy:
- Processing multiple queries simultaneously improves throughput
- Interleaving queries reduces memory stalls
- Optimal batch size depends on input size and hardware characteristics
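Interleaving can be sketched as advancing a small batch of independent queries one tree level per loop iteration over an Eytzinger-layout array, so the memory loads of different queries overlap instead of serializing. The batch size and names here are illustrative, not the post's exact implementation.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Answer P queries against an Eytzinger-layout array t (t[0] sentinel),
// stepping every query down one level per outer iteration.
template <size_t P>
std::array<uint32_t, P> batch_search(const std::vector<uint32_t>& t,
                                     const std::array<uint32_t, P>& qs) {
    std::array<size_t, P> idx;
    idx.fill(1);
    bool active = true;
    while (active) {
        active = false;
        for (size_t k = 0; k < P; ++k)        // one level for each query
            if (idx[k] < t.size()) {
                idx[k] = 2 * idx[k] + (t[idx[k]] < qs[k]);
                active = true;
            }
    }
    std::array<uint32_t, P> out;
    for (size_t k = 0; k < P; ++k)
        out[k] = t[idx[k] >> __builtin_ffsll(~idx[k])];
    return out;
}
```

The inner loop issues one independent load per query, which gives the out-of-order core several cache misses to work on concurrently.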
Tree Layout Innovations:
- S+ trees store all data in bottom layer
- Prefix partitioning speeds up lookups on uniformly distributed keys
- Compact subtrees reduce memory footprint
- Left-tree layout improves locality for certain access patterns
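Prefix partitioning can be illustrated with a toy scheme (not necessarily the post's exact one): split the sorted keys into 2^K parts by their top K bits and record each part's start offset, so a query first jumps directly to its part and then searches only within it. The linear scan inside a part stands in for what would really be a static tree per part.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Toy prefix partitioning: 2^K parts keyed by the top K bits.
struct PrefixParts {
    static constexpr unsigned K = 4;          // 16 parts (illustrative)
    std::vector<uint32_t> keys;               // sorted keys
    uint32_t starts[(1u << K) + 1];           // part p spans [starts[p], starts[p+1])

    explicit PrefixParts(std::vector<uint32_t> sorted) : keys(std::move(sorted)) {
        uint32_t j = 0;
        for (unsigned p = 0; p <= (1u << K); ++p) {
            while (j < keys.size() && (keys[j] >> (32 - K)) < p) ++j;
            starts[p] = j;
        }
    }
    // Lower bound of q; returns 0 as a "not found" sentinel.
    uint32_t search(uint32_t q) const {
        unsigned p = q >> (32 - K);
        for (uint32_t j = starts[p]; j < starts[p + 1]; ++j)
            if (keys[j] >= q) return keys[j];
        // q exceeds every key in its part: the next part's first key wins.
        return starts[p + 1] < keys.size() ? keys[starts[p + 1]] : 0;
    }
};
```

With uniform keys the parts have similar sizes and each lookup skips most of the data; with skewed keys most of the data lands in a few parts, which is why the benefit diminishes.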
Implementation Details:
- Uses 32-bit unsigned integers for values
- Branchless implementation reduces pipeline stalls
- Pointer arithmetic optimizations reduce instruction count
- Memory alignment considerations affect performance
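The branchless idea also applies to the plain-array baseline. This is the classic branchless lower bound sketch: the conditional pointer update typically compiles to a conditional move, so a mispredicted comparison never flushes the pipeline.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Branchless lower bound over a sorted array of n > 0 elements.
// Returns a pointer to the first element >= q, or a + n if none.
const uint32_t* branchless_lower_bound(const uint32_t* a, size_t n,
                                       uint32_t q) {
    const uint32_t* base = a;
    while (n > 1) {
        size_t half = n / 2;
        base += (base[half - 1] < q) * half;  // usually a cmov, no branch
        n -= half;
    }
    return base + (*base < q);
}
```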
Limitations and Trade-offs:
- Performance becomes memory-bound with multiple threads
- Prefix partitioning benefits diminish with skewed input
- Additional memory requirements can be significant for larger inputs
- The structure is static: inserts and deletes are not supported
Future Optimization Possibilities:
- Interpolation search could reduce iterations from $O(\log n)$ to $O(\log \log n)$
- Packing data into 16-bit values could increase the branching factor
- Query sorting could improve cache utilization
- Range queries could be optimized for specific use cases