Performance exercise: Minimum

by Andrey Akinshin · 2016-12-20

Performance is tricky, especially when you are working with very fast operations. In today’s benchmarking exercise, we will try to measure the performance of two simple methods which calculate the minimum of two numbers. Sounds easy? Ok, let’s do it, here are our guinea pigs for today:

int MinTernary(int x, int y)  => x < y ? x : y;
int MinBitHacks(int x, int y) => x & ((x - y) >> 31) | y & (~(x - y) >> 31);

And here are some results:

                    Random                Const
               Ternary  BitHacks     Ternary  BitHacks
LegacyJIT-x86   ≈643µs    ≈227µs      ≈160µs    ≈226µs
LegacyJIT-x64   ≈450µs    ≈123µs       ≈68µs    ≈123µs
RyuJIT-x64      ≈594µs    ≈241µs      ≈180µs    ≈241µs
Mono-x64        ≈203µs    ≈283µs      ≈204µs    ≈282µs

What’s going on here? Let’s discuss it in detail.

Bit hacks

The first implementation looks obvious, but it has one significant problem: it could suffer from branch mispredictions because of a condition in the expression. Fortunately, it is possible to rewrite it without a branch with the help of bit hacks:

int MinBitHacks(int x, int y) => x & ((x - y) >> 31) | y & (~(x - y) >> 31);

Here we calculate (x - y); the sign of this expression depends on which number is less. Then (x - y) >> 31 gives a bit mask that consists only of zeros or only of ones. Next, we calculate the inverted mask: ~(x - y) >> 31. Now we and each operand with the corresponding bit mask (the minimum number gets the 11...11 mask). That’s all: the or operator combines the two parts and returns the correct result.

Here is an example for x=8 and y=3:
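
x                =  8  =  00000000 00000000 00000000 00001000
y                =  3  =  00000000 00000000 00000000 00000011
x - y            =  5  =  00000000 00000000 00000000 00000101
(x - y) >> 31    =  0  =  00000000 00000000 00000000 00000000   (mask for x)
~(x - y) >> 31   = -1  =  11111111 11111111 11111111 11111111   (mask for y)
x & ((x - y) >> 31)    = 0
y & (~(x - y) >> 31)   = 3
0 | 3 = 3 = Min(8, 3)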

As you can see, there is no branch here: we compute the minimum using only bit operations.
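
A quick way to gain confidence in the bit-hack version is to compare it against the ternary version on many inputs. Here is a minimal sketch of such a check (the CheckMinImplementations method is mine and is not part of the benchmark); note that the trick relies on x - y not overflowing, which holds for the non-negative values used here and in the benchmark below:

// Sanity check sketch: both implementations should agree as long as x - y
// does not overflow (true for the non-negative values produced by rnd.Next()).
void CheckMinImplementations()
{
    var rnd = new Random(42);
    for (int i = 0; i < 1000000; i++)
    {
        int x = rnd.Next(), y = rnd.Next();
        if (MinTernary(x, y) != MinBitHacks(x, y))
            throw new InvalidOperationException($"Mismatch for x={x}, y={y}");
    }
}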

Performance spaces

It’s wrong to discuss the performance of some operations in general; we should always think about the space of performance results (I call it the performance space). Simplifying a little, you have to consider the following:

  • Source code: there are different ways to write a benchmark that includes the Min methods. Today we will take two int arrays a and b with some data and calculate a third array int[] c where c[i] = Min(a[i], b[i]) for each i.
  • Data: we should always check different data patterns. In our case, we are analyzing the branch predictor, so it makes sense to check const and random input patterns.
  • Environment: there are many different environments; today we will check only the Full .NET Framework 4.6.2 (with 3 main JIT compilers: LegacyJIT-x86, LegacyJIT-x64, RyuJIT-x64) and Mono 4.6.2 on Windows.

Benchmarks

Here is the source code of my benchmarks (based on BenchmarkDotNet v0.10.1):

[LegacyJitX86Job, LegacyJitX64Job, RyuJitX64Job, MonoJob]
public class MinBench
{
    const int N = 100001;

    private int[] a, b, c;

    public enum StrategyKind
    {
        Const, Random
    }

    [Params(StrategyKind.Const, StrategyKind.Random)]
    public StrategyKind Strategy;

    [Setup]
    public void Setup()
    {
        a = new int[N];
        b = new int[N];
        c = new int[N];
        var rnd = new Random(42);
        for (int i = 0; i < N; i++)
        {
            switch (Strategy)
            {
                case StrategyKind.Const:
                {
                    a[i] = 42;
                    b[i] = 42;
                }
                    break;
                case StrategyKind.Random:
                {
                    a[i] = rnd.Next();
                    b[i] = rnd.Next();
                }
                    break;
            }
        }
    }

    [Benchmark]
    public void Ternary()
    {
        for (int i = 0; i < N; i++)
        {
            int x = a[i], y = b[i];
            c[i] = x < y ? x : y;
        }
    }

    [Benchmark]
    public void BitHacks()
    {
        for (int i = 0; i < N; i++)
        {
            int x = a[i], y = b[i];
            c[i] = x & ((x - y) >> 31) | y & (~(x - y) >> 31);
        }
    }
}
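
For completeness, here is a minimal sketch of an entry point that runs this class (the Program class below is illustrative, not part of the benchmark itself):

using BenchmarkDotNet.Running;

class Program
{
    // Runs all benchmarks defined in MinBench for every configured job.
    static void Main()
    {
        BenchmarkRunner.Run<MinBench>();
    }
}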

Raw results:

BenchmarkDotNet=v0.10.1, OS=Microsoft Windows NT 6.2.9200.0
Processor=Intel(R) Core(TM) i7-4702MQ CPU 2.20GHz, ProcessorCount=8
Frequency=2143475 Hz, Resolution=466.5321 ns, Timer=TSC
  [Host]       : Clr 4.0.30319.42000, 32bit LegacyJIT-v4.6.1586.0
  LegacyJitX64 : Clr 4.0.30319.42000, 64bit LegacyJIT/clrjit-v4.6.1586.0;compatjit-v4.6.1586.0
  LegacyJitX86 : Clr 4.0.30319.42000, 32bit LegacyJIT-v4.6.1586.0
  Mono         : Mono 4.6.2 (Visual Studio built mono), 64bit
  RyuJitX64    : Clr 4.0.30319.42000, 64bit RyuJIT-v4.6.1586.0
   Method           Job  Strategy         Mean     StdDev
--------- ------------- --------- ------------ ----------
  Ternary  LegacyJitX86    Random  643.9113 us  2.8095 us
 BitHacks  LegacyJitX86    Random  227.1344 us  1.3270 us
  Ternary  LegacyJitX86     Const  160.0779 us  0.9276 us
 BitHacks  LegacyJitX86     Const  225.7077 us  0.7597 us
  Ternary  LegacyJitX64    Random  450.7977 us  1.1618 us
 BitHacks  LegacyJitX64    Random  123.3894 us  0.3052 us
  Ternary  LegacyJitX64     Const   68.6997 us  0.7440 us
 BitHacks  LegacyJitX64     Const  123.0449 us  0.7931 us
  Ternary     RyuJitX64    Random  594.5310 us  1.1537 us
 BitHacks     RyuJitX64    Random  241.1466 us  1.1446 us
  Ternary     RyuJitX64     Const  179.7262 us  0.4236 us
 BitHacks     RyuJitX64     Const  240.8385 us  0.7296 us
  Ternary          Mono    Random  203.6173 us  1.7580 us
 BitHacks          Mono    Random  283.5624 us  2.2254 us
  Ternary          Mono     Const  204.5277 us  1.5814 us
 BitHacks          Mono     Const  282.5178 us  1.9491 us

Results in a nice form:

                    Random                Const
               Ternary  BitHacks     Ternary  BitHacks
LegacyJIT-x86   ≈643µs    ≈227µs      ≈160µs    ≈226µs
LegacyJIT-x64   ≈450µs    ≈123µs       ≈68µs    ≈123µs
RyuJIT-x64      ≈594µs    ≈241µs      ≈180µs    ≈241µs
Mono-x64        ≈203µs    ≈283µs      ≈204µs    ≈282µs

Analysis

First, let’s look at the right part of the table with the const input. Here Ternary is always faster than BitHacks because it compiles to fewer instructions and the branch predictor works perfectly. For the random input on the Full .NET Framework, BitHacks is faster for all JIT compilers. And there is an explanation for this: Ternary has a big performance penalty because it’s hard to predict the correct branch. The performance of BitHacks is the same for both input patterns, which also makes sense: this method doesn’t depend on the branch predictor. However, we can also make a few interesting observations about Mono:

  • Ternary works faster than BitHacks even on the random input.
  • The Mono version of Ternary on the random input works much faster than the same code on the Full .NET Framework.
  • Mono shows the same performance for Ternary and BitHacks for both input patterns.

How is it possible? Let’s look at the asm. Here is the asm code for RyuJIT-x64:

; RyuJIT-x64
cmp       ecx,edx         ; check x < y
jl        LESS
mov       eax,edx         ; return y
ret
LESS:
mov       eax,ecx         ; return x
ret

It looks very simple. How is it possible to rewrite this code and make it faster? Let’s think. The bottleneck here is the jl instruction, which has a significant penalty because of the high misprediction rate. Is it possible to rewrite the code without conditional jumps? Yes! Conditional move to the rescue!

; Mono4.6.2-x64
sub       $0x18,%rsp
mov       %rsi,(%rsp)
mov       %rdi,0x8(%rsp)
mov       %rcx,%rdi
mov       %rdx,%rsi
cmp       %esi,%edi
mov       %rsi,%rax
cmovl     %rdi,%rax       ; Move if less (SF<>OF).
mov       (%rsp),%rsi
mov       0x8(%rsp),%rdi
add       $0x18,%rsp
retq

Here Mono uses the cmovl instruction (0F4C). So, it will not suffer from branch mispredictions because there is no branch at the asm level (even though we have a condition in the source C# code).
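
As a side note, newer versions of BenchmarkDotNet ship a DisassemblyDiagnoser that can export the JIT-generated assembly for each benchmark, which makes observations like the one above easier to reproduce; it is not available in the v0.10.1 used here. A hedged sketch:

// Sketch only: assumes a newer BenchmarkDotNet version that ships DisassemblyDiagnoser
// (not available in v0.10.1). The attribute asks BenchmarkDotNet to export the
// generated asm for each benchmark method.
[DisassemblyDiagnoser]
[LegacyJitX86Job, LegacyJitX64Job, RyuJitX64Job, MonoJob]
public class MinBench
{
    // ... same benchmark methods as above ...
}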

Conclusion

Thus, we can’t draw a general conclusion about the performance of MinTernary and MinBitHacks. It’s impossible to say which implementation is better for your problem without additional measurements because it depends on different conditions (like the input data pattern and the target runtime).

Is it a complete performance investigation? Of course not. We are missing something in each component of our performance space:

  • Source: We considered synthetic methods which calculate minimums in a specific way. We can easily make small changes that significantly affect the results.
  • Data: We checked only two input patterns: a const pattern and a random pattern with a particular seed. If we take another input array (e.g. some data from real life), we will get a different performance picture.
  • Environment: We didn’t check Linux and macOS, CoreCLR, different versions of Mono, and so on. Also, we checked only an Intel Core i7 Haswell; other processor microarchitectures have their own branch-predictor characteristics, and the branch predictor is the bottleneck in this benchmark.

However, we got a good result: we have shown a part of the performance space for the target operations, and we have found relevant conditions which can affect the speed of our program. Note that it was just a benchmarking exercise; I wanted to show that there are a lot of pitfalls in benchmarking even in this simple case. Be careful with your performance measurements.