Performance exercise: Division

Andrey Akinshin · 2016-12-26

In the previous post, we discussed the performance space of the minimum function which was implemented via a simple ternary operator and with the help of bit magic. Now we continue to talk about performance and bit hacks. In particular, we will divide a positive number by three:

uint Div3Simple(uint n)   => n / 3;
uint Div3BitHacks(uint n) => (uint)((n * (ulong)0xAAAAAAAB) >> 33);

As usual, it’s hard to say which method is faster in advanced because the performance depends on the environment. Here are some interesting results:

	Simple	BitHacks
LegacyJIT-x86	≈8.3ns	≈2.6ns
LegacyJIT-x64	≈2.6ns	≈1.7ns
RyuJIT-x64	≈6.9ns	≈1.5ns
Mono4.6.2-x86	≈8.5ns	≈14.4ns
Mono4.6.2-x64	≈8.3ns	≈2.8ns

Let’s try to understand why we could have such results. Note that it is just an exercise: we will not discuss which JIT or runtime is better, how we should write production code, and so on. Our main goal is to understand which kind of pitfalls we should expect during benchmarking and why it’s very important to check different environments.

Bit hacks

The first method (n / 3) looks very obvious. However, it includes an integer division which is an expensive operation. Fortunately, there is an old hack which allows replacing “Division by a constant” by “multiplication + bit shift”.
The basic idea is the following: we should hide the division inside a constant. In our case, 0xAAAAAAAB = RoundUp(2^33 / 3). Thus, we multiply n by 2^33 / 3 and divide it by 2^33 (>> 33). The result is equal to n/3.

Benchmarks

It’s always hard to benchmark tiny methods. There are many different ways to benchmark division, each of them will show own results. Today we will try to measure latency of our operations. It means that the next operation in the loop should be started only after the previous operation finished. So, we create a dependency chain that use the result of previous operation as an input for the next one. Such construction helps to prevent instruction-level parallelism. As usual, we design a benchmark with the help of BenchmarkDotNet:

[LegacyJitX86Job, LegacyJitX64Job, RyuJitX64Job, MonoJob]
public class DivBench
{
    private uint x = 1, initialValue = uint.MaxValue;

    [Benchmark(OperationsPerInvoke = 32)]
    public void Simple()
    {
        x = initialValue;
        x = x / 3;
        x = x / 3;
        x = x / 3;
        x = x / 3;
        x = x / 3;
        x = x / 3;
        x = x / 3;
        x = x / 3;
        x = x / 3;
        x = x / 3;
        x = x / 3;
        x = x / 3;
        x = x / 3;
        x = x / 3;
        x = x / 3;
        x = x / 3;
        x = x / 3;
        x = x / 3;
        x = x / 3;
        x = x / 3;
        x = x / 3;
        x = x / 3;
        x = x / 3;
        x = x / 3;
        x = x / 3;
        x = x / 3;
        x = x / 3;
        x = x / 3;
        x = x / 3;
        x = x / 3;
        x = x / 3;
        x = x / 3;
    }

    [Benchmark(OperationsPerInvoke = 32)]
    public void BitHacks()
    {
        x = initialValue;
        x = (uint) ((x * (ulong) 0xAAAAAAAB) >> 33);
        x = (uint) ((x * (ulong) 0xAAAAAAAB) >> 33);
        x = (uint) ((x * (ulong) 0xAAAAAAAB) >> 33);
        x = (uint) ((x * (ulong) 0xAAAAAAAB) >> 33);
        x = (uint) ((x * (ulong) 0xAAAAAAAB) >> 33);
        x = (uint) ((x * (ulong) 0xAAAAAAAB) >> 33);
        x = (uint) ((x * (ulong) 0xAAAAAAAB) >> 33);
        x = (uint) ((x * (ulong) 0xAAAAAAAB) >> 33);
        x = (uint) ((x * (ulong) 0xAAAAAAAB) >> 33);
        x = (uint) ((x * (ulong) 0xAAAAAAAB) >> 33);
        x = (uint) ((x * (ulong) 0xAAAAAAAB) >> 33);
        x = (uint) ((x * (ulong) 0xAAAAAAAB) >> 33);
        x = (uint) ((x * (ulong) 0xAAAAAAAB) >> 33);
        x = (uint) ((x * (ulong) 0xAAAAAAAB) >> 33);
        x = (uint) ((x * (ulong) 0xAAAAAAAB) >> 33);
        x = (uint) ((x * (ulong) 0xAAAAAAAB) >> 33);
        x = (uint) ((x * (ulong) 0xAAAAAAAB) >> 33);
        x = (uint) ((x * (ulong) 0xAAAAAAAB) >> 33);
        x = (uint) ((x * (ulong) 0xAAAAAAAB) >> 33);
        x = (uint) ((x * (ulong) 0xAAAAAAAB) >> 33);
        x = (uint) ((x * (ulong) 0xAAAAAAAB) >> 33);
        x = (uint) ((x * (ulong) 0xAAAAAAAB) >> 33);
        x = (uint) ((x * (ulong) 0xAAAAAAAB) >> 33);
        x = (uint) ((x * (ulong) 0xAAAAAAAB) >> 33);
        x = (uint) ((x * (ulong) 0xAAAAAAAB) >> 33);
        x = (uint) ((x * (ulong) 0xAAAAAAAB) >> 33);
        x = (uint) ((x * (ulong) 0xAAAAAAAB) >> 33);
        x = (uint) ((x * (ulong) 0xAAAAAAAB) >> 33);
        x = (uint) ((x * (ulong) 0xAAAAAAAB) >> 33);
        x = (uint) ((x * (ulong) 0xAAAAAAAB) >> 33);
        x = (uint) ((x * (ulong) 0xAAAAAAAB) >> 33);
        x = (uint) ((x * (ulong) 0xAAAAAAAB) >> 33);
    }
}

Note that x is a field here. Here is a little homework exercise for you: check out what happens, if we declare x as a local variable.

Another note: BenchmarkDotNet doesn’t allow to run a benchmark against two different version of Mono at the same time (hopefully, will be fixed soon). So, I launched it twice and combined output manually. Raw results on my laptop:

BenchmarkDotNet=v0.10.1, OS=Microsoft Windows NT 6.2.9200.0
Processor=Intel(R) Core(TM) i7-4702MQ CPU 2.20GHz, ProcessorCount=8
Frequency=2143475 Hz, Resolution=466.5321 ns, Timer=TSC
  LegacyJitX86 : Clr 4.0.30319.42000, 32bit LegacyJIT-v4.6.1586.0
  LegacyJitX64 : Clr 4.0.30319.42000, 64bit LegacyJIT/clrjit-v4.6.1586.0;compatjit-v4.6.1586.0
  RyuJitX64    : Clr 4.0.30319.42000, 64bit RyuJIT-v4.6.1586.0
  MonoX86      : Mono 4.6.2 (Visual Studio built mono), 32bit
  MonoX64      : Mono 4.6.2 (Visual Studio built mono), 64bit

Method	Job	Mean	StdDev
Simple	LegacyJitX86	8.3387 ns	0.1021 ns
BitHacks	LegacyJitX86	2.5719 ns	0.0049 ns
Simple	LegacyJitX64	2.5649 ns	0.0264 ns
BitHacks	LegacyJitX64	1.6595 ns	0.0136 ns
Simple	RyuJitX64	6.8538 ns	0.0352 ns
BitHacks	RyuJitX64	1.5413 ns	0.0138 ns
Simple	MonoX86	8.4779 ns	0.0136 ns
BitHacks	MonoX86	14.4434 ns	0.0935 ns
Simple	MonoX64	8.2883 ns	0.0590 ns
BitHacks	MonoX64	2.8300 ns	0.0101 ns

A nice form:

	Simple	BitHacks
LegacyJIT-x86	≈8.3ns	≈2.6ns
LegacyJIT-x64	≈2.6ns	≈1.7ns
RyuJIT-x64	≈6.9ns	≈1.5ns
Mono4.6.2-x86	≈8.5ns	≈14.4ns
Mono4.6.2-x64	≈8.3ns	≈2.8ns

Analysis

In the table, you can observe two performance mysteries:

LegacyJIT-x64+Simple: it works suspiciously quickly (the same order as in the BitHacks case).
Mono4.6.2-x86+BitHacks: it works slowly (BitHacks made the situation worse).

In other cases, BitHacks works 3–4 times faster than Simple which is nice. Now let’s look at the asm code of both methods in all cases. The total listings are huge (because of the 32 same operations in each method), so we will look only at a single iteration.

LegacyJIT-x86

; Simple
mov         eax,dword ptr ds:[0089C8FCh]  
mov         ecx,3  
xor         edx,edx  
div         eax,ecx  
mov         dword ptr ds:[0089C8FCh],eax

; BitHacks
mov         eax,dword ptr ds:[0089C8FCh]  
mov         edx,0AAAAAAABh  
mul         eax,edx  
mov         eax,edx  
shr         eax,1  
xor         edx,edx  
mov         dword ptr ds:[0089C8FCh],eax

As usual, LegacyJIT-x86 generates really simple code, no magic here.

RyuJIT-x64

; Simple
mov         ecx,3 ; Only before the first iteration  
xor         edx,edx  
div         eax,ecx  
mov         dword ptr [00007FFEEDB2338Ch],eax

; BitHacks
and         eax,0FFFFFFFFh  
imul        rax,rdx  
shr         rax,21h  
and         eax,0FFFFFFFFh  
mov         dword ptr [00007FFEEDB2338Ch],eax

RyuJIT-x64 also generates simple code, but it’s a little smarter than LegacyJIT-x86. Here we keep the result in eax and only flush it to memory at the end of an iteration. LegacyJIT-x86 also loads value from memory to eax at the start of the iteration.

LegacyJIT-x64

; Simple
mov         ecx,dword ptr [7FFEEDB2338Ch]  
mov         eax,0AAAAAAABh  
mul         eax,ecx  
shr         edx,1  
mov         dword ptr [7FFEEDB2338Ch],edx

; BitHacks
mov         eax,dword ptr [7FFEEDB2338Ch]  
mov         ecx,0AAAAAAABh  
imul        rax,rcx  
shr         rax,21h  
mov         dword ptr [7FFEEDB2338Ch],eax

LegacyJIT-x64 is smart enough to replace division by multiplication himself. Therefore LegacyJIT-x64+Simple works fast (the first mystery solved).

Mono4.6.2-x64

<Simple>:
1c:  movabs $0x1cbc89a1880,%rax
23: 
26:  mov    (%rax),%eax
28:  mov    $0x3,%ecx
2d:  xor    %rdx,%rdx
30:  div    %ecx
32:  mov    %rax,%rcx
35:  movabs $0x1cbc89a1880,%rax
3c: 
3f:  mov    %ecx,(%rax)

<BitHacks>:
1c:  movabs $0x1cbc89a1880,%rax
23:
26:  mov    (%rax),%eax
28:  mov    %eax,%ecx
2a:  mov    $0xaaaaaaab,%eax
2f:  imul   %rax,%rcx
33:  shr    $0x21,%rcx
37:  shr    $0x0,%ecx
3a:  movabs $0x1cbc89a1880,%rax
41:
44:  mov    %ecx,(%rax)

The asm code produced by Mono4.6.2-x64 doesn’t look so smart, but it’s still simple. The performance results are almost the same as in the LegacyJIT-x86 and RyuJIT-x64 cases.

Mono4.6.2-x86

<Simple>:
 d:  mov    0x5dc638(%rip),%eax        # 5dc64b <Simple+0x5dc64b>
13:  mov    $0x3,%ecx
18:  xor    %edx,%edx
1a:  div    %ecx
1c:  mov    %eax,%ecx
1e:  mov    $0x5dc638,%eax
23:  mov    %ecx,(%rax)

For some reasons, it’s not easy to show only one iteration from the original methods. Therefore we look at the full listings of the following one-iteration version of the BitHacks benchmark:

public void BitHacks()
{
    x = initialValue;
    x = (uint)((x * (ulong)0xAAAAAAAB) >> 33);
}

And here is the listing:

<BitHacks>:
 0:  push   %rbp
 1:  mov    %esp,%ebp
 3:  push   %rdi
 4:  sub    $0x34,%esp
 7:  mov    0x8(%rbp),%eax
 a:  mov    %eax,-0x10(%rbp)
 d:  movl   $0x0,-0xc(%rbp)
14:  movl   $0xaaaaaaab,-0x18(%rbp)
1b:  movl   $0x0,-0x14(%rbp)
22:  mov    -0x14(%rbp),%eax
25:  mov    %eax,0xc(%rsp)
29:  mov    -0x18(%rbp),%eax
2c:  mov    %eax,0x8(%rsp)
30:  mov    -0xc(%rbp),%eax
33:  mov    %eax,0x4(%rsp)
37:  mov    -0x10(%rbp),%eax
3a:  mov    %eax,(%rsp)
3d:  callq  c2b9cee <BitHacks+0xc2b9cee>
42:  mov    %edx,-0xc(%rbp)
45:  mov    %eax,-0x10(%rbp)
48:  mov    0x10378794(%rip),%eax        # 103787e2 <BitHacks+0x103787e2>
4e:  mov    -0x10(%rbp),%ecx
51:  mov    %ecx,-0x18(%rbp)
54:  mov    -0xc(%rbp),%ecx
57:  mov    %ecx,-0x14(%rbp)
5a:  test   %eax,%eax
5c:  jne    a0 <BitHacks+0xa0>
5e:  jmp    6c <BitHacks+0x6c>
60:  mov    -0x20(%rbp),%eax
63:  mov    %eax,-0x18(%rbp)
66:  mov    -0x1c(%rbp),%eax
69:  mov    %eax,-0x14(%rbp)
6c:  mov    -0x18(%rbp),%eax
6f:  mov    %eax,-0x10(%rbp)
72:  mov    -0x14(%rbp),%eax
75:  mov    %eax,-0xc(%rbp)
78:  mov    -0xc(%rbp),%eax
7b:  shr    %eax
7d:  lea    -0x4(%rbp),%esp
80:  lea    -0x4(%rbp),%esp
83:  pop    %rdi
84:  leaveq 
85:  retq   
86:  sub    $0xc,%esp
89:  push   %rdi
8a:  nop
8b:  callq  fffffffffe9d6110 <BitHacks+0xfffffffffe9d6110>
90:  movl   $0x0,-0x10(%rbp)
97:  movl   $0x0,-0xc(%rbp)
9e:  jmp    60 <BitHacks+0x60>
a0:  lea    0x0(%rbp),%ebp
a3:  callq  fffffffffffc6ae4 <BitHacks+0xfffffffffffc6ae4>
a8:  mov    %eax,%ecx
aa:  mov    -0x18(%rbp),%edx
ad:  mov    %edx,-0x20(%rbp)
b0:  mov    -0x14(%rbp),%edx
b3:  mov    %edx,-0x1c(%rbp)
b6:  mov    %ecx,%edi
b8:  test   %eax,%eax
ba:  jne    86 <BitHacks+0x86>
bc:  jmp    60 <BitHacks+0x60>

It doesn’t look as simple as previous listings. Mono4.6.2-x86 generates a huge amount of instructions which explains pure performance (the second mystery solved). A detailed description of the generated code is beyond the scope of this post, but you can do it yourself as an another homework. (Hint: check out how Mono4.6.2-x86 works with the uint*ulong operation.)

Conclusion

If you care about the performance of integer division, you should also care about your runtime and JIT compiler. LegacyJIT-x64 can apply the described optimization himself and produce fast code. If you wrote such optimization manually, Mono4.6.2-x86 will run it slowly because it has some troubles with the 64bit multiplication.

Please, don’t make general conclusions about .NET, Mono, or a particular JIT compiler from this post. Main idea of my performance exercises is to show how hard microbenchmarking can be. If you are trying to measure operations which take nanoseconds, you should think about all the target environments.