## Unrolling of small loops in different JIT versions

Challenge of the day: what will the following code display?

``````struct Point
{
public int X;
public int Y;
}
static void Print(Point p)
{
Console.WriteLine(p.X + " " + p.Y);
}
static void Main()
{
var p = new Point();
for (p.X = 0; p.X < 2; p.X++)
Print(p);
}
``````

The right answer: it depends. There is a bug in CLR2 JIT-x86 which spoil this wonderful program. This story is about optimization that called unrolling of small loops. This is a very interesting theme, let's discuss it in detail.

### Experiment 1 (Sum)

In the previous post, I told how loop unrolling works. There is a special case of this optimization: unrolling of small loops: if you have small amount of iterations, JIT can completely eliminate the loop by repeating its body several times. Let's discuss it with the following simple code:

``````int sum = 0;
for (int i = 0; i < 4; i++)
sum += i;
Console.WriteLine(sum);
``````

#### JIT-x86

Let's run the code with CLR4 + JIT-x86:

``````        int sum = 0;
0125291A  in          al,dx
0125291B  push        esi
0125291C  xor         esi,esi
sum += i;
0125291E  inc         esi                      ; sum += 1
0125291F  inc         esi                      ; sum += 2 (Part 1)
01252920  inc         esi                      ; sum += 2 (Part 2)
01252921  add         esi,3                    ; sum += 3
Console.WriteLine(sum);
01252924  call        72EE0258
01252929  mov         ecx,eax
0125292B  mov         edx,esi
0125292D  mov         eax,dword ptr [ecx]
0125292F  mov         eax,dword ptr [eax+38h]
01252932  call        dword ptr [eax+14h]
01252935  pop         esi
01252936  pop         ebp
01252937  ret
``````

As we can see, there are no branches in the program, the loop was completely eliminate. Instead of the loop, we have special instructions that calculate a value of the `sum` variable in the `esi` register.

#### JIT-x64

Next, CLR4 + JIT-x64:

``````        int sum = 0;
00007FFCC86F3EC0  sub         rsp,28h
Console.WriteLine(sum);
00007FFCC86F3EC4  mov         ecx,6             ; sum = 6
00007FFCC86F3EC9  call        00007FFD273DCF10
00007FFCC86F3ECE  nop
00007FFCC86F3ED3  ret
``````

JIT-x64 have done wonderful job:: he guessed that `0+1+2+3=6`. As a result, `6` displays without any arithmetic operations for its calculation.

#### RyuJIT CTP5

Next, CLR4 + RyuJIT CTP5:

``````        int sum = 0;
00007FFCC8713A02  sub         esp,20h
00007FFCC8713A05  xor         esi,esi
for (int i = 0; i < 4; i++)
00007FFCC8713A07  xor         eax,eax
sum += i;
for (int i = 0; i < 4; i++)
00007FFCC8713A0B  inc         eax
00007FFCC8713A0D  cmp         eax,4
00007FFCC8713A10  jl          00007FFCC8713A09
Console.WriteLine(sum);
00007FFCC8713A12  call        00007FFD26C0AFA0
00007FFCC8713A17  mov         rcx,rax
00007FFCC8713A1A  mov         edx,esi
00007FFCC8713A1C  mov         rax,qword ptr [rax]
00007FFCC8713A1F  mov         rax,qword ptr [rax+60h]
00007FFCC8713A23  call        qword ptr [rax+28h]
00007FFCC8713A26  nop
00007FFCC8713A2B  pop         rsi
00007FFCC8713A2C  ret
``````

Hey, RyuJIT, what is wrong with you? The optimization looks very simple, but you did not it! Well, we hope that you will learn it in the final release.

### Experiment 2 (Point)

Now, let's back to the first code snippet:

``````struct Point
{
public int X;
public int Y;
}

static void Print(Point p)
{
Console.WriteLine(p.X + " " + p.Y);
}

static void Main()
{
var p = new Point();
for (p.X = 0; p.X < 2; p.X++)
Print(p);
}
``````

Logic suggests us that the following output will be display:

``````0 0
1 0
``````

Let's check it!

#### CLR2 + JIT-x86

Let's start with the configuration in which there is a bug: CLR2 + JIT-x86. In this case, the following output will be display:

``````2 0
2 0
``````

Open the assembler code:

``````        var p = new Point();
05C5178C  push        esi
05C5178D  xor         esi,esi                 ; p.Y = 0
for (p.X = 0; p.X < 2; p.X++)
05C5178F  lea         edi,[esi+2]             ; p.X = 2
Print(p);
05C51792  push        esi                     ; push p.Y
05C51793  push        edi                     ; push p.X
05C51794  call        dword ptr ds:[54607F4h] ; Print(p)
05C5179A  push        esi                     ; push p.Y
05C5179B  push        edi                     ; push p.X
05C5179C  call        dword ptr ds:[54607F4h] ; Print(p)
05C517A2  pop         esi
05C517A3  pop         edi
05C517A4  pop         ebp
05C517A5  ret
``````

As we can see, loop unrolling have been performed, but the optimization has an error: variable `x` immediately takes the value`2` and never changing. This bug has long been known, it have been discussed in the question on StackOverflow [.NET JIT potential error?] (Http://stackoverflow.com/q/2056948/184842).

Let's check other JIT versions.

#### CLR4 + JIT-x86

``````        var p = new Point();
01392620  push        ebp
01392621  mov         ebp,esp
01392623  push        edi
01392624  push        esi
01392625  xor         edi,edi                  ; p.Y = 0
for (p.X = 0; p.X < 2; p.X++)
01392627  xor         esi,esi                  ; p.X = 0
Print(p);
01392629  push        edi                      ; push p.Y
0139262A  push        esi                      ; push p.X
0139262B  call        dword ptr ds:[2DB2108h]  ; Print(p)
for (p.X = 0; p.X < 2; p.X++)
01392631  inc         esi                      ; p.X++
01392632  cmp         esi,2
01392635  jl          01392629
01392637  pop         esi
01392638  pop         edi
01392639  pop         ebp
0139263A  ret
``````

Microsoft has fixed the bug in CLR4. Unfortunately we don't have loop unrolling in the new version of the code. It works correctly, though not so fast.

#### CLR2 + JIT-x64

``````        var p = new Point();
00007FFCB94A3502  in          al,dx
00007FFCB94A3503  cmp         byte ptr [rbx],dh
00007FFCB94A3505  ror         byte ptr [rax-77h],44h
00007FFCB94A3509  and         al,20h
00007FFCB94A350B  xor         eax,eax
00007FFCB94A350D  mov         qword ptr [rsp+20h],rax
for (p.X = 0; p.X < 2; p.X++)
00007FFCB94A3512  mov         dword ptr [rsp+20h],0
00007FFCB94A351A  mov         eax,dword ptr [rsp+20h]
00007FFCB94A351E  cmp         eax,2
00007FFCB94A3521  jge         00007FFCB94A3544
Print(p);
00007FFCB94A3523  mov         rcx,qword ptr [rsp+20h]
00007FFCB94A3528  call        00007FFCB936C868
for (p.X = 0; p.X < 2; p.X++)
00007FFCB94A352D  mov         r11d,dword ptr [rsp+20h]
00007FFCB94A3536  mov         dword ptr [rsp+20h],r11d
00007FFCB94A353B  mov         eax,dword ptr [rsp+20h]
00007FFCB94A353F  cmp         eax,2
00007FFCB94A3542  jl          00007FFCB94A3523
00007FFCB94A3548  rep ret
``````

Although this code works correctly, but it is very bad: loop unrolling haven't been applying, and we have big amount of unnecessary instruction with sending values between stack and registers. Let's see what has changed in CLR4.

#### CLR4 + JIT-x64

``````        var p = new Point();
00007FFCC8703EC2  sub         esp,30h
00007FFCC8703EC5  mov         qword ptr [rsp+20h],0
for (p.X = 0; p.X < 2; p.X++)
00007FFCC8703ECE  xor         ebx,ebx
00007FFCC8703ED0  mov         dword ptr [rsp+20h],ebx
00007FFCC8703ED4  cmp         ebx,2
00007FFCC8703ED7  jge         00007FFCC8703EF5
00007FFCC8703ED9  nop         dword ptr [rax]
Print(p);
00007FFCC8703EE0  mov         rcx,qword ptr [rsp+20h]
00007FFCC8703EE5  call        00007FFCC85EC8E0
for (p.X = 0; p.X < 2; p.X++)
00007FFCC8703EEA  inc         ebx
00007FFCC8703EEC  mov         dword ptr [rsp+20h],ebx
00007FFCC8703EF0  cmp         ebx,2
00007FFCC8703EF3  jl          00007FFCC8703EE0
00007FFCC8703EF9  pop         rbx
00007FFCC8703EFA  ret
``````

We still don't have loop unrolling. However, code has become much better in comparison with CLR2 + JIT-x64. Note, the sending the `Point` structure to the `Print` method implements via 64-bit register instead of two 32-bit values on the stack in JIT-x86.

#### CLR4 + RyuJIT CTP5

Next, let's try RyuJIT CTP5:

``````        var p = new Point();
00007FFCC8723A02  sub         rsp,28h
00007FFCC8723A06  xor         esi,esi
for (p.X = 0; p.X < 2; p.X++)
00007FFCC8723A08  xor         edi,edi
Print(p);
00007FFCC8723A0A  lea         rcx,[rsp+20h]
00007FFCC8723A0F  mov         dword ptr [rcx],edi
00007FFCC8723A11  mov         dword ptr [rcx+4],esi
00007FFCC8723A14  mov         rcx,qword ptr [rsp+20h]
00007FFCC8723A19  call        00007FFCC860C8E0
for (p.X = 0; p.X < 2; p.X++)
00007FFCC8723A1E  inc         edi
00007FFCC8723A20  cmp         edi,2
00007FFCC8723A23  jl          00007FFCC8723A0A
00007FFCC8723A29  pop         rsi
00007FFCC8723A2A  pop         rdi
00007FFCC8723A2B  ret
``````

We don't have loop unrolling. However, code looks a little cleaner with comparison with CLR4 + JIT-x64.

### Summary

We show the results of the experiments in the following table:

Experiment CLR JIT Results
Sum 4 x86 Loop unrolling
Sum 4 x64 Value precalculation
Sum 4 RuyJIT Loop unrolling is absence
Point 2 x86 Loop unrolling with the bug
Point 4 x86 Loop unrolling is absence
Point 2 x64 Loop unrolling is absence, poor code quality
Point 4 x64 Loop unrolling is absence, medium code quality
Point 4 RyuJIT Loop unrolling is absence, good code quality
Share:
You can find source code of this post on GitHub:
https://github.com/AndreyAkinshin/aakinshin.net/blob/master/en/_posts/dotnet/unrolling-of-small-loops-in-different-jit-versions.md