Unrolling of small loops in different JIT versions



Challenge of the day: what will the following code display?

struct Point
{
    public int X;
    public int Y;
}
static void Print(Point p)
{
    Console.WriteLine(p.X + " " + p.Y);
}
static void Main()
{
    var p = new Point();
    for (p.X = 0; p.X < 2; p.X++)
        Print(p);
}

The right answer: it depends. There is a bug in CLR2 JIT-x86 which spoil this wonderful program. This story is about optimization that called unrolling of small loops. This is a very interesting theme, let's discuss it in detail.

Experiment 1 (Sum)

In the previous post, I told how loop unrolling works. There is a special case of this optimization: unrolling of small loops: if you have small amount of iterations, JIT can completely eliminate the loop by repeating its body several times. Let's discuss it with the following simple code:

int sum = 0;
for (int i = 0; i < 4; i++)
    sum += i;
Console.WriteLine(sum);

JIT-x86

Let's run the code with CLR4 + JIT-x86:

        int sum = 0;                           
0125291A  in          al,dx                    
0125291B  push        esi                      
0125291C  xor         esi,esi                  
            sum += i;                          
0125291E  inc         esi                      ; sum += 1
0125291F  inc         esi                      ; sum += 2 (Part 1)
01252920  inc         esi                      ; sum += 2 (Part 2)
01252921  add         esi,3                    ; sum += 3
        Console.WriteLine(sum);                
01252924  call        72EE0258                 
01252929  mov         ecx,eax                  
0125292B  mov         edx,esi                  
0125292D  mov         eax,dword ptr [ecx]      
0125292F  mov         eax,dword ptr [eax+38h]  
01252932  call        dword ptr [eax+14h]      
01252935  pop         esi                      
01252936  pop         ebp                      
01252937  ret                                  

As we can see, there are no branches in the program, the loop was completely eliminate. Instead of the loop, we have special instructions that calculate a value of the sum variable in the esi register.

JIT-x64

Next, CLR4 + JIT-x64:

        int sum = 0;                            
00007FFCC86F3EC0  sub         rsp,28h           
        Console.WriteLine(sum);                 
00007FFCC86F3EC4  mov         ecx,6             ; sum = 6
00007FFCC86F3EC9  call        00007FFD273DCF10  
00007FFCC86F3ECE  nop                           
00007FFCC86F3ECF  add         rsp,28h           
00007FFCC86F3ED3  ret                           

JIT-x64 have done wonderful job:: he guessed that 0+1+2+3=6. As a result, 6 displays without any arithmetic operations for its calculation.

RyuJIT CTP5

Next, CLR4 + RyuJIT CTP5:

        int sum = 0;
00007FFCC8713A02  sub         esp,20h  
00007FFCC8713A05  xor         esi,esi  
        for (int i = 0; i < 4; i++)
00007FFCC8713A07  xor         eax,eax  
            sum += i;
00007FFCC8713A09  add         esi,eax  
        for (int i = 0; i < 4; i++)
00007FFCC8713A0B  inc         eax  
00007FFCC8713A0D  cmp         eax,4  
00007FFCC8713A10  jl          00007FFCC8713A09  
        Console.WriteLine(sum);
00007FFCC8713A12  call        00007FFD26C0AFA0  
00007FFCC8713A17  mov         rcx,rax  
00007FFCC8713A1A  mov         edx,esi  
00007FFCC8713A1C  mov         rax,qword ptr [rax]  
00007FFCC8713A1F  mov         rax,qword ptr [rax+60h]  
00007FFCC8713A23  call        qword ptr [rax+28h]  
00007FFCC8713A26  nop  
00007FFCC8713A27  add         rsp,20h  
00007FFCC8713A2B  pop         rsi  
00007FFCC8713A2C  ret  

Hey, RyuJIT, what is wrong with you? The optimization looks very simple, but you did not it! Well, we hope that you will learn it in the final release.

Experiment 2 (Point)

Now, let's back to the first code snippet:

struct Point
{
    public int X;
    public int Y;
}

static void Print(Point p)
{
    Console.WriteLine(p.X + " " + p.Y);
}

static void Main()
{
    var p = new Point();
    for (p.X = 0; p.X < 2; p.X++)
        Print(p);
}

Logic suggests us that the following output will be display:

0 0
1 0

Let's check it!

CLR2 + JIT-x86

Let's start with the configuration in which there is a bug: CLR2 + JIT-x86. In this case, the following output will be display:

2 0
2 0

Open the assembler code:

        var p = new Point();                  
05C5178C  push        esi                     
05C5178D  xor         esi,esi                 ; p.Y = 0
        for (p.X = 0; p.X < 2; p.X++)         
05C5178F  lea         edi,[esi+2]             ; p.X = 2
            Print(p);                         
05C51792  push        esi                     ; push p.Y
05C51793  push        edi                     ; push p.X
05C51794  call        dword ptr ds:[54607F4h] ; Print(p)
05C5179A  push        esi                     ; push p.Y
05C5179B  push        edi                     ; push p.X
05C5179C  call        dword ptr ds:[54607F4h] ; Print(p)
05C517A2  pop         esi                     
05C517A3  pop         edi                     
05C517A4  pop         ebp                     
05C517A5  ret                                 

As we can see, loop unrolling have been performed, but the optimization has an error: variable x immediately takes the value2 and never changing. This bug has long been known, it have been discussed in the question on StackOverflow [.NET JIT potential error?] (Http://stackoverflow.com/q/2056948/184842).

Let's check other JIT versions.

CLR4 + JIT-x86

        var p = new Point();                   
01392620  push        ebp                      
01392621  mov         ebp,esp                  
01392623  push        edi                      
01392624  push        esi                      
01392625  xor         edi,edi                  ; p.Y = 0
        for (p.X = 0; p.X < 2; p.X++)          
01392627  xor         esi,esi                  ; p.X = 0
            Print(p);                          
01392629  push        edi                      ; push p.Y
0139262A  push        esi                      ; push p.X
0139262B  call        dword ptr ds:[2DB2108h]  ; Print(p)
        for (p.X = 0; p.X < 2; p.X++)          
01392631  inc         esi                      ; p.X++
01392632  cmp         esi,2                    
01392635  jl          01392629                 
01392637  pop         esi                      
01392638  pop         edi                      
01392639  pop         ebp                      
0139263A  ret                                  

Microsoft has fixed the bug in CLR4. Unfortunately we don't have loop unrolling in the new version of the code. It works correctly, though not so fast.

CLR2 + JIT-x64

        var p = new Point();
00007FFCB94A3502  in          al,dx  
00007FFCB94A3503  cmp         byte ptr [rbx],dh  
00007FFCB94A3505  ror         byte ptr [rax-77h],44h  
00007FFCB94A3509  and         al,20h  
00007FFCB94A350B  xor         eax,eax  
00007FFCB94A350D  mov         qword ptr [rsp+20h],rax  
        for (p.X = 0; p.X < 2; p.X++)
00007FFCB94A3512  mov         dword ptr [rsp+20h],0  
00007FFCB94A351A  mov         eax,dword ptr [rsp+20h]  
00007FFCB94A351E  cmp         eax,2  
00007FFCB94A3521  jge         00007FFCB94A3544  
            Print(p);
00007FFCB94A3523  mov         rcx,qword ptr [rsp+20h]  
00007FFCB94A3528  call        00007FFCB936C868  
        for (p.X = 0; p.X < 2; p.X++)
00007FFCB94A352D  mov         r11d,dword ptr [rsp+20h]  
00007FFCB94A3532  add         r11d,1  
00007FFCB94A3536  mov         dword ptr [rsp+20h],r11d  
00007FFCB94A353B  mov         eax,dword ptr [rsp+20h]  
00007FFCB94A353F  cmp         eax,2  
00007FFCB94A3542  jl          00007FFCB94A3523  
00007FFCB94A3544  add         rsp,38h  
00007FFCB94A3548  rep ret  

Although this code works correctly, but it is very bad: loop unrolling haven't been applying, and we have big amount of unnecessary instruction with sending values between stack and registers. Let's see what has changed in CLR4.

CLR4 + JIT-x64

        var p = new Point();                           
00007FFCC8703EC2  sub         esp,30h                  
00007FFCC8703EC5  mov         qword ptr [rsp+20h],0    
        for (p.X = 0; p.X < 2; p.X++)                  
00007FFCC8703ECE  xor         ebx,ebx                  
00007FFCC8703ED0  mov         dword ptr [rsp+20h],ebx  
00007FFCC8703ED4  cmp         ebx,2                    
00007FFCC8703ED7  jge         00007FFCC8703EF5         
00007FFCC8703ED9  nop         dword ptr [rax]          
            Print(p);                                  
00007FFCC8703EE0  mov         rcx,qword ptr [rsp+20h]  
00007FFCC8703EE5  call        00007FFCC85EC8E0         
        for (p.X = 0; p.X < 2; p.X++)                  
00007FFCC8703EEA  inc         ebx                      
00007FFCC8703EEC  mov         dword ptr [rsp+20h],ebx  
00007FFCC8703EF0  cmp         ebx,2                    
00007FFCC8703EF3  jl          00007FFCC8703EE0         
00007FFCC8703EF5  add         rsp,30h                  
00007FFCC8703EF9  pop         rbx                      
00007FFCC8703EFA  ret                                  

We still don't have loop unrolling. However, code has become much better in comparison with CLR2 + JIT-x64. Note, the sending the Point structure to the Print method implements via 64-bit register instead of two 32-bit values on the stack in JIT-x86.

CLR4 + RyuJIT CTP5

Next, let's try RyuJIT CTP5:

        var p = new Point();
00007FFCC8723A02  sub         rsp,28h  
00007FFCC8723A06  xor         esi,esi  
        for (p.X = 0; p.X < 2; p.X++)
00007FFCC8723A08  xor         edi,edi  
            Print(p);
00007FFCC8723A0A  lea         rcx,[rsp+20h]  
00007FFCC8723A0F  mov         dword ptr [rcx],edi  
00007FFCC8723A11  mov         dword ptr [rcx+4],esi  
00007FFCC8723A14  mov         rcx,qword ptr [rsp+20h]  
00007FFCC8723A19  call        00007FFCC860C8E0  
        for (p.X = 0; p.X < 2; p.X++)
00007FFCC8723A1E  inc         edi  
00007FFCC8723A20  cmp         edi,2  
00007FFCC8723A23  jl          00007FFCC8723A0A  
00007FFCC8723A25  add         rsp,28h  
00007FFCC8723A29  pop         rsi  
00007FFCC8723A2A  pop         rdi  
00007FFCC8723A2B  ret  

We don't have loop unrolling. However, code looks a little cleaner with comparison with CLR4 + JIT-x64.

Summary

We show the results of the experiments in the following table:

Experiment CLR JIT Results
Sum 4 x86 Loop unrolling
Sum 4 x64 Value precalculation
Sum 4 RuyJIT Loop unrolling is absence
Point 2 x86 Loop unrolling with the bug
Point 4 x86 Loop unrolling is absence
Point 2 x64 Loop unrolling is absence, poor code quality
Point 4 x64 Loop unrolling is absence, medium code quality
Point 4 RyuJIT Loop unrolling is absence, good code quality
Share:
You can find source code of this post on GitHub:
https://github.com/AndreyAkinshin/aakinshin.net/blob/master/web/_posts/en/2015/unrolling-of-small-loops-in-different-jit-versions.md