RyuJIT RC and constant folding


Update: The results below are valid for the release version of RyuJIT in .NET Framework 4.6 without updates.

The challenge of the day: which method is faster?

public double Sqrt13()
{
    return Math.Sqrt(1) + Math.Sqrt(2) + Math.Sqrt(3) + Math.Sqrt(4) + Math.Sqrt(5) + 
           Math.Sqrt(6) + Math.Sqrt(7) + Math.Sqrt(8) + Math.Sqrt(9) + Math.Sqrt(10) + 
           Math.Sqrt(11) + Math.Sqrt(12) + Math.Sqrt(13);
}
public double Sqrt14()
{
    return Math.Sqrt(1) + Math.Sqrt(2) + Math.Sqrt(3) + Math.Sqrt(4) + Math.Sqrt(5) + 
           Math.Sqrt(6) + Math.Sqrt(7) + Math.Sqrt(8) + Math.Sqrt(9) + Math.Sqrt(10) + 
           Math.Sqrt(11) + Math.Sqrt(12) + Math.Sqrt(13) + Math.Sqrt(14);
}

I measured the performance of both methods with BenchmarkDotNet on RyuJIT RC (a part of .NET Framework 4.6 RC) and got the following results:

// BenchmarkDotNet=v0.7.4.0
// OS=Microsoft Windows NT 6.2.9200.0
// Processor=Intel(R) Core(TM) i7-4702MQ CPU @ 2.20GHz, ProcessorCount=8
// CLR=MS.NET 4.0.30319.0, Arch=64-bit  [RyuJIT]
Common:  Type=Math_DoubleSqrtAvx  Mode=Throughput  Platform=X64  Jit=RyuJit  .NET=Current  

 Method |  AvrTime |    StdDev |         op/s |
------- |--------- |---------- |------------- |
 Sqrt13 | 55.40 ns |  0.571 ns |  18050993.06 |
 Sqrt14 |  1.43 ns | 0.0224 ns | 697125029.18 |

How so? If I add one more Math.Sqrt to the expression, the method starts to work about 40 times faster (55.40 ns vs. 1.43 ns)! Let’s examine the situation.

First of all, let’s look at the generated ASM code in Visual Studio:

; Sqrt13
vsqrtsd     xmm0,xmm0,mmword ptr [7FF94F9E4D28h]  
vsqrtsd     xmm1,xmm0,mmword ptr [7FF94F9E4D30h]  
vaddsd      xmm0,xmm0,xmm1  
vsqrtsd     xmm1,xmm0,mmword ptr [7FF94F9E4D38h]  
vaddsd      xmm0,xmm0,xmm1  
vsqrtsd     xmm1,xmm0,mmword ptr [7FF94F9E4D40h]  
vaddsd      xmm0,xmm0,xmm1  
vsqrtsd     xmm1,xmm0,mmword ptr [7FF94F9E4D48h]  
vaddsd      xmm0,xmm0,xmm1  
vsqrtsd     xmm1,xmm0,mmword ptr [7FF94F9E4D50h]  
vaddsd      xmm0,xmm0,xmm1  
vsqrtsd     xmm1,xmm0,mmword ptr [7FF94F9E4D58h]  
vaddsd      xmm0,xmm0,xmm1  
vsqrtsd     xmm1,xmm0,mmword ptr [7FF94F9E4D60h]  
vaddsd      xmm0,xmm0,xmm1  
vsqrtsd     xmm1,xmm0,mmword ptr [7FF94F9E4D68h]  
vaddsd      xmm0,xmm0,xmm1  
vsqrtsd     xmm1,xmm0,mmword ptr [7FF94F9E4D70h]  
vaddsd      xmm0,xmm0,xmm1  
vsqrtsd     xmm1,xmm0,mmword ptr [7FF94F9E4D78h]  
vaddsd      xmm0,xmm0,xmm1  
vsqrtsd     xmm1,xmm0,mmword ptr [7FF94F9E4D80h]  
vaddsd      xmm0,xmm0,xmm1  
vsqrtsd     xmm1,xmm0,mmword ptr [7FF94F9E4D88h]  
vaddsd      xmm0,xmm0,xmm1  
ret

; Sqrt14
vmovsd      xmm0,qword ptr [7FF94F9C4C80h]  
ret    

It seems that the expression with 13 Math.Sqrt calls is calculated honestly at run time, while the expression with 14 Math.Sqrt calls is precalculated by constant folding.
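To make the idea concrete: constant folding means the compiler evaluates an expression of constants ahead of time and embeds only the result. The following is a minimal sketch of that arithmetic, not of the JIT itself; the names FoldedConstant and SumOfSqrts are made up for illustration. It computes the value that the single load in Sqrt14 would be expected to fetch.

```csharp
using System;
using System.Globalization;

// Illustrative only: FoldedConstant and SumOfSqrts are hypothetical names,
// not part of the original benchmark.
public static class FoldedConstant
{
    // Adds Math.Sqrt(1) + Math.Sqrt(2) + ... + Math.Sqrt(n) from left to right,
    // the same order in which the source expression is written.
    public static double SumOfSqrts(int n)
    {
        double sum = 0.0;
        for (int i = 1; i <= n; i++)
            sum += Math.Sqrt(i);
        return sum;
    }

    public static void Main()
    {
        // The value a folded Sqrt14 would return (roughly 36.5962).
        Console.WriteLine(SumOfSqrts(14).ToString("F6", CultureInfo.InvariantCulture));
    }
}
```

With folding, all of this work happens once at JIT time, so the method body shrinks to a single load and a ret.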

Next, let’s build our own version of CoreCLR. I will work with the current version as of today (0e6021bb). COMPLUS_JitDisasm helps us print the generated ASM code:

; Sqrt13
sqrtsd   xmm0, qword ptr [RWD00]
sqrtsd   xmm1, qword ptr [RWD08]
addsd    xmm0, xmm1
sqrtsd   xmm1, qword ptr [RWD16]
addsd    xmm0, xmm1
sqrtsd   xmm1, qword ptr [RWD24]
addsd    xmm0, xmm1
sqrtsd   xmm1, qword ptr [RWD32]
addsd    xmm0, xmm1
sqrtsd   xmm1, qword ptr [RWD40]
addsd    xmm0, xmm1
sqrtsd   xmm1, qword ptr [RWD48]
addsd    xmm0, xmm1
sqrtsd   xmm1, qword ptr [RWD56]
addsd    xmm0, xmm1
sqrtsd   xmm1, qword ptr [RWD64]
addsd    xmm0, xmm1
sqrtsd   xmm1, qword ptr [RWD72]
addsd    xmm0, xmm1
sqrtsd   xmm1, qword ptr [RWD80]
addsd    xmm0, xmm1
sqrtsd   xmm1, qword ptr [RWD88]
addsd    xmm0, xmm1
sqrtsd   xmm1, qword ptr [RWD96]
addsd    xmm0, xmm1
ret

; Sqrt14
movsd    xmm0, qword ptr [RWD00]
ret

We can see that CoreCLR uses the sqrtsd instruction (SSE2) instead of vsqrtsd (AVX). It is not important for now. We can create an issue on GitHub (coreclr/issues/977) and continue (there is now a fix for the problem: coreclr/pull/981).

Now let’s enable COMPLUS_JitDump and print the full dump. We can see that RyuJIT builds the following tree for Sqrt13:

*  stmtExpr  void  (top level) (IL 0x000...  ???)
|     /--*  mathFN    double sqrt
|     |  \--*  dconst    double 13.000000000000000
|  /--*  +         double
|  |  |  /--*  mathFN    double sqrt
|  |  |  |  \--*  dconst    double 12.000000000000000
|  |  \--*  +         double
|  |     |  /--*  mathFN    double sqrt
|  |     |  |  \--*  dconst    double 11.000000000000000
|  |     \--*  +         double
|  |        |  /--*  mathFN    double sqrt
|  |        |  |  \--*  dconst    double 10.000000000000000
|  |        \--*  +         double
|  |           |  /--*  mathFN    double sqrt
|  |           |  |  \--*  dconst    double 9.0000000000000000
|  |           \--*  +         double
|  |              |  /--*  mathFN    double sqrt
|  |              |  |  \--*  dconst    double 8.0000000000000000
|  |              \--*  +         double
|  |                 |  /--*  mathFN    double sqrt
|  |                 |  |  \--*  dconst    double 7.0000000000000000
|  |                 \--*  +         double
|  |                    |  /--*  mathFN    double sqrt
|  |                    |  |  \--*  dconst    double 6.0000000000000000
|  |                    \--*  +         double
|  |                       |  /--*  mathFN    double sqrt
|  |                       |  |  \--*  dconst    double 5.0000000000000000
|  |                       \--*  +         double
|  |                          |  /--*  mathFN    double sqrt
|  |                          |  |  \--*  dconst    double 4.0000000000000000
|  |                          \--*  +         double
|  |                             |  /--*  mathFN    double sqrt
|  |                             |  |  \--*  dconst    double 3.0000000000000000
|  |                             \--*  +         double
|  |                                |  /--*  mathFN    double sqrt
|  |                                |  |  \--*  dconst    double 2.0000000000000000
|  |                                \--*  +         double
|  |                                   \--*  mathFN    double sqrt
|  |                                      \--*  dconst    double 1.0000000000000000
\--*  =         double
   \--*  lclVar    double V01 tmp0

RyuJIT marks the expression in Sqrt13 as not too big and doesn’t apply any optimization in this case. However, RyuJIT marks the expression in Sqrt14 as too big, saves it in a temp variable, and applies constant folding:

N001 [000001]   dconst    1.0000000000000000 => $c0 {DblCns[1.000000]}
N002 [000002]   mathFN    => $c0 {DblCns[1.000000]}
N003 [000003]   dconst    2.0000000000000000 => $c1 {DblCns[2.000000]}
N004 [000004]   mathFN    => $c2 {DblCns[1.414214]}
N005 [000005]   +         => $c3 {DblCns[2.414214]}
N006 [000006]   dconst    3.0000000000000000 => $c4 {DblCns[3.000000]}
N007 [000007]   mathFN    => $c5 {DblCns[1.732051]}
N008 [000008]   +         => $c6 {DblCns[4.146264]}
N009 [000009]   dconst    4.0000000000000000 => $c7 {DblCns[4.000000]}
N010 [000010]   mathFN    => $c1 {DblCns[2.000000]}
N011 [000011]   +         => $c8 {DblCns[6.146264]}
N012 [000012]   dconst    5.0000000000000000 => $c9 {DblCns[5.000000]}
N013 [000013]   mathFN    => $ca {DblCns[2.236068]}
N014 [000014]   +         => $cb {DblCns[8.382332]}
N015 [000015]   dconst    6.0000000000000000 => $cc {DblCns[6.000000]}
N016 [000016]   mathFN    => $cd {DblCns[2.449490]}
N017 [000017]   +         => $ce {DblCns[10.831822]}
N018 [000018]   dconst    7.0000000000000000 => $cf {DblCns[7.000000]}
N019 [000019]   mathFN    => $d0 {DblCns[2.645751]}
N020 [000020]   +         => $d1 {DblCns[13.477573]}
N021 [000021]   dconst    8.0000000000000000 => $d2 {DblCns[8.000000]}
N022 [000022]   mathFN    => $d3 {DblCns[2.828427]}
N023 [000023]   +         => $d4 {DblCns[16.306001]}
N024 [000024]   dconst    9.0000000000000000 => $d5 {DblCns[9.000000]}
N025 [000025]   mathFN    => $c4 {DblCns[3.000000]}
N026 [000026]   +         => $d6 {DblCns[19.306001]}
N027 [000027]   dconst    10.000000000000000 => $d7 {DblCns[10.000000]}
N028 [000028]   mathFN    => $d8 {DblCns[3.162278]}
N029 [000029]   +         => $d9 {DblCns[22.468278]}
N030 [000030]   dconst    11.000000000000000 => $da {DblCns[11.000000]}
N031 [000031]   mathFN    => $db {DblCns[3.316625]}
N032 [000032]   +         => $dc {DblCns[25.784903]}
N033 [000033]   dconst    12.000000000000000 => $dd {DblCns[12.000000]}
N034 [000034]   mathFN    => $de {DblCns[3.464102]}
N035 [000035]   +         => $df {DblCns[29.249005]}
N036 [000036]   dconst    13.000000000000000 => $e0 {DblCns[13.000000]}
N037 [000037]   mathFN    => $e1 {DblCns[3.605551]}
N038 [000038]   +         => $e2 {DblCns[32.854556]}
N039 [000041]   lclVar    V01 tmp0         d:2 => $e2 {DblCns[32.854556]}
N040 [000042]   =         => $e2 {DblCns[32.854556]}
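The dump shows every node ending up with a constant value number. The general technique can be sketched as a bottom-up pass over an expression tree. This is a toy model, not RyuJIT's actual value-numbering code: the node types Const, Local, Sqrt, and Add and the ConstantFolder class are hypothetical and only mimic the dconst, lclVar, mathFN sqrt, and + nodes from the dump.

```csharp
using System;

// Toy expression tree; every type here is hypothetical and exists only for illustration.
public abstract class Node { }

public sealed class Const : Node
{
    public readonly double Value;
    public Const(double value) { Value = value; }
}

public sealed class Local : Node
{
    public readonly string Name;
    public Local(string name) { Name = name; }
}

public sealed class Sqrt : Node
{
    public readonly Node Arg;
    public Sqrt(Node arg) { Arg = arg; }
}

public sealed class Add : Node
{
    public readonly Node Left, Right;
    public Add(Node left, Node right) { Left = left; Right = right; }
}

public static class ConstantFolder
{
    // Folds children first; a node collapses to a Const only if all of its operands did.
    public static Node Fold(Node node)
    {
        if (node is Sqrt sqrt)
        {
            Node arg = Fold(sqrt.Arg);
            return arg is Const c ? (Node)new Const(Math.Sqrt(c.Value)) : new Sqrt(arg);
        }
        if (node is Add add)
        {
            Node left = Fold(add.Left);
            Node right = Fold(add.Right);
            if (left is Const l && right is Const r)
                return new Const(l.Value + r.Value);
            return new Add(left, right);
        }
        return node; // Const and Local are leaves and stay as they are
    }

    // Builds Sqrt(1) + Sqrt(2) + ... + Sqrt(n) as a left-leaning chain of Add nodes.
    public static Node BuildSqrtSum(int n)
    {
        Node tree = new Sqrt(new Const(1));
        for (int i = 2; i <= n; i++)
            tree = new Add(tree, new Sqrt(new Const(i)));
        return tree;
    }

    public static void Main()
    {
        Node folded = Fold(BuildSqrtSum(13));
        Console.WriteLine(folded is Const k ? "folded to " + k.Value : "not folded");
    }
}
```

In this sketch the whole tree for 13 square roots collapses into one Const node, while a tree containing a Local leaf keeps its Add node and folds only the constant subtrees.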

This is a very strange situation. I would like RyuJIT to apply this optimization even to small expressions. So, let’s go to GitHub and create another issue: coreclr/issues/978. We can learn a new fact from the discussion: if we manually save the result of the expression in a temp variable:

public static double Sqrt13B()
{
    double res = Math.Sqrt(1) + Math.Sqrt(2) + Math.Sqrt(3) + Math.Sqrt(4) + Math.Sqrt(5) + 
                 Math.Sqrt(6) + Math.Sqrt(7) + Math.Sqrt(8) + Math.Sqrt(9) + Math.Sqrt(10) + 
                 Math.Sqrt(11) + Math.Sqrt(12) + Math.Sqrt(13);
    return res;
}

then RyuJIT applies constant folding and the expression is precalculated. As a result of the discussion, it was decided that RyuJIT shouldn’t behave like this. So, another ticket was created: coreclr/issues/987.