LegacyJIT-x86 and first method call
Today I tell you about one of my favorite benchmarks (this method doesn’t return a useful value, we need it only as an example):
[Benchmark]
public string Sum()
{
double a = 1, b = 1;
var sw = new Stopwatch();
for (int i = 0; i < 10001; i++)
a = a + b;
return string.Format("{0}{1}", a, sw.ElapsedMilliseconds);
}
An interesting fact: if you call Stopwatch.GetTimestamp()
before the first call of the Sum
method, you improve Sum
performance several times (works only with LegacyJIT-x86).
Source code and ASM
Let’s consider the following programs (platform target is x86
):
class ProgramA
{
static void Main()
{
Sum();
}
public static string Sum()
{
double a = 1, b = 1;
var sw = new Stopwatch();
for (int i = 0; i < 10001; i++)
a = a + b;
return string.Format("{0}{1}", a, sw.ElapsedMilliseconds);
}
}
class ProgramB
{
static void Main()
{
Stopwatch.GetTimestamp(); // !!!
Sum();
}
public static string Sum()
{
double a = 1, b = 1;
var sw = new Stopwatch();
for (int i = 0; i < 10001; i++)
a = a + b;
return string.Format("{0}{1}", a, sw.ElapsedMilliseconds);
}
}
The only difference between these programs is the Stopwatch.GetTimestamp()
call. Now, let’s look at the asm code for the loop:
; ProgramA
; for (int i = 0; i < 10001; i++)
xor eax,eax
; a = a + b;
fld1
fadd qword ptr [ebp-14h]
fstp qword ptr [ebp-14h]
; ProgramB
; for (int i = 0; i < 10001; i++)
xor eax,eax
; a = a + b;
fld1
faddp st(1),st
It turns out, that in ProgramA
keeps data on the stack, ProgramB
keeps data in FPU registers.
How so?
In fact, in ProgramB
we can call Stopwatch.IsHighResolution
or Stopwatch.Frequency
instead of Stopwatch.GetTimestamp()
. The main things is that we want to implicitly call the static constructor of the Stopwatch
class. It affects how the call of the Stopwatch
instance constructor will be jitted:
; Program A
; var sw = new Stopwatch();
mov ecx,71CDF3D4h
call 005D30F4 ; basic ctor logic
mov ecx,5E5F60h ; !!! Here we should check
mov edx,4F6h ; !!! that static constructor
call 005D348C ; !!! has been called
; // inlined Stopwatch::.ctor body
mov dword ptr [esi+4],0 ; elapsed = 0
mov dword ptr [esi+8],0 ; elapsed = 0
mov byte ptr [esi+14h],0 ; isRunning = false
mov dword ptr [esi+0Ch],0 ; startTimeStamp = 0
mov dword ptr [esi+10h],0 ; startTimeStamp = 0
; Program B
; var sw = new Stopwatch();
mov ecx,71CDF3D4h
call 005D30F4 ; basic ctor logic
; // inlined Stopwatch::.ctor body
mov dword ptr [esi+4],0 ; elapsed = 0
mov dword ptr [esi+8],0 ; elapsed = 0
mov byte ptr [esi+14h],0 ; isRunning = false
mov dword ptr [esi+0Ch],0 ; startTimeStamp = 0
mov dword ptr [esi+10h],0 ; startTimeStamp = 0
As you can see, we have two call sites for ProgramA
and one call site for ProgramB
.
The FP enregistration logic for the LegacyJIT-x86 uses the number of call sites as a factor in choosing to enregister or not to enregister floating point locals. Thus, we have different asm code for ProgramA
and ProgramB
.
Benchmarks
But should we care about it? How it affects method performance? Let’s benchmark it! I wrote the following benchmark (based on BenchmarkDotNet v0.9.4):
[Config(typeof(Config))]
public class FirstCall
{
[Params(false, true)]
public bool CallTimestamp { get; set; }
[Setup]
public void Setup()
{
if (CallTimestamp)
Stopwatch.GetTimestamp();
}
[Benchmark]
public string Sum()
{
double a = 1, b = 1;
var sw = new Stopwatch();
for (int i = 0; i < 10001; i++)
a = a + b;
return string.Format("{0}{1}", a, sw.ElapsedMilliseconds);
}
private class Config : ManualConfig
{
public Config()
{
Add(Job.LegacyJitX86);
}
}
}
Results:
BenchmarkDotNet=v0.9.4.0
OS=Microsoft Windows NT 6.2.9200.0
Processor=Intel(R) Core(TM) i7-4810MQ CPU 2.80GHz, ProcessorCount=8
Frequency=2728072 ticks, Resolution=366.5592 ns, Timer=TSC
HostCLR=MS.NET 4.0.30319.42000, Arch=32-bit RELEASE
JitModules=clrjit-v4.6.1073.0
Type=FirstCall Mode=Throughput Platform=X86
Jit=LegacyJit
Method | Median | StdDev | CallTimestamp |
------- |----------- |---------- |-------------- |
Sum | 27.0464 us | 0.4958 us | False |
Sum | 8.3247 us | 0.0293 us | True |
A single call of Stopwatch.GetTimestamp()
before the first call of the Sum
method improved performance 3.5 times!
Conclusion
Sometimes, performance is tricky and benchmarking is super-tricky.
- In general case, you can’t just take a method without a context and start to discuss about its performance because method jitted code can depend on CLR state at the moment of the first method call. (However, in practice, this is a rare situation).
- Your benchmark methods can affect each other (not only because of static constructors; e.g., self-tuned GC and interface method dispatching are important). Thus, it is a good practice to run each benchmark method in a separated process (default behaviour of BenchmarkDotNet) .
- It is really easy to make a mistake in a handwritten benchmark. In the above example, a careless call of a
Stopwatch
method can spoil benchmark results.