Today I tell you about one of my favorite benchmarks (this method doesn't return a useful value, we need it only as an example):
[Benchmark]
public string Sum()
{
double a = 1, b = 1;
var sw = new Stopwatch();
for (int i = 0; i < 10001; i++)
a = a + b;
return string.Format("{0}{1}", a, sw.ElapsedMilliseconds);
}
An interesting fact: if you call Stopwatch.GetTimestamp()
before the first call of the Sum
method, you improve Sum
performance several times (works only with LegacyJIT-x86).
Let's consider the following programs (platform target is x86
):
class ProgramA
{
static void Main()
{
Sum();
}
public static string Sum()
{
double a = 1, b = 1;
var sw = new Stopwatch();
for (int i = 0; i < 10001; i++)
a = a + b;
return string.Format("{0}{1}", a, sw.ElapsedMilliseconds);
}
}
class ProgramB
{
static void Main()
{
Stopwatch.GetTimestamp(); // !!!
Sum();
}
public static string Sum()
{
double a = 1, b = 1;
var sw = new Stopwatch();
for (int i = 0; i < 10001; i++)
a = a + b;
return string.Format("{0}{1}", a, sw.ElapsedMilliseconds);
}
}
The only difference between these programs is the Stopwatch.GetTimestamp()
call. Now, let's look at the asm code for the loop:
; ProgramA
; for (int i = 0; i < 10001; i++)
xor eax,eax
; a = a + b;
fld1
fadd qword ptr [ebp-14h]
fstp qword ptr [ebp-14h]
; ProgramB
; for (int i = 0; i < 10001; i++)
xor eax,eax
; a = a + b;
fld1
faddp st(1),st
It turns out, that in ProgramA
keeps data on the stack, ProgramB
keeps data in FPU registers.
In fact, in ProgramB
we can call Stopwatch.IsHighResolution
or Stopwatch.Frequency
instead of Stopwatch.GetTimestamp()
. The main things is that we want to implicitly call the static constructor of the Stopwatch
class. It affects how the call of the Stopwatch
instance constructor will be jitted:
; Program A
; var sw = new Stopwatch();
mov ecx,71CDF3D4h
call 005D30F4 ; basic ctor logic
mov ecx,5E5F60h ; !!! Here we should check
mov edx,4F6h ; !!! that static constructor
call 005D348C ; !!! has been called
; // inlined Stopwatch::.ctor body
mov dword ptr [esi+4],0 ; elapsed = 0
mov dword ptr [esi+8],0 ; elapsed = 0
mov byte ptr [esi+14h],0 ; isRunning = false
mov dword ptr [esi+0Ch],0 ; startTimeStamp = 0
mov dword ptr [esi+10h],0 ; startTimeStamp = 0
; Program B
; var sw = new Stopwatch();
mov ecx,71CDF3D4h
call 005D30F4 ; basic ctor logic
; // inlined Stopwatch::.ctor body
mov dword ptr [esi+4],0 ; elapsed = 0
mov dword ptr [esi+8],0 ; elapsed = 0
mov byte ptr [esi+14h],0 ; isRunning = false
mov dword ptr [esi+0Ch],0 ; startTimeStamp = 0
mov dword ptr [esi+10h],0 ; startTimeStamp = 0
As you can see, we have two call sites for ProgramA
and one call site for ProgramB
.
The FP enregistration logic for the LegacyJIT-x86 uses the number of call sites as a factor in choosing to enregister or not to enregister floating point locals. Thus, we have different asm code for ProgramA
and ProgramB
.
But should we care about it? How it affects method performance? Let's benchmark it! I wrote the following benchmark (based on BenchmarkDotNet v0.9.4):
[Config(typeof(Config))]
public class FirstCall
{
[Params(false, true)]
public bool CallTimestamp { get; set; }
[Setup]
public void Setup()
{
if (CallTimestamp)
Stopwatch.GetTimestamp();
}
[Benchmark]
public string Sum()
{
double a = 1, b = 1;
var sw = new Stopwatch();
for (int i = 0; i < 10001; i++)
a = a + b;
return string.Format("{0}{1}", a, sw.ElapsedMilliseconds);
}
private class Config : ManualConfig
{
public Config()
{
Add(Job.LegacyJitX86);
}
}
}
Results:
BenchmarkDotNet=v0.9.4.0
OS=Microsoft Windows NT 6.2.9200.0
Processor=Intel(R) Core(TM) i7-4810MQ CPU 2.80GHz, ProcessorCount=8
Frequency=2728072 ticks, Resolution=366.5592 ns, Timer=TSC
HostCLR=MS.NET 4.0.30319.42000, Arch=32-bit RELEASE
JitModules=clrjit-v4.6.1073.0
Type=FirstCall Mode=Throughput Platform=X86
Jit=LegacyJit
Method | Median | StdDev | CallTimestamp |
------- |----------- |---------- |-------------- |
Sum | 27.0464 us | 0.4958 us | False |
Sum | 8.3247 us | 0.0293 us | True |
A single call of Stopwatch.GetTimestamp()
before the first call of the Sum
method improved performance 3.5 times!
Sometimes, performance is tricky and benchmarking is super-tricky.
Stopwatch
method can spoil benchmark results.