Stopwatch under the hood

by Andrey Akinshin · 2016-09-09

Update: You can find an updated and significantly improved version of this post in my book “Pro .NET Benchmarking”.

In the previous post, we discussed DateTime. This structure can be used in situations when you don’t need a good level of precision. If you want to do high-precision time measurements, you need a better tool because DateTime has a small resolution and a big latency. Also, time is tricky, you can create wonderful bugs if you don’t understand how it works (see Falsehoods programmers believe about time and More falsehoods programmers believe about time).

In this post, we will briefly talk about the Stopwatch class:

  • Which kind of hardware timers could be a base for Stopwatch
  • High precision timestamp API on Windows and Linux
  • Latency and Resolution of Stopwatch in different environments
  • Common pitfalls: which kind of problems could we get trying to measure small time intervals

If you are not a .NET developer, you can also find a lot of useful information in this post: mainly we will discuss low-level details of high-resolution timestamping (probably your favorite language also uses the same API). As usual, you can also find useful links for further reading.


TSC

TSCTime Stamp Counter. It is an internal 64-bit register present on all x86 processors since the Pentium. Can be read into EDX:EAX using the instruction RDTSC.

You can fine a lot of useful information in Intel: Intel® 64 and IA-32 Architectures Software Developer’s Manual. In this section, I will often refer to this manual and will call it just “the manual”.

Opcode for RDTSC is 0F 31 (the manual, Vol. 2B 4-545). On Windows, it can be read directly from C# code with help of the following asm injection:

const uint PAGE_EXECUTE_READWRITE = 0x40;
const uint MEM_COMMIT = 0x1000;

[DllImport("kernel32.dll", SetLastError = true)]
static extern IntPtr VirtualAlloc(IntPtr lpAddress, uint dwSize, 
                                  uint flAllocationType, uint flProtect);

static IntPtr Alloc(byte[] asm)
{
    var ptr = VirtualAlloc(IntPtr.Zero, (uint)asm.Length, 
                           MEM_COMMIT, PAGE_EXECUTE_READWRITE);
    Marshal.Copy(asm, 0, ptr, asm.Length);
    return ptr;
}

delegate long RdtscDelegate();

static readonly byte[] rdtscAsm =
{
    0x0F, 0x31, // rdtsc
    0xC3        // ret
};

static void Main()
{
    var rdtsc = Marshal.GetDelegateForFunctionPointer<RdtscDelegate>(Alloc(rdtscAsm));
    Console.WriteLine(rdtsc());
}

On modern hardware and modern operation systems, TSC works well, but it has a long history and people often consider TSC as an unreliable source of timestamps. Let’s discuss different generations of TSC and problems which we could get with TSC (you can find more information about it in the manual, Vol. 3B 17-40, section 17.15).

Variant TSC

The first version of TSC (see the list of the processors families in the manual, Vol. 3B 17-40, section 17.15) was very simple: it just counts internal processor clock cycle.

It’s not a good way to measure time on modern hardware because the processor can dynamically change own frequency (e.g. see SpeedStep).

There is another problem: each processor core has own TSC, and these TSCs are not synchronized. If a thread starts measurement on one core and ends on another core, the obtained result can’t be reliable. For example, there is a nice bug report on support.microsoft.com (see Programs that use the QueryPerformanceCounter function may perform poorly):

C:\>ping x.x.x.x

Pinging x.x.x.x with 32 bytes of data:

Reply from x.x.x.x: bytes=32 time=-59ms TTL=128
Reply from x.x.x.x: bytes=32 time=-59ms TTL=128
Reply from x.x.x.x: bytes=32 time=-59ms TTL=128
Reply from x.x.x.x: bytes=32 time=-59ms TTL=128

The cause:

This problem occurs when the computer has the AMD Cool’n’Quiet technology (AMD dual cores) enabled in the BIOS or some Intel multi core processors. Multi core or multiprocessor systems may encounter Time Stamp Counter (TSC) drift when the time between different cores is not synchronized. The operating systems which use TSC as a timekeeping resource may experience the issue.

If you want to use TSC on old hardware/software, you should probably set processor affinity for you thread (see SetThreadAffinityMask for Windows, sched_setaffinity for Linux).

Constant TSC

Constant TSC is the next generation of TSC which solves the dynamic frequency problem: this kind of TSC increments at a constant rate. It’s a good step forward, but Constant TSC still has some issues (e.g. it could be stopped when CPU run into deep C-state, see also Power Management States: P-States, C-States, and Package C-States).

Invariant TSC

Invariant TSC is the latest version of the counter which works well. From the manual:

The invariant TSC will run at a constant rate in all ACPI P-, C-. and T-states. This is the architectural behavior moving forward. On processors with invariant TSC support, the OS may use the TSC for wall clock timer services (instead of ACPI or HPET timers).

You can check which kind of TSC do you have with the help of the CPUID opcode. For example, processors support for invariant TSC is indicated by CPUID.80000007H:EDX[8] (the manual, Vol. 2A 3-190, Table 3-17).

On Windows, you can also check it via the Coreinfo utility:

> Coreinfo.exe | grep -i "tsc"

The output on my laptop:

Coreinfo v3.31 - Dump information on system CPU and memory topology
Copyright (C) 2008-2014 Mark Russinovich
Sysinternals - www.sysinternals.com
RDTSCP          *       Supports RDTSCP instruction
TSC             *       Supports RDTSC instruction
TSC-DEADLINE    *       Local APIC supports one-shot deadline timer
TSC-INVARIANT   *       TSC runs at constant rate

(You can read more about TSC-DEADLINE in the same manual, Vol. 3A 10-17, section 10.5.4.1)

You can do the same thing on Linux with the following command:

$ cat /proc/cpuinfo | tr ' ' '\n' | sort -u | grep -i "tsc"

Output on my laptop:

constant_tsc
nonstop_tsc
rdtscp
tsc
tsc_adjust
tsc_deadline_timer

Invariant TSC is specified by a combination of constant_tsc and nonstop_tsc flags.

In the most cases, you can trust Invariant TSC and use it for high-precision measurements (however, there are still some problems, e.g. synchronization problems on large multi-processor systems).

TSC and out-of-order execution

There is another interesting fact which you should consider if you want to read the TSC value directly via the RDTSC instruction: processor can reorder your instruction and spoil your measurements.

From the manual, Vol. 3B 17-41, section 17.15:

The RDTSC instruction is not serializing or ordered with other instructions. It does not necessarily wait until all previous instructions have been executed before reading the counter. Similarly, subsequent instructions may begin execution before the RDTSC instruction operation is performed.

From Optimizing subroutines in assembly language by Agner Fog (section 18.1):

On all processors with out-of-order execution, you have to insert XOR EAX,EAX/CPUID before and after each read of the counter to prevent it from executing in parallel with anything else. CPUID is a serializing instruction, which means that it flushes the pipeline and waits for all pending operations to finish before proceeding. This is very useful for testing purposes.

So, we can’t just call RDTSC and be sure that there is no out-of-order execution here. How can we call it the right way? Here is a C++ example by Agner Fog (see Optimizing software in C++. An optimization guide for Windows, Linux and Mac platforms, section 16 “Testing speed”):

// Example 16.1

## nclude <intrin.h>         // Or #include <ia32intrin.h> etc.

long long ReadTSC() {       // Returns time stamp counter
    int dummy[4];           // For unused returns
    volatile int DontSkip;  // Volatile to prevent optimizing
    long long clock;        // Time
    __cpuid(dummy, 0);      // Serialize
    DontSkip = dummy[0];    // Prevent optimizing away cpuid
    clock = __rdtsc();      // Read time
    return clock;
}

There is another interesting instruction: RDTSCP, it reads time stamp counter and processor ID (see the manual, Vol. 2B 4-547):

Reads the current value of the processor’s time-stamp counter (a 64-bit MSR) into the EDX:EAX registers and also reads the value of the IA32_TSC_AUX MSR (address C0000103H) into the ECX register.

The RDTSCP instruction waits until all previous instructions have been executed before reading the counter.

From the manual, Vol. 2B 4-545:

If software requires RDTSC to be executed only after all previous instructions have completed locally, it can either use RDTSCP (if the processor supports that instruction) or execute the sequence LFENCE;RDTSC.

Thus, you can use RDTSCP instead of RDTSC (if your hardware supports this instruction) and not to be afraid of out-of-order execution.

Latency and resolution

Here is a list of reciprocal throughputs (CPU clock cycles) of RDTSC from Instruction tables by Agner Fog (2016-01-09) for different processors:

Processor NameReciprocal throughput
AMD K711
AMD K87
AMD K1067
AMD Bulldozer42
AMD Pilediver42
AMD Steamroller78
AMD Bobcat87
AMD Jaguar41
Intel Pentium II/III42
Intel Core 2 (Merom)64
Intel Core 2 (Wolfdale)32
Intel Nehalem24
Intel Sandy Bridge28
Intel Ivy Bridge27
Intel Haswell24
Intel Broadwell24
Intel Skylake25
Intel Pentium 480
Intel Pentium 4 w. EM64T (Prescott)100
VIA Nano 2000 series39
VIA Nano 3000 series37

How can we interpret these numbers? Let’s say that we have Intel Haswell (our reciprocal throughput is 24) with fixed CPU frequency = 2.2GHz. So, 1 CPU clock cycle is about 0.45ns (it’s our resolution). We can say that single RDTSC invocation takes approximately 24 x 0.45ns ≈ 10.8ns (for RDTSC we can assume that latency is approximate equals to reciprocal throughput).

You can also evaluate throughput of RDTSC on your machine. Download testp.zip from the Anger Fog site, build it, and run misc_int.sh1. Here are results on my laptop (Intel Haswell):

rdtsc Throughput

Processor 0
     Clock   Core cyc   Instruct       Uops     uop p0     uop p1     uop p2 
      1686       2384        100       1500        255        399          0 
      1686       2384        100       1500        255        399          0 
      1686       2384        100       1500        255        399          0 
      1686       2384        100       1500        254        399          0 
      1686       2384        100       1500        255        399          0 
      1686       2384        100       1500        255        399          0 
      1686       2384        100       1500        255        399          0 
      1686       2384        100       1500        254        399          0 
      1686       2384        100       1500        255        399          0 
      1686       2384        100       1500        255        399          0 

Here we have 2384 CPU cycles per 100 RDTSC instructions which mean approximately 24 cycles per instruction.

Summary

As we can see, TSC has a high resolution and low latency. However, you don’t want to use it in general because there are a lot of problems with TSC. Here is a brief summary:

  • Some old processors don’t have TSC registers.
  • The processor can change the frequency and affect the old version of TSC.
  • There are synchronization problems on multi-core systems.
  • Even if we have Invariant TSC, there are still synchronization problems on large multi-processor systems.
  • Some processors can execute RDTSC out of order.

There is another detailed problems summary in MSDN: Acquiring high-resolution time stamps, “TSC Register” section. Also, you can find a nice problems overview in this article: Pitfalls of TSC usage.

Thus, TSC is not a good choice for time measurements in general case because you can’t be sure in advance that it produces reliable measurements. Fortunately, modern operation systems provide nice API which allows getting the most reliable timestamps for current hardware.


ACPI PM and HPET

ACPI PM

ACPI is Advanced Configuration and Power Interface. It defines a power management timer that provides accurate time values. By specification (v6.0, April 2015), frequency of the Power Management Timer should be 3.579545 MHz (see section 4.8.2.1):

The power management timer is a 24-bit or 32-bit fixed rate free running count-up timer that runs off a 3.579545 MHz clock. The ACPI OS checks the FADT to determine whether the PM Timer is a 32-bit or 24-bit timer. The programming model for the PM Timer consists of event logic and a read port to the counter value. The event logic consists of an event status and enables bit. The status bit is set any time the last bit of the timer (bit 23 or bit 31) goes from set to clear or clear to set. If the TMR_EN bit is set, then the setting of the TMR_STS will generate an ACPI event in the PM1_EVT register grouping (referred to as PMTMR_PME in the diagram). The event logic is only used to emulate a larger timer.

But why do we have exactly 3.579545 MHz (which equals to 5×7×9/(8×11) MHz)? Historically, this number comes from The National Television System Committee (NTSC), here is a nice explanation from Wikipedia:

In January 1950, the Committee was reconstituted to standardize color television. In December 1953, it unanimously approved what is now called the NTSC color television standard (later defined as RS-170a). The “compatible color” standard retained full backward compatibility with existing black-and-white television sets. Color information was added to the black-and-white image by introducing a color subcarrier of precisely 3.579545 MHz (nominally 3.58 MHz). The precise frequency was chosen so that horizontal line-rate modulation components of the chrominance signal would fall exactly in between the horizontal line-rate modulation components of the luminance signal, thereby enabling the chrominance signal to be filtered out of the luminance signal with minor degradation of the luminance signal. Due to limitations of frequency divider circuits at the time the color standard was promulgated, the color subcarrier frequency was constructed as composite frequency assembled from small integers, in this case, 5×7×9/(8×11) MHz. The horizontal line rate was reduced to approximately 15,734 lines per second (3.579545×2/455 MHz) from 15,750 lines per second, and the frame rate was reduced to approximately 29.970 frames per second (the horizontal line rate divided by 525 lines/frame) from 30 frames per second. These changes amounted to 0.1 percent and were readily tolerated by existing television receivers.

HPET

From Wikipedia:

The High Precision Event Timer (HPET) is a hardware timer used in personal computers. It was developed jointly by AMD and Microsoft and has been incorporated in PC chipsets since circa 2005.

According to (IA-PC HPET Specification Rev 1.0a, section 2.2), minimum HPET clock frequency is 10 MHz. However, default HPET frequency is 14.31818 MHz or 4x the ACPI clock (it allows to use the same crystal oscillator in HPET and ACPI PM, see also wiki/Colorburst#Crystals).

On Windows you can enable or disable HPET with the help of the following commands:

:: Enable HPET (reboot is required): 
bcdedit /set useplatformclock true
:: Disable HPET (reboot is required):
bcdedit /deletevalue useplatformclock
:: View all Windows Boot Manager/Loader values:
bcdedit /enum

There are some useful commands on Linux:


## Get available clocksource:

$ cat /sys/devices/system/clocksource/clocksource0/available_clocksource
tsc hpet acpi_pm

## Get current clocksource:

$ cat /sys/devices/system/clocksource/clocksource0/current_clocksource 
tsc

## Set current clocksource:

$ sudo /bin/sh -c 'echo hpet > /sys/devices/system/clocksource/clocksource0/current_clocksource'

Usually, HPET is disabled by default on modern hardware because of large latency (see the Benchmarks section).


Windows

The best article about time stamps on Windows is Acquiring high-resolution time stamps. Brief summary:

On Windows, the primary API for high-resolution time stamps is QueryPerformanceCounter (QPC). For device drivers, the kernel-mode API is KeQueryPerformanceCounter. If you need high-resolution time-of-day measurements, use GetSystemTimePreciseAsFileTime (available since Windows 8 / Windows Server 2012).

QPC is completely independent of the system time and UTC (it is not affected by daylight savings time, leap seconds, time zones). It is also not affected by processor frequency changes. Thus, it is th best option, if you want to measure duration of an operation. If you want to know high-precision DateTime, use GetSystemTimePreciseAsFileTime.

  • QPC is available on Windows XP and Windows 2000 and works well on most systems. However, some hardware systems BIOS did not indicate the hardware CPU characteristics correctly (a non-invariant TSC), and some multi-core or multi-processor systems used processors with TSCs that could not be synchronized across cores. Systems with flawed firmware that run these versions of Windows might not provide the same QPC reading on different cores if they used the TSC as the basis for QPC.
  • All computers that shipped with Windows Vista and Windows Server 2008 used the HPET or the ACPI PM as the basis for QPC.
  • The majority of Windows 7 and Windows Server 2008 R2 computers have processors with constant-rate TSCs and use these counters as the basis for QPC.
  • Windows 8, Windows 8.1, Windows Server 2012, and Windows Server 2012 R2 use TSCs as the basis for the performance counter.

There are two main functions for high-resolution time stamps in kernel32.dll:

[DllImport("kernel32.dll")]
private static extern bool QueryPerformanceCounter(out long value);

[DllImport("kernel32.dll")]
private static extern bool QueryPerformanceFrequency(out long value);

Thus, we can get tick counter via QueryPerformanceCounter. But how does it work? Let’s write a simple program:

static void Main(string[] args)
{
    long ticks;
    QueryPerformanceCounter(out ticks);
}

[DllImport("kernel32.dll")]
private static extern bool QueryPerformanceCounter(out long value);

build it (Release-x64) and open the executable in WinDbg. There is a difference between x86 and x64 asm code, but x64 asm code will be enough to understand what’s going on. Let’s go to to the KERNEL32!QueryPerformanceCounter (we even don’t need sos.dll here):

> bp KERNEL32!QueryPerformanceCounter
> g
KERNEL32!QueryPerformanceCounter:
00007ffe6ccbb720  jmp     qword ptr [KERNEL32!QuirkIsEnabled2Worker+0x9ec8 (00007ffe6cd16378)] 
                                    ds:00007ffe6cd16378={ntdll!RtlQueryPerformanceCounter (00007ffe6d83a7b0)}

If you are not able to set a breakpoint to KERNEL32!QueryPerformanceCounter, you can try to use KERNEL32!QueryPerformanceCounterStub (I have observed both situations on Windows 10):

> bp KERNEL32!QueryPerformanceCounterStub
> g
KERNEL32!QueryPerformanceCounterStub:
00007fff431f5750 jmp      qword ptr [KERNEL32!_imp_QueryPerformanceCounter (00007fff`43255290)]
                                    ds:00007fff43255290={ntdll!RtlQueryPerformanceCounter (00007fff`45300ff0)}

KERNEL32!QueryPerformanceCounter (or KERNEL32!QueryPerformanceCounterStub) just redirects us to ntdll!RtlQueryPerformanceCounter. Let’s look at the disassembly of this method:

> uf ntdll!RtlQueryPerformanceCounter
ntdll!RtlQueryPerformanceCounter:
00007ffe6d83a7b0  push    rbx
00007ffe6d83a7b2  sub     rsp,20h
00007ffe6d83a7b6  mov     al,byte ptr [SharedUserData+0x3c6 (000000007ffe03c6)]
00007ffe6d83a7bd  mov     rbx,rcx
00007ffe6d83a7c0  cmp     al,1
00007ffe6d83a7c2  jne     ntdll!RtlQueryPerformanceCounter+0x44 (00007ffe6d83a7f4)

ntdll!RtlQueryPerformanceCounter+0x14:
00007ffe6d83a7c4  mov     rcx,qword ptr [SharedUserData+0x3b8 (000000007ffe03b8)]
00007ffe6d83a7cc  rdtsc
00007ffe6d83a7ce  shl     rdx,20h
00007ffe6d83a7d2  or      rax,rdx
00007ffe6d83a7d5  mov     qword ptr [rbx],rax
00007ffe6d83a7d8  lea     rdx,[rax+rcx]
00007ffe6d83a7dc  mov     cl,byte ptr [SharedUserData+0x3c7 (000000007ffe03c7)]
00007ffe6d83a7e3  shr     rdx,cl
00007ffe6d83a7e6  mov     qword ptr [rbx],rdx

ntdll!RtlQueryPerformanceCounter+0x39:
00007ffe6d83a7e9  mov     eax,1
00007ffe6d83a7ee  add     rsp,20h
00007ffe6d83a7f2  pop     rbx
00007ffe6d83a7f3  ret

ntdll!RtlQueryPerformanceCounter+0x44:
00007ffe6d83a7f4  lea     rdx,[rsp+40h]
00007ffe6d83a7f9  lea     rcx,[rsp+38h]
00007ffe6d83a7fe  call    ntdll!NtQueryPerformanceCounter (00007ffe6d8956f0)
00007ffe6d83a803  mov     rax,qword ptr [rsp+38h]
00007ffe6d83a808  mov     qword ptr [rbx],rax
00007ffe6d83a80b  jmp     ntdll!RtlQueryPerformanceCounter+0x39 (00007ffe6d83a7e9)

I will try to explain the situation in simple terms. There is a special flag in [SharedUserData+0x3c6 (000000007ffe03c6)] that determines which QPC algorithm we will use. If everything is fine (we are working on modern hardware with invariant TSC and we can directly use it), we are going to the fast algorithm (ntdll!RtlQueryPerformanceCounter+0x14). Otherwise, we are going to call ntdll!NtQueryPerformanceCounter which produces a syscall:

> uf ntdll!NtQueryPerformanceCounter
ntdll!NtQueryPerformanceCounter:
00007ffe6d8956f0  mov     r10,rcx
00007ffe6d8956f3  mov     eax,31h
00007ffe6d8956f8  test    byte ptr [SharedUserData+0x308 (000000007ffe0308)],1
00007ffe6d895700  jne     ntdll!NtQueryPerformanceCounter+0x15 (00007ffe6d895705)

ntdll!NtQueryPerformanceCounter+0x12:
00007ffe6d895702  syscall
00007ffe6d895704  ret

ntdll!NtQueryPerformanceCounter+0x15:
00007ffe6d895705  int     2Eh
00007ffe6d895707  ret

An important fact about the fast algorithm (ntdll!RtlQueryPerformanceCounter+0x14): it directly calls rdtsc without any syscalls. It allows achieving low latency for simple situations (when we really can use TSC without any troubles).

Another interesting fact: QPC use shifted value of rdtsc: after it puts full value of the counter in rdx, it performs shr rdx,cl (where cl typically equals to 0xA). Thus, one QPC tick is 1024 rdtsc ticks. We can say the same thing about QPF: nominal Windows frequency for high-precision measurements is 1024 times less than rdtsc frequency.


BenchmarkDotNet=v0.9.9.0
OS=Microsoft Windows NT 6.2.9200.0 (Windows 10 anniversary update)
Processor=Intel(R) Core(TM) i7-4702MQ CPU 2.20GHz, ProcessorCount=8
Frequency=2143473 ticks, Resolution=466.5326 ns, Timer=TSC
CLR1=CORE, Arch=64-bit ? [RyuJIT]
CLR2=MS.NET 4.0.30319.42000
CLR3=Mono JIT compiler version 4.2.3
JitModules=clrjit-v4.6.1586.0
dotnet cli version: 1.0.0-preview2-003121

TSC:

MethodRuntimeMedianStdDev
LatencyClr17.1918 ns0.2723 ns
ResolutionClr475.1302 ns7.4886 ns
LatencyCore15.6241 ns0.3809 ns
ResolutionCore467.9744 ns7.1308 ns
LatencyMono40.4701 ns0.5814 ns
ResolutionMono475.2795 ns4.4417 ns

HPET:

MethodRuntimeMedianStdDev
LatencyClr598.5198 ns78.7814 ns
LatencyCore603.3178 ns18.0963 ns
LatencyMono737.9618 ns4.5637 ns

Linux

On Linux, there are many different time functions: time(), clock(), clock_gettime(), getrusage(), gettimeofday(), mach_absolute_time() (you can find a nice overview here). Both CoreCLR and Mono uses clock_getttime as a primary way (with fallbacks to mach_absolute_time and gettimeofday) because it’s the best way to get high-precision time stamp. This function has the following signature:

int clock_gettime(clockid_t clk_id, struct timespec *tp);

Here clockid_t is ID of the target clock. For high precision timestamping, we should use CLOCK_MONOTONIC (if this option is available on current hardware) but there are other clock option (like CLOCK_REALTIME for real-time clock or CLOCK_THREAD_CPUTIME_ID for thread-specific CPU-time clock). The target timestamp will be returned as a timespec:

struct timespec {
    time_t tv_sec; /* seconds */
    long tv_nsec;  /* nanoseconds */ };

A usage example:

struct timespec ts;
uint64_t timestamp;
clock_gettime(CLOCK_MONOTONIC, &ts);
timestamp = (static_cast<uint64_t>(ts.tv_sec) * 1000000000) + static_cast<uint64_t>(ts.tv_nsec);

Thus, minimal possible resolution of clock_gettime is 1 ns. Internally, clock_gettime(CLOCK_MONOTONIC, ...) is based on the current high precision hardware timer (usually TSC, but it can also be HPET or ACPI_PM).

To reduce clock_gettime latency, Linux kernel uses the vsyscalls (virtual system calls) and VDSOs (Virtual Dynamically linked Shared Objects) instead of direct syscall (you can find some useful implementation details here and here).

If Invariant TSC is available, clock_gettime(CLOCK_MONOTONIC, ...) will use it directly via the rdtsc instruction. Of course, it adds some overhead, but in general case, you should use clock_gettime instead of rdtsc because it solves a lot of portability problems. For example, see this nice commit: x86: tsc prevent time going backward.


Xubuntu 16.04.01, the same hardware:

BenchmarkDotNet=v0.9.9.0
OS=Unix 4.4.0.34
Processor=Intel(R) Core(TM) i7-4702MQ CPU 2.20GHz, ProcessorCount=8
CLR1=CORE, Arch=64-bit ? [RyuJIT]
CLR2=Mono 4.4.2 (Stable 4.4.2.11/f72fe45 Fri Jul 29 09:58:49 UTC 2016), Arch=64-bit RELEASE
dotnet cli version: 1.0.0-preview2-003121

TSC:

MethodRuntimeMedianStdDev
LatencyCore31.9727 ns1.8880 ns
LatencyMono22.4036 ns1.3299 ns
ResolutionMono97.6713 ns0.0971 ns

ACPI_PM:

MethodRuntimeMedianStdDev
LatencyCore683.2088 ns16.5555 ns
LatencyMono653.4289 ns1.5812 ns

HPET:

MethodRuntimeMedianStdDev
LatencyCore536.0179 ns7.4501 ns
LatencyMono526.4265 ns1.8113 ns

Full .NET Framework

Stopwatch on the Full .NET Framework simply uses QPC/QPF, let’s look at the source code.

Stopwatch.cs

public class Stopwatch
{
    private const long TicksPerMillisecond = 10000;
    private const long TicksPerSecond = TicksPerMillisecond * 1000;

    // "Frequency" stores the frequency of the high-resolution performance counter, 
    // if one exists. Otherwise it will store TicksPerSecond. 
    // The frequency cannot change while the system is running,
    // so we only need to initialize it once. 
    public static readonly long Frequency;
    public static readonly bool IsHighResolution;

    // performance-counter frequency, in counts per ticks.
    // This can speed up conversion from high frequency performance-counter 
    // to ticks. 
    private static readonly double tickFrequency;

    static Stopwatch()
    {
        bool succeeded = SafeNativeMethods.QueryPerformanceFrequency(out Frequency);
        if (!succeeded)
        {
            IsHighResolution = false;
            Frequency = TicksPerSecond;
            tickFrequency = 1;
        }
        else
        {
            IsHighResolution = true;
            tickFrequency = TicksPerSecond;
            tickFrequency /= Frequency;
        }
    }

    public static long GetTimestamp()
    {
        if (IsHighResolution)
        {
            long timestamp = 0;
            SafeNativeMethods.QueryPerformanceCounter(out timestamp);
            return timestamp;
        }
        else
        {
            return DateTime.UtcNow.Ticks;
        }
    }
}

SafeNativeMethods.cs

[DllImport(ExternDll.Kernel32)]
[ResourceExposure(ResourceScope.None)]
public static extern bool QueryPerformanceCounter(out long value);

[DllImport(ExternDll.Kernel32)]
[ResourceExposure(ResourceScope.None)]
public static extern bool QueryPerformanceFrequency(out long value);

What do you think, is it possible to get a negative elapsed time interval using Stopwatch? There was a bug until .NET 3.5 because of which you can observe such situation. The bug was fixed in .NET 4.0 (originally with #if NET_4_0..#endif; without these directives since .NET 4.5). Here is a comment with explanation:

if (_elapsed < 0)
{
    // When measuring small time periods the StopWatch.Elapsed* 
    // properties can return negative values.  This is due to 
    // bugs in the basic input/output system (BIOS) or the hardware
    // abstraction layer (HAL) on machines with variable-speed CPUs
    // (e.g. Intel SpeedStep).

    _elapsed = 0;
}

CoreCLR

Basically, CoreCLR 1.0.0 contains almost the same Stopwatch.cs as in a case of the Full .NET Framework except for some minor changes in private variable names, SecuritySafeCritical attribute usages, and FEATURE_NETCORE depended on code. The main difference is the following: CoreCLR uses another declaration of QPC and QPF methods depend on the target platform (instead of using fixed SafeNativeMethods.cs implementation).

Windows

Stopwatch.Windows.cs

public partial class Stopwatch
{
    private static bool QueryPerformanceFrequency(out long value)
    {
        return Interop.mincore.QueryPerformanceFrequency(out value);
    }

    private static bool QueryPerformanceCounter(out long value)
    {
        return Interop.mincore.QueryPerformanceCounter(out value);
    }
}

Interop/Windows/mincore/Interop.QueryPerformanceCounter.cs

internal partial class mincore
{
    [DllImport(Libraries.Profile)]
    internal static extern bool QueryPerformanceCounter(out long value);
}

Interop/Windows/mincore/Interop.QueryPerformanceFrequency.cs

internal partial class mincore
{
    [DllImport(Libraries.Profile)]
    internal static extern bool QueryPerformanceFrequency(out long value);
}

Thus, on Windows, we just call usual QPC/QPF API.

Unix

Stopwatch.Unix.cs

public partial class Stopwatch
{
    private static bool QueryPerformanceFrequency(out long value)
    {
        return Interop.Sys.GetTimestampResolution(out value);
    }

    private static bool QueryPerformanceCounter(out long value)
    {
        return Interop.Sys.GetTimestamp(out value);
    }
}

Interop/Unix/System.Native/Interop.GetTimestamp.cs

internal static partial class Sys
{
    [DllImport(Libraries.SystemNative, EntryPoint = "SystemNative_GetTimestampResolution")]
    internal static extern bool GetTimestampResolution(out long resolution);

    [DllImport(Libraries.SystemNative, EntryPoint = "SystemNative_GetTimestamp")]
    internal static extern bool GetTimestamp(out long timestamp);
}

Native/System.Native/pal_time.cpp

extern "C" int32_t SystemNative_GetTimestampResolution(uint64_t* resolution)
{
    assert(resolution);

## f HAVE_CLOCK_MONOTONIC

// Make sure we can call clock_gettime with MONOTONIC.  Stopwatch invokes
    // GetTimestampResolution as the very first thing, and by calling this here
    // to verify we can successfully, we don't have to branch in GetTimestamp.
    struct timespec ts;
    if (clock_gettime(CLOCK_MONOTONIC, &ts) == 0) 
    {
        *resolution = SecondsToNanoSeconds;
        return 1;
    }
    else
    {
        *resolution = 0;
        return 0;
    }

    struct timespec ts;
    int result = clock_gettime(CLOCK_MONOTONIC, &ts);
    assert(result == 0); // only possible errors are if MONOTONIC isn't supported or &ts is an invalid address
    (void)result; // suppress unused parameter warning in release builds
    *timestamp = (static_cast<uint64_t>(ts.tv_sec) * SecondsToNanoSeconds) + static_cast<uint64_t>(ts.tv_nsec);
    return 1;

## lif HAVE_MACH_ABSOLUTE_TIME

mach_timebase_info_data_t mtid;
    if (mach_timebase_info(&mtid) == KERN_SUCCESS)
    {
        *resolution = SecondsToNanoSeconds * (static_cast<uint64_t>(mtid.denom) / static_cast<uint64_t>(mtid.numer));
        return 1;
    }
    else
    {
        *resolution = 0;
        return 0;
    }

    *timestamp = mach_absolute_time();
    return 1;

## lse /* gettimeofday */

*resolution = SecondsToMicroSeconds;
    return 1;

## ndif

}
extern "C" int32_t SystemNative_GetTimestamp(uint64_t* timestamp)
{
    assert(timestamp);

}

As you can see, primarily we are trying to use clock_gettime (in the HAVE_CLOCK_MONOTONIC mode), otherwise we are trying to use mach_absolute_time (in the HAVE_MACH_ABSOLUTE_TIME mode), otherwise we are using gettimeofday (with corresponded conversions).


if (gettimeofday (&tv, NULL) == 0)
    return ((gint64)tv.tv_sec * 1000000 + tv.tv_usec) * 10;
return 0;

}


As you can see, the algorithms Mono internally uses the same API as API in CoreCLR:
  `QPC`/`clock_gettime` as a primary way,
  `mono_100ns_datetime`/`gettimeofday` as a fallback case.
   
Is it possible to get a negative elapsed interval on Mono?
It's theoretically possible on the old version on Mono and old hardware.
This bug was [fixed](https://github.com/mono/mono/commit/dbc021772a8c0a8bf97615523c73d55cf9b376c3) by me and [merged](https://github.com/mono/mono/commit/226af94a2345f88d3170823646e1c25a276ba281) into the master (Sep 23, 2015).

---

## lse

struct timeval tv;
    if (gettimeofday(&tv, NULL) == 0)
    {
        *timestamp = (static_cast<uint64_t>(tv.tv_sec) * SecondsToMicroSeconds) + static_cast<uint64_t>(tv.tv_usec);
        return 1;
    }
    else
    {
        *timestamp = 0;
        return 0;
    }

## Mono

Let's look at an implementation of `Stopwatch` in Mono (we will work with `mono-4.4.2.11` in this post).

[`Stopwatch.cs`](https://github.com/mono/mono/blob/mono-4.4.2.11/mcs/class/System/System.Diagnostics/Stopwatch.cs)
```cs
public class Stopwatch
{
    [MethodImplAttribute(MethodImplOptions.InternalCall)]
    public static extern long GetTimestamp();

    public static readonly long Frequency = 10000000;

    public static readonly bool IsHighResolution = true;
}

As you can see, the frequency of Stopwatch in Mono is a const (10'000'000). For some historical reasons, this Stopwatch uses the same tick value (100ns) as the DateTime.

Now let’s look at the internal implementation.

icall-def.h

ICALL_TYPE(STOPWATCH, "System.Diagnostics.Stopwatch", STOPWATCH_1)
ICALL(STOPWATCH_1, "GetTimestamp", mono_100ns_ticks)

mono-time.c#L30

/* Returns the number of 100ns ticks from unspecified time: this should be monotonic */
gint64
mono_100ns_ticks (void)
{
    static LARGE_INTEGER freq;
    static UINT64 start_time;
    UINT64 cur_time;
    LARGE_INTEGER value;

    if (!freq.QuadPart) {
        if (!QueryPerformanceFrequency (&freq))
            return mono_100ns_datetime ();
        QueryPerformanceCounter (&value);
        start_time = value.QuadPart;
    }
    QueryPerformanceCounter (&value);
    cur_time = value.QuadPart;
    /* we use unsigned numbers and return the difference to avoid overflows */
    return (cur_time - start_time) * (double)MTICKS_PER_SEC / freq.QuadPart;
}

mono-time.c#L128

/* Returns the number of 100ns ticks from unspecified time: this should be monotonic */
gint64
mono_100ns_ticks (void)
{
    struct timeval tv;

## fdef CLOCK_MONOTONIC

struct timespec tspec;
    static struct timespec tspec_freq = {0};
    static int can_use_clock = 0;
    if (!tspec_freq.tv_nsec) {
        can_use_clock = clock_getres (CLOCK_MONOTONIC, &tspec_freq) == 0;
        /*printf ("resolution: %lu.%lu\n", tspec_freq.tv_sec, tspec_freq.tv_nsec);*/
    }
    if (can_use_clock) {
        if (clock_gettime (CLOCK_MONOTONIC, &tspec) == 0) {
            /*printf ("time: %lu.%lu\n", tspec.tv_sec, tspec.tv_nsec); */
            return ((gint64)tspec.tv_sec * MTICKS_PER_SEC + tspec.tv_nsec / 100);
        }
    }

## lif defined(PLATFORM_MACOSX)

/* http://developer.apple.com/library/mac/#qa/qa1398/_index.html */
    static mach_timebase_info_data_t timebase;
    guint64 now = mach_absolute_time ();
    if (timebase.denom == 0) {
        mach_timebase_info (&timebase);
        timebase.denom *= 100; /* we return 100ns ticks */
    }
    return now * timebase.numer / timebase.denom;

## Small time intervals

When you are trying to measure time of an operation, it's really important to understand order of this measurement and order of `Stopwatch` resolution and latency
  (it will be covered in details in the [Benchmarks](#benchmarks) section).
If your operations take 1 second and you both latency and resolution are less than `1 μs`, everything is fine.
But if you are trying to measure an operation which takes `10 ns` and `Stopwatch` Resolution is about `300-400ns`, you will have some problems.
Of course, you can repeat the target operation several times, but there are still many microbenchmark problems (which [BenchmarkDotNet](https://github.com/PerfDotNet/BenchmarkDotNet) trying to solve).
Be careful in such situations; it's really easy to make wrong measurements of short operations.

## Sequential reads

Let's say that we do two sequential reads of `Stopwatch.GetTimestamp()`:

```cs
var a = Stopwatch.GetTimestamp();
var b = Stopwatch.GetTimestamp();
var delta = b - a;

Can you say possible values of delta? Let’s check it with help of the following program:

const int N = 1000000;
var values = new long[N];
for (int i = 0; i < N; i++)
    values[i] = Stopwatch.GetTimestamp();
var deltas = new long[N - 1];
for (int i = 0; i < N - 1; i++)
    deltas[i] = values[i + 1] - values[i];
var table =
    from d in deltas
    group d by d into g
    orderby g.Key
    select new
    {
        Ticks = g.Key,
        Microseconds = g.Key * 1000000.0 / Stopwatch.Frequency,
        Count = g.Count()
    };
Console.WriteLine("Ticks | Time(μs) | Count   ");
Console.WriteLine("------|----------|---------");
foreach (var line in table)
{
    var ticks = line.Ticks.ToString().PadRight(5);
    var us = line.Microseconds.ToString("0.0").PadRight(8);
    var count = line.Count.ToString();
    Console.WriteLine($"{ticks} | {μs} | {count}");
}

Here is a typical output on my laptop:

Ticks | Time(μs) | Count
------|----------|---------
0     | 0.0      | 931768
1     | 0.4      | 66462
2     | 0.7      | 1155
3     | 1.1      | 319
4     | 1.5      | 59
5     | 1.8      | 21
6     | 2.2      | 46
7     | 2.6      | 39
8     | 2.9      | 31
9     | 3.3      | 10
10    | 3.7      | 6
11    | 4.0      | 3
13    | 4.8      | 2
15    | 5.5      | 3
16    | 5.9      | 1
18    | 6.6      | 1
19    | 7.0      | 2
20    | 7.3      | 1
22    | 8.1      | 1
23    | 8.4      | 4
24    | 8.8      | 8
25    | 9.2      | 12
27    | 9.9      | 13
28    | 10.3     | 10
30    | 11.0     | 6
31    | 11.4     | 1
34    | 12.5     | 4
35    | 12.8     | 1
36    | 13.2     | 1
44    | 16.1     | 1
47    | 17.2     | 1
56    | 20.5     | 1
66    | 24.2     | 1
68    | 24.9     | 1
71    | 26.0     | 1
123   | 45.1     | 1
314   | 115.1    | 1
2412  | 884.1    | 1

As you can see, once I had a delta between two sequential GetTimestamp which equals to 2412 ticks or 0.8ms (my current Stopwatch.Frequency = 2728068)! This fact is not obvious for most developers. There are two popular kinds of mistakes here.

  • Handwritten benchmarks. Here is a popular pattern for time measurements:
var sw = Stopwatch.StartNew();
// Target method
sw.Stop();
var time = sw.Elapsed;

This code usually works fine for target method that takes seconds and works awfully for method that takes milliseconds. Let’s write another check program:

var maxTime = 0.0;
for (int i = 0; i < 10000000; i++)
{
    var sw = Stopwatch.StartNew();
    sw.Stop();
    maxTime = Math.Max(maxTime, sw.Elapsed.TotalMilliseconds);
}
Console.WriteLine(maxTime + " ms");

Usually, I get maxTime is about 1ms but sometimes it equals to 6–7ms. But we even don’t have any target method here; we are trying to measure nothing! Of course, it is a rare situation. Usually, you get plausible measurements. But you can never be sure! Besides, it is methodologically wrong. Please, don’t write such benchmarks.

  • Two GetTimestamp in an expression.

Can you say where is a bug in the following expression?

  stopwatch.ElapsedMilliseconds > timeout ? 0 : timeout - (int)stopwatch.ElapsedMilliseconds

The answer: we can’t be sure that two invocations of stopwatch.ElapsedMilliseconds will return the same value. For example, let’s say that timeout equals to 100. We are trying to evaluate stopwatch.ElapsedMilliseconds > timeout, stopwatch.ElapsedMilliseconds returns 99, the expression value is false. Next, we are going to evaluate timeout - (int)stopwatch.ElapsedMilliseconds. But we have another stopwatch.ElapsedMilliseconds here! Let’s say, it returns 101. Then, result value will be equal to -1! Probably, the author of this code did not expect negative values here.

You can say that nobody writes such code. Or that such code never is a cause of a bug in the real application. But it can. This expression was taken from the AsyncIO library (this code was already fixed). Once I really had a bug because this value was negative! By the way, such kind of bugs is really hard to reproduce. So, just don’t write code like this.

Reciprocal frequency and actual resolution

A lot of developers think that these values are equal. But they can be not equal in the general case. In fact, we have only one guarantee here: actual resolution ≥ reciprocal frequency because we can’t achieve resolution better than one tick. However, there are a lot of cases when one tick defined by Stopwatch.Frequency does not correspond to actual resolution (for example, on Mono or on Hypervisors). But what if we really want to know the actual resolution? Unfortunately, there is no API for that. But we can calculate an approximation of this value. For example, if we take the table from the Sequential reads section and remove the first line (where ticks=0), we get an observable resolution distribution for an interval of time. We can calculate a minimal observable resolution and average/median observable resolution. But be careful: keep in mind that it’s just observable values for a single series of measurements. You can get a new distribution per each run. You can not trust these values, but sometimes you can use them for some resolution approximation (may be useful in some applications).


Benchmarks

Let’s write the following benchmark with the help of BenchmarkDotNet:

[ClrJob, CoreJob, MonoJob]
public class StopwatchBenchmarks
{
    [Benchmark]
    public long Latency() => Stopwatch.GetTimestamp();

    [Benchmark]
    public long Resolution()
    {
        long lastTimestamp = Stopwatch.GetTimestamp();
        while (Stopwatch.GetTimestamp() == lastTimestamp)
        {
        }
        return lastTimestamp;
    }
}

Be careful: the Resolution methods does not produce correct value of Resolution when Latency is much bigger than actual resolution (see the Interpretation subsection). We will omit results for such situations.

Interpretation

Ok, we got a lot of interesting numbers, but we should interpret it in the right way. The latency benchmark looks great; it produced a nice approximation of real timestamp latency.

But there are some troubles with the resolution benchmark because it’s hard to design correct microbenchmark in this situation.

It works well only for cases where Latency is much smaller than resolution: for example, it produces believable numbers for Windows+TSC. In this case we get Resolution ≈ (1 second / Stopwatch.Frequency) ≈ (1 second / (rdstc Frequency / 1024)).

In the case of HPET/ACPI_PM benchmarks show that Resolution ≈ 2 x Latency because we call Stopwatch.GetTimestamp at least twice per the Resolution method invocation. It’s hard to say something about real resolution because the value of HPET/ACPI_PM ticks is much smaller than the latency. For practical use, you can assume that the Resolution has the same order as Latency.

Now let’s consider the Linux+TSC case. On Mono, we have Resolution = 100 ns because it is the value of 1 tick (and it can be achieved). On CoreCLR, 1 ticks is 1 ns, and it use rdtsc which works on frequency = ≈2.20GHz. Thus, we have a situation which is similar to the HPET/ACPI_PM case: latency is much bigger than resolution. So, it’s hard to evaluate it via a microbenchmark. Let’s use the program from the Pitfalls section (where we played with sequential reads). We will use ns instead of μs and run it on Linux+TSC+CoreCLR. Result on my laptop (the middle part was cut):

Ticks | Time(ns) | Count   
------|----------|---------
22    | 22.0     | 2
23    | 23.0     | 485
24    | 24.0     | 1919
25    | 25.0     | 7516
26    | 26.0     | 7683
27    | 27.0     | 825
28    | 28.0     | 8
30    | 30.0     | 2
31    | 31.0     | 2296
32    | 32.0     | 20916
33    | 33.0     | 52925
34    | 34.0     | 172174
35    | 35.0     | 110923
36    | 36.0     | 100979
37    | 37.0     | 322173
38    | 38.0     | 159638
39    | 39.0     | 9510
40    | 40.0     | 2270
41    | 41.0     | 808
42    | 42.0     | 993
43    | 43.0     | 2154
44    | 44.0     | 1114
45    | 45.0     | 2551
46    | 46.0     | 3234
47    | 47.0     | 5828
48    | 48.0     | 4342
49    | 49.0     | 3695
50    | 50.0     | 162
...
807   | 807.0    | 5
809   | 809.0    | 1
810   | 810.0    | 4
811   | 811.0    | 8
812   | 812.0    | 19
813   | 813.0    | 22
814   | 814.0    | 17
815   | 815.0    | 22
816   | 816.0    | 10
817   | 817.0    | 17
818   | 818.0    | 20
819   | 819.0    | 20
820   | 820.0    | 9
821   | 821.0    | 29
822   | 822.0    | 23
823   | 823.0    | 31
824   | 824.0    | 22
825   | 825.0    | 14
826   | 826.0    | 19
827   | 827.0    | 16
828   | 828.0    | 20
829   | 829.0    | 27
830   | 830.0    | 29
831   | 831.0    | 29
832   | 832.0    | 37
833   | 833.0    | 21
834   | 834.0    | 10
835   | 835.0    | 7
836   | 836.0    | 11
837   | 837.0    | 8
838   | 838.0    | 10
839   | 839.0    | 4
840   | 840.0    | 11
841   | 841.0    | 6
842   | 842.0    | 7
843   | 843.0    | 7
844   | 844.0    | 5
845   | 845.0    | 4
846   | 846.0    | 3
...
32808 | 32808.0  | 1
110366 | 110366.0 | 1
112423 | 112423.0 | 1

The minimal resolution in this experiment is 22 ns but it was achieved only twice (N = 1000000). Once we had two sequential timestamp invocation with difference in 112423 ticks (or ≈112 μs). So, we can’t say that there is a specific value of resolution in this case, but we can say that the resolution has the same order as latency.


Summary

This was a brief overview of the Stopwatch class. It is hard to cover all the details of Stopwatch behavior because there is a lot of different combinations of runtimes / operation systems / hardware. When we are using Stopwatch, usually we care about its latency and resolution because these values determine what we can measure with Stopwatch and what we can’t. There are some possible configurations:

RuntimeOSHardware Timer1 tickLatencyResolution
Full .NETWindowsTSC300-400ns15-18ns300-400ns
Full .NETWindowsHPET69.8ns0.5-0.8us≈Latency
Full .NETWindowsNA100ns7-10ns0.5-55ms
MonoWindowsTSC100ns35-45ns300-400ns
MonoWindowsHPET100ns0.5-0.8us≈Latency
MonoWindowsNA100ns30-40ns0.5-55ms
CoreCLRLinuxTSC1ns30-35ns≈Latency
CoreCLRLinuxHPET/ACPI_PM1ns0.5-0.8us≈Latency
MonoLinuxTSC100ns20-25ns100ns
MonoLinuxHPET/ACPI_PM100ns0.5-0.8us≈Latency
  • Here Hardware Timer = NA means that there is no high-precision timer in the system, i.e. Stopwatch.IsHighResolution = false.
  • 1 tick is calculated based on nominal frequency (1 second / Stopwatch.Frequency).
  • Resolution = ≈Latency means that the actual Resolution is less than the Latency, the observed Resolution has the same order as the Latency.

Note that it’s only examples, you can get different values in different situations. This tables just shows that sometimes it’s not easy to guess target latency and resolution, there are a lot of things which can affect these values. Anyway, you should be careful, if you want to measure time intervals. Especially, if you want to measure small time intervals: it’s really easy to make a mistake and get wrong results. If you want to know more about high-resolution time measurements, there are a lot of links in the next section.

As you can see, the topic is very hard because there are a lot of different environments in the modern world. If I missed something important or wrote something wrong, feel free to leave any feedback.