Time Stamp Counter (TSC)

TSC is a common abbreviation for Time Stamp Counter. It is an internal 64-bit register that has been present on all x86 processors since the Pentium. The TSC is an independent counter and cannot be affected by changes in the current system time. It provides a timestamp value expressed in ticks, which is supposed to increase monotonically. The TSC frequency depends on the CPU model but is typically close to the nominal CPU frequency.

The value of the Time Stamp Counter can be read into EDX:EAX registers using the RDTSC instruction. The opcode for this instruction is 0F 31 (intel-manual, Vol. 2B 4-545). C# and other .NET languages are high-level, so we are not typically working directly with assembly opcodes (because we have the powerful base class library which contains managed wrappers for all useful functions). However, if you want to do it, there are some special tricks. For a better understanding of internals, we will learn how to get the value of the time stamp counter without standard .NET classes. On Windows, it can be read directly from C# code with the help of the following assembly injection:

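using System;
using System.ComponentModel;
using System.Runtime.InteropServices;

static class Program
{
  const uint PAGE_EXECUTE_READWRITE = 0x40;
  const uint MEM_COMMIT = 0x1000;

  [DllImport("kernel32.dll", SetLastError = true)]
  static extern IntPtr VirtualAlloc(IntPtr lpAddress,
                                    UIntPtr dwSize, // SIZE_T is pointer-sized
                                    uint flAllocationType,
                                    uint flProtect);

  // Copies machine code into a freshly allocated executable memory block.
  static IntPtr Alloc(byte[] asm)
  {
    var ptr = VirtualAlloc(IntPtr.Zero,
                           (UIntPtr)(uint)asm.Length,
                           MEM_COMMIT,
                           PAGE_EXECUTE_READWRITE);
    if (ptr == IntPtr.Zero)
      throw new Win32Exception(Marshal.GetLastWin32Error());
    Marshal.Copy(asm, 0, ptr, asm.Length);
    return ptr;
  }

  delegate long RdtscDelegate();

  // In a 32-bit process, a long is returned in EDX:EAX, which is where
  // RDTSC puts its result. In a 64-bit process, only RAX is returned,
  // so this stub yields just the low 32 bits of the counter.
  static readonly byte[] rdtscAsm =
  {
    0x0F, 0x31, // RDTSC
    0xC3        // RET
  };

  static void Main()
  {
    var rdtsc = Marshal
      .GetDelegateForFunctionPointer<RdtscDelegate>(Alloc(rdtscAsm));
    Console.WriteLine(rdtsc());
  }
}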

Let’s discuss this code in detail.

  • For an assembly injection, we need the VirtualAlloc function from kernel32.dll. This function helps us to manually allocate memory in the virtual address space of the current process.
  • The Alloc function takes a byte array with assembly instruction opcodes, allocates memory with the help of VirtualAlloc, copies the opcodes there, and returns a pointer to the allocated and filled memory chunk. The penultimate argument of VirtualAlloc (flAllocationType) specifies the type of allocation: MEM_COMMIT commits the pages, which makes the allocated address range actually usable. The last argument of VirtualAlloc (flProtect) specifies the memory protection mode: PAGE_EXECUTE_READWRITE means that the pages can be read, written, and executed, so we can copy the opcodes there and then run them directly from the allocated pages.
  • We define a signature for the new managed rdtsc function via RdtscDelegate (it doesn’t have any arguments and returns a long value).
  • The rdtscAsm array contains all the target assembly opcodes: 0F 31 for RDTSC and C3 for RET.
  • The Main method uses Marshal.GetDelegateForFunctionPointer to convert an unmanaged function pointer to a delegate. The generic overload is supported only in the .NET Framework 4.5.1 and later versions. The argument of this method is Alloc(rdtscAsm): here we take the byte array with the assembly opcodes and transform it into an IntPtr that points to a piece of memory with these opcodes.
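One caveat about this stub: in a 64-bit process, the long result of the delegate is taken from RAX alone, while RDTSC leaves the upper half of the counter in EDX. A minimal sketch of a bitness-aware stub is shown below. The RdtscStub class and its Build method are hypothetical names introduced for illustration; the resulting bytes are meant to be passed to the Alloc helper from the listing.

```csharp
using System;

static class RdtscStub
{
  // 32-bit process: a long is returned in EDX:EAX, which is exactly
  // where RDTSC leaves its result.
  static readonly byte[] X86 =
  {
    0x0F, 0x31, // RDTSC
    0xC3        // RET
  };

  // 64-bit process: a long is returned in RAX only, so the two halves
  // have to be merged before returning.
  static readonly byte[] X64 =
  {
    0x0F, 0x31,             // RDTSC        (EDX:EAX <- TSC)
    0x48, 0xC1, 0xE2, 0x20, // SHL RDX, 32
    0x48, 0x09, 0xD0,       // OR  RAX, RDX
    0xC3                    // RET
  };

  // Returns a copy of the stub that matches the bitness of the process.
  public static byte[] Build(bool is64BitProcess) =>
    (byte[])(is64BitProcess ? X64 : X86).Clone();
}
```

With this helper, the call in Main would become Alloc(RdtscStub.Build(Environment.Is64BitProcess)).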

Such an approach allows calling RDTSC from managed code. Usually, it’s not a good idea because the time stamp counter has many problems that can spoil your measurements (many of them will be covered soon). Operating systems have special APIs that provide high-precision timestamps without assembly injection or direct knowledge of the TSC. These APIs protect you from the problems that a direct RDTSC call can cause. However, the described assembly injection can sometimes be useful for research and diagnostics.

If you want to read the TSC value directly via the RDTSC instruction, you should know that the processor can reorder your instructions and spoil your measurements. From intel-manual, Vol. 3B 17-41, section 17.15:

The RDTSC instruction is not serializing or ordered with other instructions. It does not necessarily wait until all previous instructions have been executed before reading the counter. Similarly, subsequent instructions may begin execution before the RDTSC instruction operation is performed.

We can find a classic way to solve this problem in agner-optimizing-assembly (Section 18.1):

On all processors with out-of-order execution, you have to insert XOR EAX,EAX/CPUID before and after each read of the counter to prevent it from executing in parallel with anything else. CPUID is a serializing instruction, which means that it flushes the pipeline and waits for all pending operations to finish before proceeding. This is very useful for testing purposes.

In agner-optimizing-cpp (Section 16 “Testing speed”), you can find a C++ example of a direct RDTSC call that is serialized via CPUID.

There is another timestamping native instruction that prevents instruction reordering: RDTSCP. It also reads the TSC value, but it waits until all previous instructions have been executed before reading the counter. From intel-manual, Vol. 2B 4-545:

If software requires RDTSC to be executed only after all previous instructions have completed locally, it can either use RDTSCP (if the processor supports that instruction) or execute the sequence LFENCE;RDTSC.

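You can use RDTSCP instead of RDTSC so that the counter is not read before the preceding instructions have executed. Note that subsequent instructions may still begin execution before the read is performed. In addition to reading the TSC, RDTSCP also reads the processor ID, but you don’t need it for time measurements.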
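Both ordered variants from the quote can be expressed as byte sequences for the assembly injection approach shown earlier. The following is a sketch for a 64-bit process only; OrderedTscStub and Build are hypothetical names, and the SHL/OR tail is there because a 64-bit return value is taken from RAX, while the counter is delivered in EDX:EAX.

```csharp
using System;
using System.Collections.Generic;

static class OrderedTscStub
{
  // Builds x64 machine code that reads the TSC only after all previous
  // instructions have completed locally.
  public static byte[] Build(bool useRdtscp)
  {
    var code = new List<byte>();
    if (useRdtscp)
    {
      code.AddRange(new byte[] { 0x0F, 0x01, 0xF9 }); // RDTSCP (also writes ECX)
    }
    else
    {
      code.AddRange(new byte[]
      {
        0x0F, 0xAE, 0xE8, // LFENCE
        0x0F, 0x31        // RDTSC
      });
    }
    code.AddRange(new byte[]
    {
      0x48, 0xC1, 0xE2, 0x20, // SHL RDX, 32
      0x48, 0x09, 0xD0,       // OR  RAX, RDX
      0xC3                    // RET
    });
    return code.ToArray();
  }
}
```

RDTSCP overwrites ECX, but RCX is a volatile register in the Windows x64 calling convention, so the stub does not have to preserve it. Before using the RDTSCP variant, make sure that the processor supports this instruction.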

Now let’s talk about the RDTSC access time. The table below lists the reciprocal RDTSC throughput (in CPU clock cycles) for different processors (the data is taken from agner-instruction-tables).

Processor name                            Reciprocal throughput
AMD K7                                    11
AMD K8                                    7
AMD K10                                   67
AMD Bulldozer                             42
AMD Piledriver                            42
AMD Steamroller                           78
AMD Bobcat                                87
AMD Jaguar                                41
Intel Pentium M, Core Solo, Core Duo      42
Intel Pentium 4                           80
Intel Pentium 4 w. EM64T (Prescott)       100
Intel Core 2 (Merom)                      64
Intel Core 2 (Wolfdale)                   32
Intel Nehalem                             24
Intel Sandy Bridge                        28
Intel Ivy Bridge                          27
Intel Haswell                             24
Intel Broadwell                           24
Intel Skylake                             25
Intel SkylakeX                            25

How can we interpret these numbers? Let’s say that we have an Intel Haswell processor (the reciprocal throughput is 24) with a fixed CPU frequency of 2.2 GHz. Then 1 CPU clock cycle takes about 0.45 ns (this is our resolution), and an RDTSC invocation takes approximately $24 \times 0.45\textrm{ns} \approx 10.8\textrm{ns}$ (for RDTSC, we can assume that the latency is approximately equal to the reciprocal throughput).
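This arithmetic can be wrapped in a small helper. The sketch below assumes a fixed CPU frequency; TscMath and CyclesToNanoseconds are hypothetical names introduced for illustration. Without rounding the cycle time to 0.45 ns first, the result for 24 cycles at 2.2 GHz is about 10.9 ns.

```csharp
using System;

static class TscMath
{
  // Duration of the given number of clock cycles at a fixed frequency:
  // one cycle takes 1 / frequencyHz seconds, i.e., 1e9 / frequencyHz ns.
  public static double CyclesToNanoseconds(double cycles, double frequencyHz) =>
    cycles / frequencyHz * 1e9;
}
```

For example, TscMath.CyclesToNanoseconds(24, 2.2e9) corresponds to the Haswell case discussed above, and TscMath.CyclesToNanoseconds(1, 2.2e9) gives the duration of a single cycle.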

You can also evaluate the throughput of RDTSC on your machine. Download testp.zip from the Agner Fog site, build it, and run misc_int.sh. Here are typical results for Intel Haswell:

rdtsc Throughput

Processor 0
Clock Core cyc Instruct Uops uop p0 uop p1 uop p2
 1686     2384      100 1500    255    399      0
 1686     2384      100 1500    255    399      0
 1686     2384      100 1500    255    399      0
 1686     2384      100 1500    254    399      0
 1686     2384      100 1500    255    399      0
 1686     2384      100 1500    255    399      0
 1686     2384      100 1500    255    399      0
 1686     2384      100 1500    254    399      0
 1686     2384      100 1500    255    399      0
 1686     2384      100 1500    255    399      0

Here we have 2384 CPU cycles per 100 RDTSC instructions which means approximately 24 cycles per instruction.

On modern hardware and operating systems, TSC works very well, but it has a long history (see stackoverflow-19941588), and people often consider TSC an unreliable source of timestamps. Let’s discuss different generations of TSC and the problems that we could get with it (you can find more information about it in intel-manual, Vol. 3B 17-40, section 17.16).

Generation 1: Variant TSC

The first version of TSC (see the list of the processor’s families in intel-manual, Vol. 3B 17-40, section 17.16) was very simple: it just counted internal processor clock cycles. It’s not a good way to measure time on modern hardware because the processor can dynamically change its own frequency (e.g., the SpeedStep and Turbo Boost technologies by Intel).

There is another problem: each processor core has its own TSC, and these TSCs are not synchronized. If a thread starts a measurement on one core and ends on another core, the obtained result can’t be reliable. For example, there is a nice bug report on support.microsoft.com (see [@MSSupport895980]); the author had the following output for the ping command:

C:\>ping x.x.x.x

Pinging x.x.x.x with 32 bytes of data:

Reply from x.x.x.x: bytes=32 time=-59ms TTL=128
Reply from x.x.x.x: bytes=32 time=-59ms TTL=128
Reply from x.x.x.x: bytes=32 time=-59ms TTL=128
Reply from x.x.x.x: bytes=32 time=-59ms TTL=128

The cause:

This problem occurs when the computer has the AMD Cool’n’Quiet technology (AMD dual cores) enabled in the BIOS or some Intel multi core processors. Multi core or multiprocessor systems may encounter Time Stamp Counter (TSC) drift when the time between different cores is not synchronized. The operating systems which use TSC as a timekeeping resource may experience the issue.
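To see how unsynchronized counters produce a negative interval, consider the following toy model. It is purely illustrative: the TscDriftDemo class, its methods, and the offsets are hypothetical and do not read any real hardware counter.

```csharp
using System;

static class TscDriftDemo
{
  // Simulated per-core counter: a shared "true" tick count plus a
  // per-core offset that models unsynchronized TSCs.
  public static long ReadCore(long trueTicks, long coreOffset) =>
    trueTicks + coreOffset;

  // Elapsed ticks as observed by a thread that starts a measurement
  // on one core and finishes it on another one.
  public static long Elapsed(long startTicks, long startCoreOffset,
                             long endTicks, long endCoreOffset) =>
    ReadCore(endTicks, endCoreOffset) - ReadCore(startTicks, startCoreOffset);
}
```

If 500 ticks really pass but the second core lags behind the first one by 2000 ticks, TscDriftDemo.Elapsed(1000, 0, 1500, -2000) evaluates to -1500: the measured interval is negative, just like in the ping output above.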

If you want to use TSC on old hardware/software, it’s a good idea to set the processor affinity of your thread or process. If you are working with native code, you can do it via SetThreadAffinityMask on Windows or sched_setaffinity on Linux. In managed C# code, you can use the ProcessorAffinity property of the process like this:

IntPtr affinityMask = (IntPtr) 0x0002; // Second core only
Process.GetCurrentProcess().ProcessorAffinity = affinityMask;

Fortunately, we don’t have these problems on modern hardware because the TSC internals were significantly improved.

Generation 2: Constant TSC

Constant TSC is the next generation of TSC, which solves the dynamic frequency problem: this kind of TSC increments at a constant rate. It’s a good step forward, but Constant TSC still has some issues (e.g., it can stop when the CPU enters a deep C-state, read more in [@Kidd2014]). These problems were solved in the next reincarnation of TSC.

Generation 3: Invariant TSC

Invariant TSC is the latest version of the counter which works well. A quote from intel-manual:

The invariant TSC will run at a constant rate in all ACPI P-, C-. and T-states. This is the architectural behavior moving forward. On processors with invariant TSC support, the OS may use the TSC for wall clock timer services (instead of ACPI or HPET timers).

You can check which kind of TSC you have with the help of the CPUID instruction. Fortunately, you don’t have to write another assembly injection for that because there are existing tools that can detect the TSC kind. On Windows, you can check it via the Coreinfo utility (a part of the Sysinternals Suite).

Here is a partial output example with TSC-specific lines:

Coreinfo v3.31 - Dump information on system CPU and memory topology
Copyright (C) 2008-2014 Mark Russinovich
Sysinternals - www.sysinternals.com
Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz
Intel64 Family 6 Model 94 Stepping 3, GenuineIntel
RDTSCP          *       Supports RDTSCP instruction
TSC             *       Supports RDTSC instruction
TSC-INVARIANT   *       TSC runs at constant rate

It tells us that both RDTSC and RDTSCP are supported and the invariant TSC is available. You can do the same thing on Linux with the following command:

$ cat /proc/cpuinfo | tr ' ' '\n' | sort -u | grep -i "tsc"

If RDTSC, RDTSCP, and the invariant TSC are available, you should have the following lines in the output:

constant_tsc
nonstop_tsc
rdtscp
tsc

The invariant TSC is indicated by a combination of the constant_tsc flag (the counter ticks at a constant rate regardless of CPU frequency changes) and the nonstop_tsc flag (the counter does not stop in deep C-states).
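The same check can be done from C# by inspecting the flags line of /proc/cpuinfo. This is a minimal sketch; the CpuInfoFlags class and the HasInvariantTsc method are hypothetical names, and the method expects a single line of text in the "flags : ..." format.

```csharp
using System;
using System.Linq;

static class CpuInfoFlags
{
  // Returns true when the given "flags" line from /proc/cpuinfo
  // contains both constant_tsc and nonstop_tsc as separate tokens.
  public static bool HasInvariantTsc(string flagsLine)
  {
    var flags = flagsLine.Split(
      new[] { ' ', '\t', ':' }, StringSplitOptions.RemoveEmptyEntries);
    return flags.Contains("constant_tsc") && flags.Contains("nonstop_tsc");
  }
}
```

On Linux, the input line could be obtained with File.ReadLines("/proc/cpuinfo").First(line => line.StartsWith("flags")).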

In most cases, you can trust Invariant TSC and use it as a wall clock timer for high-precision measurements. In rare cases, you can still have some problems (like synchronization problems on large multi-processor systems), but you typically shouldn’t worry about it. Nowadays, Invariant TSC is a very popular kind of TSC: you can find it in most modern Intel processors.
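If you would rather query the CPUID bits from managed code than rely on external tools, newer .NET versions expose CPUID without any assembly injection. The sketch below assumes .NET 5 or later, where X86Base.CpuId is available; the TscFeatures class and its members are hypothetical names. It checks CPUID.80000007H:EDX[8] (invariant TSC) and CPUID.80000001H:EDX[27] (RDTSCP).

```csharp
using System;
using System.Runtime.Intrinsics.X86;

static class TscFeatures
{
  const uint ExtendedBase = 0x80000000;

  public static bool IsBitSet(int register, int bit) =>
    (register & (1 << bit)) != 0;

  // Reads EDX of an extended CPUID leaf. Returns 0 when CPUID is not
  // available (e.g., on non-x86 hardware) or the leaf is not supported.
  static int ReadExtendedEdx(uint leaf)
  {
    if (!X86Base.IsSupported) return 0;
    // CPUID.80000000H:EAX holds the highest supported extended leaf.
    uint maxLeaf = unchecked(
      (uint)X86Base.CpuId(unchecked((int)ExtendedBase), 0).Eax);
    if (maxLeaf < leaf) return 0;
    return X86Base.CpuId(unchecked((int)leaf), 0).Edx;
  }

  public static bool HasInvariantTsc() =>
    IsBitSet(ReadExtendedEdx(0x80000007), 8);

  public static bool HasRdtscp() =>
    IsBitSet(ReadExtendedEdx(0x80000001), 27);
}
```

Calling TscFeatures.HasInvariantTsc() and TscFeatures.HasRdtscp() should agree with the TSC-INVARIANT and RDTSCP lines of Coreinfo on the same machine.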

Now we know some basic information about the different generations of TSC, the assembly instructions for reading the counter value, how to call them from managed C# code, and what kinds of problems we can have with the TSC. But there are also other hardware timers.