I will give you a non-answer. Sorry, but as far as I know there is no good answer to this. RDTSC will only work on certain CPUs, under very specific conditions, and it returns values whose interpretation is somewhere between hard and impossible without the help of the operating system. I therefore suspect that no one has bothered to implement support for it in portable compilers or libraries (that is, everything other than the Intel compiler).
Here's the long story:
The RDTSC instruction has had a long history of semantic changes that are very hard to keep track of in an application. On older Intel and AMD CPUs the TSC only counted internal cycles, which meant that with variable frequency (power saving modes, etc.) the tick rate could change without any notification to the application. The frequency could have changed multiple times between two timestamps, and you had no way of knowing that this had happened.
Some CPU or BIOS versions could suspend the TSC while in system management mode, while others didn't. The first behavior meant that the TSC was useless for wall-clock time; the second meant that it was useless for benchmarking. The last time I looked at this, there was no way of detecting which behavior you had other than comparing against a different clock and looking for large jumps.
Some CPUs didn't keep the TSC and/or its frequency synchronized between multiple CPUs in the system. This means that if the operating system moves your process between CPUs, the TSC value you read is in the best case totally useless and in the worst case subtly misleading.
The recent trend, and the stability promise, has been to have a synchronized timer with a synchronized, static frequency (which you can't really achieve because the clocks are sensitive to temperature, but that's another story). So we could finally use RDTSC reliably.
But then Intel threw us another curveball by suddenly deciding that RDTSC is no longer a serializing instruction (most likely not a conscious decision, probably just a mistake that Intel is getting away with by saying "it was never documented to be serializing"). This means that if you read the timer twice in your code, the second value can be lower than the first. Or, even worse, most of the code you're benchmarking may not actually have run yet when the second read happens. The new RDTSCP instruction "solves" this problem, but you need to figure out which CPUs actually implement it, which ones have an RDTSC reliable enough to use, and on which ones you just have to give up and use a better time source.
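To give an idea of what "figuring out which CPUs implement it" looks like, here is a minimal sketch, assuming GCC or Clang with their `<cpuid.h>` helper. It checks the CPUID bits that report RDTSCP support (leaf 0x80000001, EDX bit 27) and the invariant TSC (leaf 0x80000007, EDX bit 8). The function names are made up for this example, and a set bit only tells you what the CPU advertises; it does not prove that the TSCs are actually synchronized on your particular machine.

```c
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>   /* GCC/Clang wrapper around the CPUID instruction */
#endif

/* Returns 1 if the requested EDX bit of the given CPUID leaf is set,
 * 0 if it is clear, and -1 if it cannot be determined (non-x86 build,
 * or the CPU does not support that leaf). Hypothetical helper. */
static int cpuid_edx_bit(unsigned int leaf, unsigned int bit)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(leaf, &eax, &ebx, &ecx, &edx))
        return -1;                      /* leaf not supported */
    return (int)((edx >> bit) & 1u);
#else
    (void)leaf;
    (void)bit;
    return -1;                          /* no CPUID on this architecture */
#endif
}

/* CPUID.80000001H:EDX[27] reports the RDTSCP instruction. */
int has_rdtscp(void)
{
    return cpuid_edx_bit(0x80000001u, 27);
}

/* CPUID.80000007H:EDX[8] reports an invariant (constant-rate) TSC. */
int has_invariant_tsc(void)
{
    return cpuid_edx_bit(0x80000007u, 8);
}
```

If either function returns anything other than 1, the safe choice is to skip the TSC entirely and fall back to an operating system clock.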
To add to this, you don't know whether your code is actually running between two calls to RDTSC or whether you've been context switched out. I would therefore suggest sticking to the timing facilities that your operating system provides and measuring the time that your process is running. Those timing facilities are slower, but the operating system has most likely solved all these problems much better than you'll ever be able to. As a bonus, if you're using NTP or some other time synchronization mechanism, you'll also get clock frequencies much closer to real seconds, because such mechanisms also keep track of long- and short-term frequency drift that you as an application cannot possibly know.
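As an illustration of the operating system route, here is a minimal sketch assuming a POSIX system that provides `clock_gettime`. `CLOCK_PROCESS_CPUTIME_ID` measures the CPU time your process has consumed, and `CLOCK_MONOTONIC` measures elapsed time. `clock_seconds` and `busy_work` are names invented for this example; `busy_work` just stands in for whatever you want to benchmark.

```c
#define _POSIX_C_SOURCE 200809L  /* expose clock_gettime and the clock IDs */
#include <time.h>

/* Current reading of the given clock in seconds, or -1.0 on failure. */
double clock_seconds(clockid_t id)
{
    struct timespec ts;
    if (clock_gettime(id, &ts) != 0)
        return -1.0;
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/* Hypothetical workload: sums 0..n-1. The volatile accumulator keeps
 * the compiler from optimizing the loop away. */
unsigned long busy_work(unsigned long n)
{
    volatile unsigned long acc = 0;
    for (unsigned long i = 0; i < n; i++)
        acc += i;
    return acc;
}
```

To time something, read `clock_seconds(CLOCK_PROCESS_CPUTIME_ID)` before and after the call to `busy_work` and subtract; do the same with `CLOCK_MONOTONIC` if you want elapsed time instead. A large gap between the two differences is a hint that your process spent part of the interval not running at all.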