24

For my project I must use inline assembly instructions such as rdtsc to calculate the execution time of some C/C++ instructions.

The following code seems to work on Intel but not on ARM processors:

{unsigned a, d;asm volatile("rdtsc" : "=a" (a), "=d" (d)); t0 = ((unsigned long)a) | (((unsigned long)d) << 32);}
//The C++ statement to measure its execution time
{unsigned a, d;asm volatile("rdtsc" : "=a" (a), "=d" (d)); t1 = ((unsigned long)a) | (((unsigned long)d) << 32);}
time = t1-t0;

My question is:

How do I write inline assembly similar to the above (to measure the elapsed execution time of an instruction) that works on ARM processors?

Curious
  • 4
    `rdtsc` on multi-core processors can have issues. see https://msdn.microsoft.com/en-us/library/ee417693(VS.85).aspx – Richard Critten Nov 06 '16 at 20:33
  • Single instructions will have variable timings based on cache etc. Better to loop thousands of times over it/them and use the perf_events() common functionality to make it work on all supported CPUs. – BitBank Nov 06 '16 at 21:39

2 Answers

20

You should read the PMCCNTR register of a co-processor p15 (not an actual co-processor, just an entry point for CPU functions) to obtain a cycle count. Note that it is available to an unprivileged app only if:

  1. Unprivileged PMCCNTR reads are allowed:

    Bit 0 of PMUSERENR register must be set to 1 (official docs)

  2. PMCCNTR is actually counting cycles:

    Bit 31 of PMCNTENSET register must be set to 1 (official docs)

This is a real-world example of how it's done.
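The two conditions above, and the counter read itself, can be sketched roughly as follows. This is only an illustration, assuming a GCC- or Clang-compatible compiler; every helper name here (`user_access_enabled`, `cycle_counter_enabled`, `require_pmu_ready`, `read_pmccntr`, `elapsed_cycles`) is hypothetical and not part of any library. The bit tests operate on register values that are assumed to have been obtained elsewhere (for example by privileged code), and the `mrc` read is compiled only when the compiler defines `__ARM_ARCH_7A__`.

```cpp
#include <cassert>
#include <cstdint>

// Bit masks for the two conditions listed above.
constexpr std::uint32_t kPmuserenrEn = 1u << 0;   // PMUSERENR bit 0: unprivileged access
constexpr std::uint32_t kPmcntensetC = 1u << 31;  // PMCNTENSET bit 31: cycle counter enabled

// Hypothetical helpers: test register values that were read elsewhere.
constexpr bool user_access_enabled(std::uint32_t pmuserenr) {
    return (pmuserenr & kPmuserenrEn) != 0;
}
constexpr bool cycle_counter_enabled(std::uint32_t pmcntenset) {
    return (pmcntenset & kPmcntensetC) != 0;
}

// Hypothetical precondition check before relying on PMCCNTR from user space.
inline void require_pmu_ready(std::uint32_t pmuserenr, std::uint32_t pmcntenset) {
    assert(user_access_enabled(pmuserenr));
    assert(cycle_counter_enabled(pmcntenset));
}

#if defined(__ARM_ARCH_7A__)
// Reads PMCCNTR through coprocessor p15. If unprivileged reads are not
// allowed, this instruction is expected to fault instead of returning.
inline std::uint32_t read_pmccntr() {
    std::uint32_t value;
    asm volatile("mrc p15, 0, %0, c9, c13, 0" : "=r"(value));
    return value;
}
#endif

// Difference of two 32-bit readings. Unsigned subtraction wraps modulo 2^32,
// so the result stays correct across a single counter overflow.
constexpr std::uint32_t elapsed_cycles(std::uint32_t start, std::uint32_t end) {
    return end - start;
}
```

On an ARMv7-A target, a measurement would then take one `read_pmccntr()` reading before the statement under test and one after it, and pass both to `elapsed_cycles`.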

hidefromkgb
  • @Curious Note that the answer above is valid for ARMv6 and above. Older arch versions might have their own methods of getting this data (specific to a particular chip - so the info is to be found in the datasheet for the chip), while some ARM-based chips don't provide such data at all. – tum_ Nov 06 '16 at 22:25
  • 1
    **My ARM CPU is ARM7A**, which I confirmed using the compiler macro `__ARM_ARCH_7A__`. However, when I try to use the instruction asm volatile("mrc p15, 0, **%0**, c9, c13, 0" : "=r"(pmccntr));, the compiler gives the error message: Error “no such instruction” asm volatile("mrc p15, 0, **%eax**, c9, c13, 0" : "=r"(pmccntr)); – Curious Nov 11 '16 at 03:07
  • **My Build Environment=** PLATFORM_VERSION_CODENAME=REL PLATFORM_VERSION=4.3 TARGET_PRODUCT=full_manta TARGET_BUILD_VARIANT=eng TARGET_BUILD_TYPE=release TARGET_BUILD_APPS= TARGET_ARCH=arm TARGET_ARCH_VARIANT=armv7-a-neon TARGET_CPU_VARIANT=cortex-a15 HOST_ARCH=x86 HOST_OS=linux HOST_OS_EXTRA=Linux-3.16.0-70-generic-x86_64-with-Ubuntu-14.04-trusty HOST_BUILD_TYPE=release BUILD_ID=JWR66V OUT_DIR=out – Curious Nov 11 '16 at 03:17
  • @hidefromkgb: This is the code that I used, but it gives the above error. {uint32_t pmccntr;asm volatile("mrc p15, 0, %0, c9, c13, 0" : "=r"(pmccntr));t0=static_cast<uint64_t>(pmccntr) * 64;} //The C++ statement to measure its execution time {uint32_t pmccntr;asm volatile("mrc p15, 0, %0, c9, c13, 0" : "=r"(pmccntr));t1=static_cast<uint64_t>(pmccntr) * 64;} time = t1-t0; – Curious Nov 11 '16 at 03:39
  • 1
    @Curious You are, most probably, using the wrong binutils, since `as` is definitely trying to assemble X86 instead of ARM7A. And, BTW, `* 64` is equivalent to `<< 6`, and the result does not have to be either promoted to `uint64_t` or multiplied until (T1 – T0) is calculated. As the difference is typically way smaller than 2²⁶, multiplying it by 64 won't require promotion to a 64-bit type. – hidefromkgb Nov 11 '16 at 13:03
  • @hidefromkgb: I followed the [AOSP guidelines](https://source.android.com/source/building.html) and I did not change anything in the binutils. How to force the `as` to assemble ARM7A instead of X86? – Curious Nov 12 '16 at 10:19
  • @Curious AOSP is by itself useless in your case: it does not allow anything but Java as a language in which apps can be written. And that's for a good reason, as there are many different hardware architectures that support Android, so compiling machine code for all of them is a pain, and you'd still leave out those which aren't yet supported. What you really need is [Android NDK](https://developer.android.com/ndk/index.html). NDK is positioned as a last-resort kit intended for programmers who positively do know what they are doing. – hidefromkgb Nov 12 '16 at 22:19
  • 1
    The URL provided is really helpful. Thanks. – Vyacheslav Jan 11 '21 at 21:02
10

For Arm64, the system register CNTVCT_EL0 can be used to retrieve the counter from user space.

// SPDX-License-Identifier: GPL-2.0
u64 rdtsc(void)
{
    u64 val;

    /*
     * According to ARM DDI 0487F.c, from Armv8.0 to Armv8.5 inclusive, the
     * system counter is at least 56 bits wide; from Armv8.6, the counter
     * must be 64 bits wide.  So the system counter could be less than 64
     * bits wide and it is attributed with the flag 'cap_user_time_short'
     * is true.
     */
    asm volatile("mrs %0, cntvct_el0" : "=r" (val));

    return val;
}

Please refer to this patch https://lore.kernel.org/patchwork/patch/1305380/ for more details.
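Note that CNTVCT_EL0 counts ticks of the system counter, which runs at the fixed frequency reported in CNTFRQ_EL0 rather than at the CPU clock, so the difference of two readings is a time interval, not a number of CPU cycles. A minimal sketch of converting such a difference to nanoseconds follows, assuming a GCC- or Clang-compatible compiler and an OS that permits user-space reads of both registers. The names `read_cntvct`, `read_cntfrq` and `ticks_to_ns` are hypothetical helpers, not an existing API.

```cpp
#include <cassert>
#include <cstdint>

#if defined(__aarch64__)
// Current value of the virtual system counter.
inline std::uint64_t read_cntvct() {
    std::uint64_t value;
    asm volatile("mrs %0, cntvct_el0" : "=r"(value));
    return value;
}

// Frequency of the system counter in Hz.
inline std::uint64_t read_cntfrq() {
    std::uint64_t value;
    asm volatile("mrs %0, cntfrq_el0" : "=r"(value));
    return value;
}
#endif

// Converts a tick difference to nanoseconds for a counter running at freq_hz.
// Whole seconds and the remainder are converted separately so that the
// intermediate product stays within 64 bits for realistic frequencies.
inline std::uint64_t ticks_to_ns(std::uint64_t ticks, std::uint64_t freq_hz) {
    assert(freq_hz != 0);
    const std::uint64_t ns_per_sec = 1000000000ull;
    return (ticks / freq_hz) * ns_per_sec
         + (ticks % freq_hz) * ns_per_sec / freq_hz;
}
```

On an AArch64 target, an elapsed time would then be obtained as `ticks_to_ns(end - start, read_cntfrq())`, where `start` and `end` are two `read_cntvct()` readings taken around the code under test.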

axiqia
  • Do you think it's appropriate to relicense that GPL 2.0 code from the Linux kernel as CC BY-SA by posting it on StackOverflow? – Jeff Hammond Aug 05 '21 at 16:55
  • 1
    @JeffHammond Thank you for pointing it out. I added the GPL 2.0 license. – axiqia Aug 06 '21 at 06:35
  • @JeffHammond: can you put a license on a sequence of two assembler instructions? – Violet Giraffe Jan 20 '23 at 20:50
  • Does the GPL include a minimum number of things that are copied before it applies? – Jeff Hammond Jan 21 '23 at 21:26
  • 1
    I tried this code, but it didn't work for me. It doesn't make sense that a sequence of code with 2304 multiplications executes in 30 cycles. – Bogi Jun 21 '23 at 07:29
  • It is not working for me either. I tried to approximate the CPU frequency in a 2 GHz aarch64 processor. `s = rdtsc(); sleep(1); e = rdtsc(); freq = (double)(e - s) / 10e9`. This code is reporting 20 MHz. – Ashfaqur Rahaman Jul 12 '23 at 08:12