In theory, the cost of a double-word addition/subtraction is taken to be twice that of a single-word one. Similarly, the cost ratio of single-word multiplication to addition is taken as 3. I wrote the following C program, compiled with GCC on Ubuntu 14.04 LTS, to check the number of clock cycles on my machine, an Intel Sandy Bridge Core i5-2410M. Most of the time the program reports 6 clock cycles for 128-bit addition, but I have taken the best case. I compiled with the command (gcc -o ow -O3 cost.c), and the result is given below:

32-bit Add: Clock cycles = 1    64-bit Add: Clock cycles = 1    64-bit Mult: Clock cycles = 2   128-bit Add: Clock cycles = 5 

The program is as follows:

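#include <stdio.h>   /* printf */
#include <stdlib.h>  /* rand, srand */
#include <stdint.h>  /* uint32_t, uint64_t, int64_t */
#include <time.h>    /* time */

#define n 500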
#define counter 50000

typedef uint64_t utype64;
typedef int64_t type64;
typedef __int128 type128;

static __inline__ utype64 rdtsc() {
        uint32_t lo, hi;
        __asm__ __volatile__ ("xorl %%eax,%%eax \n        cpuid"::: "%rax", "%rbx", "%rcx", "%rdx");
        __asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
        return (utype64)hi << 32 | lo;
}

int main(){
    utype64 start, end;
    type64 a[n], b[n], c[n];
    type128 d[n], e[n], f[n];
    int g[n], h[n];
    unsigned short i, j;
    srand(time(NULL));
    for(i=0;i<n;i++){ g[i]=rand(); h[i]=rand(); b[i]=(rand()+2294967295); e[i]=(type128)(rand()+2294967295)*(rand()+2294967295);}
    for(j=0;j<counter;j++){
       start=rdtsc();
       for(i=0;i<n;i++){ a[i]=(type64)g[i]+h[i]; }
       end=rdtsc();
       if((j+1)%5000 == 0)
          printf("%lu-bit Add: Clock cycles = %lu \t", sizeof(g[0])*8, (end-start)/n);

       start=rdtsc();
       for(i=0;i<n;i++){ c[i]=a[i]+b[i]; }
       end=rdtsc();
       if((j+1)%5000 == 0)
          printf("%lu-bit Add: Clock cycles = %lu \t", sizeof(a[0])*8, (end-start)/n);

       start=rdtsc();
       for(i=0;i<n;i++){ d[i]=(type128)c[i]*b[i]; }
       end=rdtsc();
       if((j+1)%5000 == 0)
          printf("%lu-bit Mult: Clock cycles = %lu \t", sizeof(c[0])*8, (end-start)/n);

       start=rdtsc();
       for(i=0;i<n;i++){ f[i]=d[i]+e[i]; }
       end=rdtsc();
       if((j+1)%5000 == 0){
          printf("%lu-bit Add: Clock cycles = %lu \n", sizeof(d[0])*8, (end-start)/n);
        printf("f[%hu]= %ld %ld \n\n", i-7, (type64)(f[i-7]>>64), (type64)(f[i-7]));}
   }

return 0;
}

There are two things in the result that bother me.

1) Can the number of clock cycles for a (64-bit) multiplication really be 2?

2) Why is the number of clock cycles for double-word addition more than twice that of single-word addition?

I am mainly concerned with case (2). Is it because of my program logic, or is it due to GCC's compiler optimization?

user110219

1 Answer

> In theory we know that the double-word addition/subtraction takes 2 times of a single-word.

No, we don't.

> Similarly, the cost ratio of single-word multiplication to addition is taken as 3 because of fast integer multiplier of CPU.

No, it isn't.

You're not measuring instructions. You're measuring statements in your program, which may or may not have any relationship to the instructions your compiler emits. My compiler, for example, after I fixed your code so that it compiles, vectorized some of the loops, adding multiple values per instruction. The first loop by itself is still 23 instructions long, and your code still reports it as 1 cycle.

Modern (as in the past 25 years) CPUs don't execute one instruction at a time. They have multiple instructions in flight at once and can execute them out of order.
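To make the "multiple instructions in flight" point concrete, here is a minimal sketch with two ways of summing the same array. The function names are made up for illustration. The first keeps a single serial dependency chain; the second keeps four independent chains that an out-of-order core is free to overlap. Both return the same value, so any difference you time between them comes from how the hardware schedules the adds, not from the number of additions. A compiler at -O3 may also vectorize either loop, so check the generated assembly before drawing conclusions.

```c
#include <stddef.h>
#include <stdint.h>

/* One accumulator: every add depends on the result of the previous one,
   so the additions form a single serial chain. */
uint64_t sum_one_chain(const uint64_t *v, size_t len) {
    uint64_t s = 0;
    for (size_t i = 0; i < len; i++)
        s += v[i];
    return s;
}

/* Four accumulators: the four adds inside one iteration do not depend on
   each other, so the core may execute them in parallel. Unsigned addition
   wraps modulo 2^64, so the final result is identical to sum_one_chain. */
uint64_t sum_four_chains(const uint64_t *v, size_t len) {
    uint64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= len; i += 4) {
        s0 += v[i];
        s1 += v[i + 1];
        s2 += v[i + 2];
        s3 += v[i + 3];
    }
    for (; i < len; i++)   /* leftover elements when len is not a multiple of 4 */
        s0 += v[i];
    return s0 + s1 + s2 + s3;
}
```

Dividing an elapsed cycle count by the number of loop iterations, as the program in the question does, gives an average throughput for the whole loop, not the latency of any single instruction.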

Then you have memory accesses. On your CPU, no instruction can take a value from memory, add it to another value from memory, and store the result in a third memory location, so each statement already needs multiple instructions. Furthermore, memory accesses cost so much more than arithmetic instructions that anything that touches memory (unless it hits the L1 cache all the time) will be dominated by memory access time.
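As a rough illustration of what a "double-word addition" involves, here is a sketch of a 128-bit add written out over two 64-bit halves. The struct and function names are hypothetical, and this is not what GCC literally generates for `__int128`; it only shows the arithmetic that has to happen: a low-half add, a carry, and a high-half add that consumes the carry.

```c
#include <stdint.h>

/* Hypothetical two-limb representation of a 128-bit unsigned value. */
typedef struct {
    uint64_t lo;  /* low 64 bits */
    uint64_t hi;  /* high 64 bits */
} u128_limbs;

/* Add two 128-bit values held as pairs of 64-bit words. */
u128_limbs add128(u128_limbs x, u128_limbs y) {
    u128_limbs r;
    r.lo = x.lo + y.lo;             /* low halves; wraps modulo 2^64 */
    uint64_t carry = (r.lo < x.lo); /* 1 if the low add wrapped, else 0 */
    r.hi = x.hi + y.hi + carry;     /* high halves plus the carry */
    return r;
}
```

In the question's loop `f[i]=d[i]+e[i]`, each element also moves four 64-bit words in from memory and two back out, twice the traffic of the 64-bit loop, so the ratio between the two measurements reflects memory traffic and scheduling at least as much as the additions themselves.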

Furthermore, RDTSC might not even return the actual cycle count. Some CPUs have variable clock rates but still keep TSC going at the same rate regardless of how fast or slow the CPU is actually running because TSC is used by the operating system for time keeping. Others don't.

So you're not measuring what you think you're measuring and whoever told you those things was either oversimplifying vastly or hasn't seen CPU documentation in two decades.

Art