Unexpected output in C with access to ARRAY in memory with RDTSC

Question

Here is my program in C.

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <stdint.h>

static int DATA[1024]={1,2,3,4,.....1024};

inline void foo_0(void)
{
    int j;
    puts("Hello, I'm inside foo_0");
    int k=0;
    for(j=0;j<1024;j++)
    {
        //k=DATA[j];
        DATA[j]+=1;
    }
    k+=0;
}


inline void foo_1(void)
{
    int j;
    puts("Hello, I'm inside foo_1");
    int k=0;
    for(j=0;j<1024;j+=4)
    {
        //k=DATA[j];
        DATA[j]+=1;
    }
    k+=0;
}

inline void foo_2(void)
{
    int j;
    puts("Hello, I'm inside foo_2");
    int k=0;
    for(j=0;j<1024;j+=16)
    {
        //k=DATA[j];
        DATA[j]+=1;
    }
    k+=0;
}


inline uint64_t rdtsc()
{
    unsigned long a, d;
    asm volatile ("cpuid; rdtsc" : "=a" (a), "=d" (d) : : "ebx", "ecx");// core i3/i7 
    return a | ((uint64_t)d << 32);
}

inline void clflush(volatile void *p)
{
    asm volatile ("clflush (%0)" :: "r"(p));
}

int main(void)
{

    volatile uint64_t start, end,temp;
    unsigned long long total_time=0;


    foo_0();  // This will load DATA array into memory
    foo_1();  // DATA already loaded into memory
    foo_2(); // DATA already loaded into memory

    printf("**********************\n");  
    start=rdtsc();
    foo_2();
    end=rdtsc();
    temp=end-start;
    total_time=temp;

    printf("Time taken foo_2 = %llu\n",total_time);             

    start=rdtsc();
    foo_1();
    end=rdtsc();
    temp=end-start;
    total_time=temp;

    printf("Time taken foo_1 = %llu\n",total_time);

    start=rdtsc();
    foo_0();
    end=rdtsc();
    temp=end-start;
    total_time=temp;

    printf("Time taken foo_0 = %llu\n",total_time);


    return 0;
}

output :

Hello, I'm inside foo_0
Hello, I'm inside foo_1
Hello, I'm inside foo_2
**********************
Hello, I'm inside foo_2
**Time taken foo_2 = 6350**
Hello, I'm inside foo_1
**Time taken foo_1 = 10056**
Hello, I'm inside foo_0
**Time taken foo_0 = 21726**

EDIT 1 :

When I remove puts() from all three functions , I got this result

Time taken foo_0 = 16448
Time taken foo_1 = 4438
**********************
Time taken foo_2 = 1548

I am expecting same access time for all three foo_0, foo_1 and foo_2 as when memory is accessed to access an integer a whole cache block is loaded (64 bytes is block size, so 16 integers are loaded inside cache block) , so access time for 1, 4, 16 integer must be same. I am not getting that. Where I am making mistake ? I am using gcc under linux. Thanks in advance .

EDIT 2

As per answer by BЈовић and suggestion of Leeor in comment, the execution time must be different as there are different no of addition operations performed in three different functions. Although there are same no of misses in all three cases, but no of execution of addition operation varies and for all operations data is accessed from cache(*not from main memory* as I suspect) the total time depends on no of addition operation performed on array, not on cache miss/hit. So I accept the answer of BЈовић and suggestion of Leeor. Thank you to all of you.

@BЈовић , after closing all user process and clearing linux caches using sync ; echo 3 > /proc/sys/vm/drop_caches , I got this result Time taken foo_0 = 16532 Time taken foo_1 = 4386 ********************** Time taken foo_2 = 1556 Time taken foo_0 = 16532 Time taken foo_1 = 4412 ********************** Time taken foo_2 = 1538 [root@localhost 10feb]# ./c Time taken foo_0 = 16542 Time taken foo_1 = 4436 ********************** Time taken foo_2 = 1470 — bholanath, Feb 10 '14 at 07:51
Why is the line of asterisks printing out between the time for `foo_1()` and the time for `foo_2()`? Are your results consistent between runs? Do you get similar results if you move *all* the console output to after timing the three calls? Also, what is the intent for the `k+=0;` statement? — Michael Burr, Feb 10 '14 at 08:08
@MichaelBurr I just removed other print statements with asterisks, I got following result- Time taken foo_0 = 16558 Time taken foo_1 = 4522 Time taken foo_2 = 1394. It is consistent after 1-2 runs. — bholanath, Feb 10 '14 at 09:08
Also, If i rearrange the foo functions , I am getting same result as above. — bholanath, Feb 10 '14 at 09:10
@MichaelBurr k+=0 ; is used while k=DATA[j] is used to suppress warning of -Wall flags. I forgot to comment it out while using DATA[j]+=1; — bholanath, Feb 10 '14 at 09:13

score 1 · Accepted Answer · answered Feb 10 '14 at 09:13

1

1) Your process is being pushed to the background by other processes, therefore it is not possible to get precise timings.

2) The times can not be possible be the same because of different loops:

foo_0() : for(j=0;j<1024;j++)
foo_1() : for(j=0;j<1024;j+=4)
foo_2() : for(j=0;j<1024;j+=16)

answered Feb 10 '14 at 09:13

BЈовић

62,405
41
173
273

@BJob , I closed all other user processes. only system processes are there. My expectation is while foo_0(), foo_1(),foo_2() is accessed the array whole DATA array is loaded into cache. Now , when I access for(j=0;j<1024;j++) for DATA[0], DATA[16],DATA[32] there will be miss in case it is not in cache which is true for other two cases also. – bholanath Feb 10 '14 at 09:17
@bholanath it is not possible to close all processes. Your data is small enough to store it in cache - that is not the problem. – BЈовић Feb 10 '14 at 09:23
as my array is small enough to loaded into cache, so for all three cases it will be accessed from cache and that's why they must give same access time, but my result show huge difference. – bholanath Feb 10 '14 at 09:43
1

@bholanath, cache access is not the only constraint, you need to run all these instructions through your CPU, memory pipeline, commit logic, etc. This would take different time for different lengths of loops. – Leeor Feb 10 '14 at 13:32
@Leeor , if you go through this link http://igoro.com/archive/gallery-of-processor-cache-effects/ then you can see // Loop 1 for (int i = 0; i < arr.Length; i++) arr[i] *= 3; // Loop 2 for (int i = 0; i < arr.Length; i += 16) arr[i] *= 3; the two for-loops take about the same time: 80 and 78 ms respectively. The author described the concept of cache as well as hardware prefetching . – bholanath Feb 10 '14 at 13:58
@bholanath How can two loops with different number of iterations take same time to execute? – BЈовић Feb 10 '14 at 14:09
@BЈовић As the running time of these loops is dominated by the memory accesses to the array, not by the integer multiplications/addition. see http://igoro.com/archive/gallery-of-processor-cache-effects/ – bholanath Feb 10 '14 at 14:12
@bholanath In `foo_0()`, the loop is executed 1024 times. In `foo_1()`, the loop is executed 256 times. In `foo_2()`, the loop is executed 64 times. Number of memory accesses is appropriately reduced. – BЈовић Feb 10 '14 at 14:26
1

@bholanath - you said it yourself - in the link you brought, the loop fetches the data from memory (it's 256MB), so the memory access time dominates everything and shadows any difference in the CPU. The number of mem accesses is the same. In your example however, the data fits in the L1, so the fetch time is very small - exposing the execution time. The cache accesses are still the same, but the number of executed instructions is 4x and 16x. – Leeor Feb 10 '14 at 14:42
Theoretically, while there is a miss the access time will be higher as the data needs to be brought from higher level of cache/memory as compared to hit. As cache block size is 64B, 16 integers will be accommodated inside a single cache block. As data is loaded into cache blockwise(means whole 64B data will be loaded, not only 4B will be loaded when we access DATA[0]). It means when we access DATA[0],then all other DATA[1],DATA[2]...DATA[15] also be loaded. So there will be no miss for DATA[1],DATA[2]...DATA[15] and access time is for those negligible. only 1 miss for every 16 access. – bholanath Feb 10 '14 at 14:47
@Leeor , I think you are right. Let me check by varying the size of my array . – bholanath Feb 10 '14 at 14:50

Unexpected output in C with access to ARRAY in memory with RDTSC

1 Answers1