In this program:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint16_t *data = (uint16_t[]){1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
    int mlen = 10;
    uint16_t partial = 0;

    __builtin_prefetch(data + 8);
    while (mlen > 0) {
        partial += *data;
        data += 1;
        mlen -= 1;
    }
    return 0;
}

I am using __builtin_prefetch(data + 8); so that the elements up to index 8 will be fetched into the cache. But when I compile the program with

  gcc prefetcher.c -DDO_PREFETCH -o with-prefetch -std=c11 -O3

it is slower than this:

  gcc prefetcher.c -o no-prefetch -std=c11 -O3

This is the output of each, respectively. With prefetch:

         12401      L1-dcache-load-misses     #    6.76% of all L1-dcache accesses
        183459      L1-dcache-loads                                             

   0.000881880 seconds time elapsed

   0.000952000 seconds user
   0.000000000 seconds sys

and this is without prefetch:

         12991      L1-dcache-load-misses     #    6.87% of all L1-dcache accesses
        189161      L1-dcache-loads

   0.001349719 seconds time elapsed

   0.001423000 seconds user
   0.000000000 seconds sys

What do I need to do correctly so that my __builtin_prefetch code runs faster?

The output above is from the perf program.

user786
  • Is there a reason `data` is a pointer instead of an array? – Some programmer dude Nov 19 '21 at 07:33
  • The times "without prefetcher" are 50% slower, so it seems your __builtin_prefetch code is *already* faster. – user3386109 Nov 19 '21 at 07:34
  • And why do you need to make your "code run faster"? What is the actual problem you're trying to solve? What performance requirements do you have? How have you measured that this is one of the top-two bottlenecks in your program? – Some programmer dude Nov 19 '21 at 07:34
  • `The times "without prefetcher" are 50% slower, so it seems your __builtin_prefetch code is already faster` can u please help me understand the perf output? How are u reading it as faster from the output in the question? – user786 Nov 19 '21 at 07:35
  • @Someprogrammerdude so actually it's just for learning, but after I understand it I plan to use it for reading IP ranges using getter buffers in readv – user786 Nov 19 '21 at 07:37
  • The time elapsed is 0.00088 with prefetch, and 0.00135 without. So you need to explain why you think prefetch is slower. – user3386109 Nov 19 '21 at 07:42
  • @user3386109 oh sorry I yes u are right so `user` one I believe I should consider. – user786 Nov 19 '21 at 07:58
  • @user3386109 also, does `__builtin_prefetch(data + 8);` fetch the first 8 elements of the array into the cache? I believe it does. So does it fetch 8 elements from the array into the cache? – user786 Nov 19 '21 at 07:59
  • But `user` is the same thing: 9 with prefetch, 14 without prefetch. – user3386109 Nov 19 '21 at 08:00
  • @user3386109 sorry this line `0.000881880 seconds time elapsed` – user786 Nov 19 '21 at 08:02
  • @user3386109 also, does __builtin_prefetch(data + 8); fetch the first 8 elements of the array into the cache? I believe it does. So does it fetch 8 elements from the array into the cache? – user786 Nov 19 '21 at 08:02
  • @user786 Your code **just created and initialized the array**. Why do you think it isn't already in cache? You haven't demonstrated that the prefetch is doing anything for your performance, nor have you demonstrated that adding a prefetch to more complex code would help. Even if adding a prefetch makes this trivial code run faster, that doesn't mean it would do the same for more complex code. You're just as likely to have your prefetch kick something else out that would have been even more useful for performance. – Andrew Henle Nov 19 '21 at 09:01
  • @AndrewHenle would reading a file demonstrate the real performance boost of prefetch? Can u tell me this? – user786 Nov 19 '21 at 09:23
  • @user786 Optimization and performance analysis are very complex subjects. Think about what happens on an x86 system: how many processes are running at the same time? How many registers are being used? How big are the CPU caches? How are the caches being used by all the running processes? Now add in that the CPU**s** "pipeline" instructions, running multiple instructions from a process at a time in a, well, pipeline. What instructions are being run? How do they interact with each other? You can't take the results from one program and say any optimization would do the same to another. – Andrew Henle Nov 19 '21 at 09:39

1 Answer

What do I need to do correctly so that my __builtin_prefetch code runs faster?

You need to remove __builtin_prefetch. It is literally the only instruction that differs between the two versions. The compiler optimized the rest of your code to a no-op, as it has no side effects.

Without __builtin_prefetch, your code compiles to:

main:
        xor     eax, eax
        ret

With __builtin_prefetch, it compiles to:

main:
        xor     eax, eax
        prefetcht0      [rsp-24]
        ret

Even if you, for example, return partial, the compiler is able to calculate the entire result at compile time and reduce the whole program to just return <constant>.

You can easily inspect the generated assembly of your programs using https://godbolt.org/.

KamilCuk
  • Ok. Also, can u please tell me if __builtin_prefetch can be helpful when I read a file? If I prefetch the buffer in advance of consecutive reads from the file in a while loop? – user786 Nov 19 '21 at 14:51