I am writing some performance-critical code (i.e. code in a very tight loop, as indicated by profiling) whose logic is basically the following (the_key is a parameter and mmap_base is the base address of a memory-mapped file):
while (current_item && ((struct my_struct *)(mmap_base + current_item))->key < the_key) {
    /* Do something less performance critical */
    current_item = ((struct my_struct *)(mmap_base + current_item))->next;
}
Profiling indicates that this piece of code is disk-bound on the dereference of (mmap_base + current_item), which makes sense, as random disk I/O is considerably slow.
It is impossible to keep the relevant portion of the mapping resident in memory, as the file is huge, at about 100 GB. I am thinking about using something like __builtin_prefetch():
while (current_item && ((struct my_struct *)(mmap_base + current_item))->key < the_key) {
    __builtin_prefetch(mmap_base + ((struct my_struct *)(mmap_base + current_item))->next, 0, 0);
    /* Do something less performance critical */
    current_item = ((struct my_struct *)(mmap_base + current_item))->next;
}
However, this does not work. It looks like __builtin_prefetch() is of no use on mmap-ed memory that is not yet resident: it emits a CPU prefetch instruction, which is silently dropped when the page is not present, and never triggers disk I/O.
I then tried madvise():
while (current_item && ((struct my_struct *)(mmap_base + current_item))->key < the_key) {
    madvise(mmap_base + ((struct my_struct *)(mmap_base + current_item))->next, sizeof(struct my_struct), MADV_WILLNEED);
    /* Do something less performance critical */
    current_item = ((struct my_struct *)(mmap_base + current_item))->next;
}
However, this actually decreased performance, and profiling showed that the madvise() call itself is now the major overhead.
Are there compiler builtins (x86_64, GCC) or other ways to tell the kernel (Linux) to prefetch data from disk into memory/the CPU cache?
Edit 1:
Some suggested that this is simply impossible without improving data locality. In that case, however, I wonder why it is impossible to issue an asynchronous disk read while moving on to the "less performance critical" part, which should allow faster access; is it that the kernel does not implement this, or is it a theoretical/physical restriction?
Edit 2:
Some recommended using a separate thread to pre-access the memory, so that the kernel prefetches the pages. However, I think threads can be expensive. Is it really helpful to start a thread for each prefetch? The code is in a tight loop, so a lot of threads would need to be started and joined. On the other hand, if I use only one thread, how should I communicate with it about what to prefetch?