
I am writing some performance-critical code (it runs in a very tight loop, as indicated by profiling) whose logic is basically as follows (`the_key` is a parameter and `mmap_base` is the base address of a memory-mapped file):

while (current_item && ((struct my_struct *)(mmap_base + current_item))->key < the_key){
    /* Do something less performance critical */
    current_item = ((struct my_struct *)(mmap_base + current_item))->next;
}

Profiling indicates that this piece of code is disk-bound on dereferencing `(mmap_base + current_item)`, which makes sense, as random disk IO is considerably slow.

It is impossible to load the relevant portion of the mmap-ed file into memory, as the file is huge, at about 100 GB. I am thinking about using something like `__builtin_prefetch()`:

while (current_item && ((struct my_struct *)(mmap_base + current_item))->key < the_key){
    __builtin_prefetch(mmap_base + ((struct my_struct *)(mmap_base + current_item))->next, 0, 0);
    /* Do something less performance critical */
    current_item = ((struct my_struct *)(mmap_base + current_item))->next;
}

However, this does not work. It looks like `__builtin_prefetch()` is of no use on mmap-ed memory anyway.
I then tried `madvise()`:

while (current_item && ((struct my_struct *)(mmap_base + current_item))->key < the_key){
    madvise(mmap_base + ((struct my_struct *)(mmap_base + current_item))->next, sizeof(struct my_struct), MADV_WILLNEED);
    /* Do something less performance critical */
    current_item = ((struct my_struct *)(mmap_base + current_item))->next;
}

However, this actually decreased performance, and profiling showed that the `madvise()` call has now become the major overhead.

Are there compiler builtins (x86_64, GCC) or other ways to tell the kernel (Linux) to prefetch data from disk into memory/the CPU cache?

Edit 1:
Some suggested that this is simply impossible without improving data locality. In that case, however, I do wonder why it would be impossible to make an asynchronous read to the disk while moving on to the "less performance critical" part, which should allow faster access; is it more that the kernel does not implement this, or are there theoretical/physical restrictions?
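
For what it's worth, Linux does expose one way to start such a read without blocking: `posix_fadvise()` with `POSIX_FADV_WILLNEED` initiates readahead on the underlying file descriptor and returns without waiting for the I/O. A minimal sketch, not the code from the question; `fd` is a hypothetical name for the descriptor the file was mmap-ed from, and `next` a byte offset into it:

#include <fcntl.h>
#include <unistd.h>

/* Ask the kernel to start reading the page containing `next` in the
 * background, then return immediately so other work can proceed. */
static void request_readahead(int fd, off_t next)
{
    off_t page = (off_t)sysconf(_SC_PAGESIZE);
    off_t aligned = next & ~(page - 1);   /* round down to a page boundary */

    posix_fadvise(fd, aligned, page, POSIX_FADV_WILLNEED);
}

Whether this actually overlaps anything depends on how long the "less performance critical" work takes; as noted in the comments below, a system call per tiny record may cost about as much as the page fault it is meant to hide.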

Edit 2:
Some recommended using a separate thread to pre-access the memory so that the kernel prefetches it. However, I think threads can be expensive. Is it really helpful to start a thread for each prefetch? The code is in a tight loop, so that may mean a lot of threads need to be started/joined. On the other hand, if I only use one thread, how should I communicate with it about what to prefetch?
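
On the single-thread variant, one common pattern (a sketch only, with hypothetical names, untested against this workload) is one long-lived prefetcher thread that is handed the next offset through an atomic variable; the hot loop then pays only a single store per item, and the page fault happens in the helper thread:

#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>

extern char *mmap_base;                   /* base of the 100 GB mapping */
static _Atomic uint64_t prefetch_offset;  /* 0 == nothing to prefetch */

static void *prefetcher(void *arg)
{
    (void)arg;
    uint64_t last = 0;
    for (;;) {
        uint64_t off = atomic_load_explicit(&prefetch_offset, memory_order_acquire);
        if (off != 0 && off != last) {
            /* A volatile read forces the page in; the value is discarded. */
            (void)*(volatile char *)(mmap_base + off);
            last = off;
        }
        /* A real implementation would block (condvar/futex) instead of spinning. */
    }
    return NULL;
}

/* In the hot loop, publishing the next item is one atomic store:
 *   atomic_store_explicit(&prefetch_offset, next, memory_order_release);
 */

The thread would be created once with `pthread_create()` and kept running, as suggested in the comments below, so no per-item thread start/join cost is paid.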

user12986714
  • `madvise()` on less than a page at a time doesn't make sense. Needs to be used on bigger chunks of memory for best effect. – Shawn Aug 12 '20 at 23:57
  • Interesting problem. I imagine the reason why `__builtin_prefetch` doesn't do much is that it only fetches from memory into cache, and won't trigger a fault to load more memory from disk. I'm also not surprised that madvise is slower since I imagine the struct is relatively small. How much work is there per access? Can you improve data locality? – that other guy Aug 12 '20 at 23:57
  • @thatotherguy Well, the whole point of using a linked list is that improving data locality is extremely difficult... My particular use case is a tree in the mmap-ed file and I need to attach some tags onto some nodes randomly while inserting other tree nodes, hence the linked list – user12986714 Aug 13 '20 at 00:40
  • Particularly, using `madvise()` made the code ~23x slower; previously the overhead of that piece of code was ~41%, and now the overhead of `madvise()` is 89% – user12986714 Aug 13 '20 at 00:45
  • 1
    for madvise to be any use it's going to need to fetch stuff that's needed many access cycles ahead, and this is usually very hard to implement. Does changing the number of threads help at all ? (may also increase contention) this is probably an easier way than trying to predict access patterns. – camelccc Aug 13 '20 at 01:00
  • @camelccc It is writing data so it is protected by a giant pthread rwlock... – user12986714 Aug 13 '20 at 01:07
  • "*random disk IO is considerably slow.*" This does not look like random disk IO but sequential. – Acorn Aug 13 '20 at 01:10
  • @Acorn Well, as it is mainly a linked list, it may go like `1 --> 5 --> 3 --> 7 --> 2` or whatsoever as items get inserted – user12986714 Aug 13 '20 at 01:14
  • 3
    @user12986714 Ah, but then the very first thing you should be doing is improve data locality on the data at rest. There is very little you will be able to do reliably if you are making a disk seek around 100 GB of data for tiny chunks which need to be read to know the next tiny chunk. Either that, or buy new hardware suited to that access pattern. – Acorn Aug 13 '20 at 01:32
  • Improving locality is merely the dramatically more efficient approach, not the only one. You can schedule an async read by accessing the page in a separate thread and maybe even get a speedup if the work/read takes long enough, but this still corresponds to random access with a queue depth of 1, so it sounds like trying to hand-optimize a bubble sort. – that other guy Aug 13 '20 at 02:50
  • @thatotherguy That would be an interesting option; however, am I really going to spawn a thread each time just to prefetch the memory? I think the thread itself may be as/more costly; if I use a single prefetcher thread, inter-thread communication may also introduce as much/more cost. – user12986714 Aug 13 '20 at 02:54
  • I suspect that unless your "less performance critical" task takes a very long time, any speedup from starting the disk I/O in the background is outweighed by the overhead of an extra system call to get it started. – Nate Eldredge Aug 14 '20 at 00:58
  • 1
    Certainly don't start a new thread each time; have a single thread and keep it running. Communication between threads is designed to be fast, especially if they are running on separate cores so that no context switch is needed. – Nate Eldredge Aug 14 '20 at 00:59

1 Answer


This type of access pattern will always be slow, because it potentially jumps around, without any sensible way to predict the pattern.

The approach I would try is to generate a separate memory-mapped key index file, with just the key values and the offset of the corresponding record, and with keys sorted in increasing order. That way, finding a specific key takes roughly O(log N) time (depending on how you deal with duplicate keys), using a very simple binary search.
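
A minimal sketch of that lookup, assuming fixed-width 64-bit keys and a hypothetical `index_entry` layout for the key/offset pairs (duplicate-key handling omitted); `index` points at the mmap-ed, sorted index file:

#include <stddef.h>
#include <stdint.h>

struct index_entry {
    uint64_t key;
    uint64_t offset;   /* offset of the record in the 100 GB data file */
};

/* Lower-bound binary search: returns the offset of the first record whose key
 * is >= the_key, or 0 (the question's list terminator) if there is none. */
static uint64_t index_lookup(const struct index_entry *index, size_t n_entries,
                             uint64_t the_key)
{
    size_t lo = 0, hi = n_entries;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (index[mid].key < the_key)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo < n_entries ? index[lo].offset : 0;
}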

If the keys in the 100 GB file are being modified during operation, a single flat file is unsuitable for describing the data.

If you can handle the code complexity, partitioned binary search trees in array form have even better performance. In that case, you split the index file into fixed-size parts, say 64 kB (4096 key-offset pairs), each containing in array form a rectangular part of the perfectly balanced binary search tree. For example, the very first partition contains the middle key, the 1/4 and 3/4 keys, the 1/8, 3/8, 5/8, and 7/8 keys, and so on. Furthermore, you only include the keys in the primary index file, and use a secondary index file for the record offsets. (If you have duplicate keys, have the secondary index file refer to the first one, with each duplicate secondary index file entry referring to the next one, so you can track the chain directly with a small time penalty but no extra space cost.)

This has much better locality than a binary search on a sorted array, but the code and logic complexity is a bit daunting.
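
To make the layout concrete, here is a sketch of the search over a perfectly balanced binary search tree stored in array (breadth-first) order with the root at index 1, which is what each partition above is built from; the split into 64 kB partitions and the secondary offset file are left out, and the names are hypothetical:

#include <stddef.h>
#include <stdint.h>

/* `keys` holds n keys at indices 1..n, arranged so that the children of the
 * node at index i sit at indices 2*i and 2*i + 1.  Returns the index of
 * the_key, or 0 if it is not present. */
static size_t tree_search(const uint64_t *keys, size_t n, uint64_t the_key)
{
    size_t i = 1;                              /* start at the root */
    while (i <= n) {
        if (the_key == keys[i])
            return i;
        i = 2 * i + (the_key > keys[i]);       /* go left (2i) or right (2i+1) */
    }
    return 0;                                  /* fell off the tree: not found */
}

Because the first dozen or so levels of that tree all land in the very first 64 kB partition, each block fetched from disk resolves many comparisons at once, which is where the locality win over a plain sorted array comes from.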

Gonbidatu