
Does anyone know which type of CPU cache behaviour or policy (e.g. uncacheable write-combining) is assigned to memory-mapped, file-backed regions on modern x86 systems?

Is there any way to detect which is the case, and possibly override the default behaviour?

Windows and Linux are the main operating systems of interest.

(Editor's note: the question was previously phrased as memory mapped I/O, but that phrase has a different, specific technical meaning, especially when talking about CPU caches: actual I/O devices like NICs or video cards that you talk to with loads/stores.

This question is actually about what kind of memory you get from mmap(some_fd, ...), when you don't use MAP_ANONYMOUS and it's backed by a regular file on disk.)

Peter Cordes
awdz9nld
  • I might be very, very wrong, but I doubt that they do anything special about that. File-backed memory pages are treated like any other typical memory area and cached normally. I say this because I have profiled access to file-backed memory (some years ago, though) and didn't observe any irregularity that could have been caused by special caching. I don't have the results or the code anymore, though. Also, what I tested was straight memory-mapped files, not any device I/O and such. – yzt Apr 06 '13 at 17:17
  • My experience agrees with @YaserZhian's. In fact, Windows (at least) seems to treat normal memory somewhat like a memory mapped file that just happens to be mapped to the swap file instead of some other file. – Jerry Coffin Apr 06 '13 at 17:25
  • [This may help you](http://kerneltrap.org/mailarchive/linux-kernel/2008/4/29/1657814) – DOOM Apr 06 '13 at 17:30
  • Memory mapped files or memory mapped IO? – harold Apr 06 '13 at 18:19
  • I/O to a file which has been mapped into memory – awdz9nld Apr 06 '13 at 18:23
  • @MartinKällman: Detecting whether a memory range is cached ("cachable") or not should be easy while *reading*. You map your file; you read through the range (a few kilobytes should be best) to make sure the page faults happen and the data is loaded from the file; then you go off and read a large and completely independent block of memory that is at least as large as your largest level of cache; then you come back and read your mapped area once, and then again another time. If the times of the last two read passes are different (use RDTSC), then it is cached; otherwise it's not. – yzt Apr 06 '13 at 21:33 (see the sketch after these comments)
  • @MartinKällman: You could also use a profiler that uses CPU performance counters. For example, I know that Intel VTune has the capability to show the number of cache misses in a particular piece of code. And the number of cache misses can probably tell you whether an area of memory is cached at all. By the way, you can do something like this for writes too (for *write combining* that you mention.) Again, I highly doubt that you can influence cache policy at user level; you certainly can do it at kernel or driver level though, but I suspect that would be for device I/O and not for files. – yzt Apr 06 '13 at 21:40
  • @yzt Thanks, this might be helpful in trying to deduce this behaviour, but I need to be able to know this with certainty either through platform/OS specs or at runtime, programmatically. – awdz9nld Apr 07 '13 at 13:35
  • @MartinKällman: If you need _certainty_, stop and rethink. Other processes can and will influence your process. Virus scanners in particular can make it hard to deduce what is happening. – MSalters Apr 07 '13 at 23:08
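
To make the timing experiment from the comments concrete, here is a minimal sketch, assuming x86-64 Linux with GCC or Clang. The file name `probe.dat`, the buffer sizes, and the 4x threshold are illustrative assumptions, not part of the original suggestion:

```c
/* Hypothetical probe: map a file, fault it in, evict the caches with a
 * large independent buffer, then time two back-to-back read passes with
 * RDTSC.  A much faster second pass suggests the mapping is cacheable. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <x86intrin.h>                        /* __rdtsc() */

#define MAP_LEN   ((size_t)(64 * 1024))        /* a few KiB of mapped file */
#define EVICT_LEN ((size_t)(64 * 1024 * 1024)) /* larger than the last-level cache */

static uint64_t time_read(volatile const char *p, size_t len)
{
    uint64_t t0 = __rdtsc();
    for (size_t i = 0; i < len; i += 64)      /* one touch per cache line */
        (void)p[i];
    return __rdtsc() - t0;
}

int main(void)
{
    int fd = open("probe.dat", O_RDONLY);     /* placeholder file, >= MAP_LEN bytes */
    if (fd < 0) { perror("open"); return 1; }

    volatile char *map = mmap(NULL, MAP_LEN, PROT_READ, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) { perror("mmap"); return 1; }

    time_read(map, MAP_LEN);                  /* fault the pages in */

    char *evict = malloc(EVICT_LEN);          /* thrash every cache level */
    memset(evict, 1, EVICT_LEN);

    uint64_t t1 = time_read(map, MAP_LEN);    /* misses expected */
    uint64_t t2 = time_read(map, MAP_LEN);    /* hits, if the region is cacheable */

    printf("pass 1: %llu cycles, pass 2: %llu cycles\n",
           (unsigned long long)t1, (unsigned long long)t2);
    puts(t2 * 4 < t1 ? "region appears cached (WB)" : "region appears uncached");

    free(evict);
    munmap((void *)map, MAP_LEN);
    close(fd);
    return 0;
}
```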

2 Answers


TL;DR: Memory-mapped files use the normal Write-Back policy for the pagecache pages that they map into the address space of your process. You have to do something special and OS-specific if you ever want pages that aren't WB.


The caching policy applied to a region of the address space is generally operating-system independent and depends only on the type of device behind the page. In fact, the operating system is free to apply any caching policy to any memory region, but an incorrectly assigned caching policy can reduce system performance or break system logic entirely.

There are at least four caching policies:

  1. Full caching (write-back, aka WB). Applied to the physical address space that is mapped to main memory (RAM). Used to increase the performance of the memory subsystem. The main property of such memory is that its state can be changed only by software, and the change affects only software.

    The memory-mapped files implementation uses full caching because it is implemented entirely in software (the operating system), which reads file chunks from disk, places them in memory, and later puts the chunks (possibly modified) back on disk. The hardware updates a "dirty" bit in the page tables to let the OS figure out what needs to be synced back to disk.

  2. Write-through caching (WT). The main property of such devices is that their state can be changed only by software, but the change must take immediate effect on the device. Under this policy, data written to a memory-mapped I/O device register is placed in two places concurrently: in the cache and in the device. But when a read is initiated, the data is served from the cache without an expensive access to the device.

    This cache policy could be useful for an MMIO device that never writes its own memory and only reads what the CPU wrote. In practice it's rarely used for anything. GPUs aren't like that: they do write video memory, so it's not used for video RAM. (There's no mechanism for the GPU to invalidate CPU caches of the region, because the GPU isn't part of the CPU's cache-coherency domain.)

  3. Uncacheable, write-combining (WC, aka USWC): weakly ordered memory, typically used for mapping video RAM. Like uncacheable, except that NT stores let you efficiently write a whole cache line at once, and movntdqa loads let you efficiently read whole cache lines, which you can't do any other way from WC regions. Normal loads fetch data separately for each load, even within the same line, because the region is uncacheable. (A sketch of both access patterns follows this list.)
  4. Disabled caching (UC). Applied to almost all I/O devices, because a write to a memory-mapped I/O device register must take immediate effect, and a read from such a register must return the actual, current data from the device. If caching were applied to a memory-mapped I/O device, two negative effects would be introduced:
    1. A write to the memory-mapped I/O device register would be delayed until the cache controller decided to flush the cache line containing the written data. As a result, the driver couldn't know when a command written to the device takes effect.
    2. Data read from the memory-mapped I/O device register could be cached, and a subsequent read of the same register could return stale data from the cache instead of the actual data from the device. This would make it hard for the driver to capture the actual state of the device.
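
As referenced in point 3, here is a hedged sketch of the NT-store and movntdqa access patterns, compiled with SSE4.1 support (e.g. `-msse4.1`). Run against ordinary WB memory it merely demonstrates the intrinsics; the special load behaviour only materializes on a real WC mapping such as mapped video RAM:

```c
/* Write-combining access patterns from point 3.  Both buffers must be
 * 16-byte aligned; SSE4.1 is required for movntdqa. */
#include <stddef.h>
#include <smmintrin.h>   /* SSE4.1: _mm_stream_load_si128; pulls in the SSE2 store too */

/* Fill a (notionally WC-mapped) buffer with non-temporal stores: whole
 * lines drain through the WC buffers without a read-for-ownership. */
void wc_write(void *dst, const void *src, size_t bytes)
{
    __m128i *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;
    for (size_t i = 0; i < bytes / 16; i++)
        _mm_stream_si128(&d[i], s[i]);        /* movntdq */
    _mm_sfence();                             /* order the NT stores before later stores */
}

/* Read a (notionally WC-mapped) buffer with streaming loads: on WC
 * memory, movntdqa fetches a whole line into a streaming buffer instead
 * of re-fetching the line for every load. */
void wc_read(void *dst, void *src, size_t bytes) /* src non-const: the intrinsic requires it */
{
    __m128i *d = (__m128i *)dst;
    __m128i *s = (__m128i *)src;
    for (size_t i = 0; i < bytes / 16; i++)
        d[i] = _mm_stream_load_si128(&s[i]);  /* movntdqa */
}
```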

Because the mechanism by which software specifies the caching policy depends only on the processor, the same algorithm can be applied in any operating system. The simplest way is to capture the content of the CR3 register, use it to locate the page-table entry for the address whose caching policy you want to know, and check its PCD and PWT flags. But this approach isn't complete, because a few other features can also affect caching (for example, caching can be disabled completely via CR0; see also the MTRRs and the PAT).
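
Reading CR3 and walking the page tables requires ring 0, so any such check has to live in the kernel. As an illustration only (header locations and symbol exports vary across kernel versions), this is roughly what the PCD/PWT check could look like inside a Linux kernel module, using the x86 helper lookup_address():

```c
/* Sketch for a Linux kernel module context only; CR3 and the page
 * tables are inaccessible from user space.  lookup_address() walks the
 * current page tables and returns the PTE covering a virtual address. */
#include <linux/module.h>
#include <asm/pgtable.h>

static void report_cache_bits(unsigned long vaddr)
{
    unsigned int level;
    pte_t *pte = lookup_address(vaddr, &level);

    if (!pte || !pte_present(*pte)) {
        pr_info("0x%lx: not mapped\n", vaddr);
        return;
    }
    /* PWT selects write-through, PCD disables caching; together with
     * the PAT bit they index an entry in the PAT MSR. */
    pr_info("0x%lx: PWT=%d PCD=%d (level %u)\n", vaddr,
            !!(pte_val(*pte) & _PAGE_PWT),
            !!(pte_val(*pte) & _PAGE_PCD), level);
}
```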

Peter Cordes
ZarathustrA

To add to ZarathustrA's existing answer: on Windows, SEC_NOCACHE turns off this caching. There's also a SEC_WRITECOMBINE, but that appears broken: it only works with SEC_RESERVE or SEC_COMMIT, which means only with the page file, and you don't want to set SEC_WRITECOMBINE on that.
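
For illustration, a minimal, untested sketch of a pagefile-backed section created with SEC_NOCACHE. Per its documentation, SEC_NOCACHE itself also requires SEC_COMMIT or SEC_RESERVE, which is why the example is pagefile-backed rather than file-backed:

```c
/* Pagefile-backed section mapped uncached (UC) via SEC_NOCACHE. */
#include <windows.h>
#include <stdio.h>

int main(void)
{
    /* INVALID_HANDLE_VALUE => backed by the page file; 64 KiB section. */
    HANDLE h = CreateFileMappingW(INVALID_HANDLE_VALUE, NULL,
                                  PAGE_READWRITE | SEC_COMMIT | SEC_NOCACHE,
                                  0, 64 * 1024, NULL);
    if (!h) { fprintf(stderr, "CreateFileMapping: %lu\n", GetLastError()); return 1; }

    volatile unsigned char *p = MapViewOfFile(h, FILE_MAP_ALL_ACCESS, 0, 0, 0);
    if (!p) { fprintf(stderr, "MapViewOfFile: %lu\n", GetLastError()); return 1; }

    p[0] = 0x42;   /* this store goes straight to memory, bypassing the cache */

    UnmapViewOfFile((void *)p);
    CloseHandle(h);
    return 0;
}
```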

MSalters