How is DMA cache coherency kept on Intel chipsets?

Question

A few months ago I was reading something about Windows chipset iterations and the PCH upgrades between them, and I'm pretty sure I saw something on DMA cache coherency that involved the home agent or QHL (Nehalem), but I can't find it now.

So I'm asking whether anyone knows the details of any DMA cache coherency method that Intel has employed, and how it works.

On Nehalem's global queue (GQ), the optimisation manual says:

Cacheline requests from the cores or from a remote package or the I/O Hub are handled by the GQ.

The global queue checks whether the line is on the package and, if it is, snoops the appropriate cores using the core valid bits. On a dual-socket system with home snoop, the request is sent to the QHL (the home agent on SnB), which then forwards it to the QPI link that the NUMA node bitmap refers to. With source snoop, the GQ checks its own 2-bit I/O directory cache in order to generate a message for the correct QPI link; the QHL (QPI agent on SnB) must then generate another message to the correct LLC that has been assigned that address range. I'm not sure what happens in COD mode on Haswell or with SNC on the mesh architecture.
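To make the "core valid bits" step concrete, here is a minimal Python sketch of how a per-line bitmask could be turned into a list of cores to snoop. It is only a toy model under my own assumptions: the function name `cores_to_snoop`, the bit layout (bit N stands for core N) and the optional `requester` argument are hypothetical and are not taken from any Intel manual.

```python
from typing import List, Optional


def cores_to_snoop(core_valid_bits: int, requester: Optional[int] = None) -> List[int]:
    """Return the cores whose valid bit is set, skipping the requesting core.

    Assumption: bit N of core_valid_bits means "core N may hold the line".
    A DMA request from the I/O hub has no requesting core, so pass None.
    """
    if core_valid_bits < 0:
        raise ValueError("core_valid_bits must be a non-negative bitmask")
    cores = []
    bit = 0
    # Walk the mask from bit 0 upwards until no set bits remain.
    while core_valid_bits >> bit:
        if (core_valid_bits >> bit) & 1 and bit != requester:
            cores.append(bit)
        bit += 1
    return cores
```

For a request coming from the I/O hub, `cores_to_snoop(0b1010)` would select cores 1 and 3, while a request from core 1 itself would only need to snoop core 3.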

At a high level, DMA cache coherence is not very different from core cache coherence, although the (many) QPI/UPI transactions and the paths they follow are different. There are *so many* details scattered across Intel manuals, some of which are not (clearly) documented. DMA cache coherence can be completely disabled in the chipset, and some DMA requests can be non-coherent. Coherent requests are sent to the home node of the physical memory location. DMA write requests can be cached in an IOH agent (which is also a caching agent (CA)). I usually start with the uncore manual of the processor. — Hadi Brais, Apr 20 '19 at 02:16
I sort of envision it like this: when a PCIe transaction takes place (a write to memory), the physical address must also be sent onto the ring, where it will be picked up by the LLC slice that has been configured to cache that range. There must be some sort of indication in the ring transaction that the LLC needs to invalidate the line in all cores listed in the snoop filter if it has been cached. The LLC then sends out invalidates to the cores, which could probably use the same format as the invalidate it sends when it receives an RFO from a core. — Lewis Kelsey, Apr 20 '19 at 08:47
Except a TLP would have to be parsed and all the lines in the range of the data size invalidated — Lewis Kelsey, Apr 20 '19 at 15:34
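The idea in the two comments above (a DMA write covers an address range, and every cache line in that range has to be invalidated in the cores that hold it) can be sketched roughly as follows. This is a toy model, not a description of the real hardware: the snoop filter is modelled as a plain dictionary from line address to a core bitmask, and the names `lines_touched` and `dma_write_invalidate` are hypothetical. The only hard fact used is the 64-byte cache line size.

```python
CACHE_LINE_BYTES = 64  # x86 cache line size


def lines_touched(addr, length):
    """Return the line-aligned addresses covered by a write of `length` bytes at `addr`."""
    if length <= 0:
        return []
    first = addr // CACHE_LINE_BYTES
    last = (addr + length - 1) // CACHE_LINE_BYTES
    return [n * CACHE_LINE_BYTES for n in range(first, last + 1)]


def dma_write_invalidate(snoop_filter, addr, length):
    """Drop every covered line from the toy snoop filter.

    snoop_filter maps a line address to a bitmask of cores that may hold it
    (assumption: bit N stands for core N). Returns, per invalidated line,
    the list of cores that would be sent an invalidate.
    """
    invalidated = {}
    for line in lines_touched(addr, length):
        mask = snoop_filter.pop(line, 0)
        if mask:
            # Expand the bitmask into a list of core indices.
            invalidated[line] = [c for c in range(mask.bit_length()) if (mask >> c) & 1]
    return invalidated
```

For example, a 64-byte write starting at `0x1030` is not line-aligned, so it touches two lines (`0x1000` and `0x1040`), and both would need to be looked up in the filter.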

