I was reading something a few months ago about windows chipset iterations and PCH upgrades between them and I'm pretty sure I saw something on DMA cache coherency and that it involves the home agent or QHL (Nehalem) but I can't find it now.
So I ask if anyone knows the details of any method of DMA cache coherency that has been employed by Intel and how it works.
Nehalem's global queue on the optimisation manual:
Cacheline requests from the cores or from a remote package or the I/O Hub are handled by the GQ.
The global queue checks to see if the line is on the package and if it is, it snoops the appropriate cores using the core valid bits. If this is a dual socket system then the request will be sent to the QHL (Home agent on SnB) if home snoop is being used which will then send to the QPI link that the NUMA node bitmap refers to. If source snoop is being used then the GQ will check its own 2 bit i/o directory cache in order to generate a message for the correct QPI link the QHL (QPI agent on SnB) must generate another message to the correct LLC that has been assigned that address range. I'm not sure what happens on COD mode on Haswell or SNC on the mesh architecture.