
I'm trying to use monitor/mwait instructions to monitor DMA writes from a device to a memory location. In a kernel module (char device) I have the following code (very similar to this piece of kernel code) that runs in a kernel thread:

static int do_monitor(void *arg)
{
  struct page *p = arg; // p is a 'struct page *'; it's also remapped to user space
  uint32_t *location_p = phys_to_virt(page_to_phys(p)); 
  uint32_t prev = 0;
  int i = 0;
  while (i++ < 20) // to avoid infinite loop
  {
    if (*location_p == prev)
    {
        __monitor(location_p, 0, 0);
        if (this_cpu_has(X86_FEATURE_CLFLUSH_MONITOR))
          clflush(location_p);
        if (*location_p == prev)
          __mwait(0, 0);
    }
    prev = *location_p;
    printk(KERN_NOTICE "%d", prev);
  }
  return 0;
}

In user space I have the following test code:

int fd = open("/dev/mon_test_dev", O_RDWR);
unsigned char *mapped = (unsigned char *)mmap(0, mmap_size, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
for (int i = 1; i <= 5; ++i)
  *mapped = i;
munmap(mapped, mmap_size);
close(fd);

And the kernel log looks like this:

1
2
3
4
5
5
5
5
5
5
5 5 5 5 5 5 5 5 5 5

I.e. it seems that mwait doesn't wait at all. What could be the reason?

Peter Cordes
Igor R.
  • Have you checked that `MONITOR`/`MWAIT` is available (that is, your BIOS hasn't turned it off)? Secondly, you should execute the `clflush` *before* you monitor, else you just invalidate the cache, forcing it to be written back to memory (if it's dirty), thus triggering the wait condition (a combined sketch follows these comments). – Necrolis Feb 10 '14 at 08:59
  • 1
    `MWAIT` can return "early". For one, due to nonmaskable events (NMI, SMI and a few other 'below-OS-control' interrupt mechanisms, as well as async faults), but second, more importantly, due to _ordinary_ interrupts unless they've been explicitly disabled (`__cli()` and/or `local_irq_disable()` in Linux ... not usually a good idea, lots of side effects). Using it _outside the OS'_ `idle()` _loop_ is ... a task pretty much equal to re-implementing that part of the scheduler within your driver code (your code quote is part of Linux' `idle()` ...). Are you writing kernel bypass code ? – FrankH. Feb 10 '14 at 11:04
  • @Necrolis thanks, of course `clflush` was not in the right place; however fixing it didn't help. MONITOR seems to be enabled, as per CPUID. – Igor R. Feb 10 '14 at 11:50
  • 1
    @FrankH. It's not a kernel bybass. I have a device that writes to a specific memory location (via DMA), and I'm experimenting with various ways to try and find out when it writes and what. – Igor R. Feb 10 '14 at 11:51
  • Have you figured out the reason? I have the same problem now. – Jianchen Oct 12 '14 at 01:37
  • 1
    @Jianchen No, I ended up making a busy-wait in the user space - it was good enough for my purposes. – Igor R. Oct 12 '14 at 11:08
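
For reference, here is a minimal sketch combining the two suggestions from the comments above: `clflush` executed *before* arming MONITOR (as in the kernel's own idle path), and ordinary interrupts masked so the timer cannot wake MWAIT. It reuses the names from the question and assumes a 2014-era kernel (newer kernels spell the feature flag `X86_BUG_CLFLUSH_MONITOR`). Note that with IRQs masked and ECX=0, only a store to the armed line (or an NMI/SMI) wakes MWAIT, so this will appear to hang if the DMA write never triggers the monitor:

#include <linux/irqflags.h>      /* local_irq_save/restore */
#include <asm/mwait.h>           /* __monitor, __mwait */
#include <asm/special_insns.h>   /* clflush */
/* plus the module's existing includes for page_to_phys(), phys_to_virt(), printk() */

static int do_monitor(void *arg)
{
  struct page *p = arg;
  volatile uint32_t *location_p = phys_to_virt(page_to_phys(p));
  uint32_t prev = 0;
  unsigned long flags;
  int i = 0;

  while (i++ < 20) // to avoid infinite loop
  {
    local_irq_save(flags);          /* keep ordinary IRQs from waking MWAIT early */
    if (*location_p == prev)
    {
        if (this_cpu_has(X86_FEATURE_CLFLUSH_MONITOR))
          clflush(location_p);      /* errata workaround: flush *before* arming */
        __monitor((void *)location_p, 0, 0);
        if (*location_p == prev)
          __mwait(0, 0);            /* wakes on a store to the armed line (if the
                                       monitor triggers at all), NMI/SMI, etc. */
    }
    local_irq_restore(flags);
    prev = *location_p;
    printk(KERN_NOTICE "%u\n", prev);
  }
  return 0;
}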

1 Answer


The definition of MONITOR/MWAIT semantics does not explicitly specify whether DMA transactions may or may not trigger it. The implication is that triggering is meant to happen for stores performed by a logical processor.

Current descriptions of MONITOR and MWAIT in Intel's official Software Developer's Manual are quite vague in that respect. However, there are two clauses in the MONITOR section that caught my attention:

  1. The content of EAX is an effective address (in 64-bit mode, RAX is used). By default, the DS segment is used to create a linear address that is monitored.

  2. The address range must use memory of the write-back type. Only write-back memory will correctly trigger the monitoring hardware.

The first clause states that MONITOR is meant to be used with linear addresses, not physical ones. Devices and their DMA are meant to work with physical addresses only. So basically this means that all agents relying on the same MONITOR range should operate in the same domain of virtual memory space.
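
To make the linear/physical split concrete, here is a hypothetical allocation snippet (only the kernel APIs are real; the function name is made up): the CPU side would arm MONITOR on the kernel virtual (linear) address returned by dma_alloc_coherent(), while the device's DMA engine is programmed with the separate bus/DMA address:

#include <linux/dma-mapping.h>

static void *cpu_va;        /* linear address: what MONITOR would be armed on */
static dma_addr_t dev_dma;  /* bus/DMA address: what the device is given */

static int alloc_monitored_buffer(struct device *dev)
{
  cpu_va = dma_alloc_coherent(dev, PAGE_SIZE, &dev_dma, GFP_KERNEL);
  if (!cpu_va)
    return -ENOMEM;

  /* CPU side:    __monitor(cpu_va, 0, 0);                                  */
  /* device side: program dev_dma into the device's DMA-address register    */
  /*              (device-specific, not shown)                              */
  return 0;
}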

The second clause requires the monitored memory region to be cacheable (write-back, WB). For DMA, the respective memory range usually has to be marked as uncacheable, or write-combining at best (UC or WC). This is an even stronger indication that your intent to have MONITOR/MWAIT triggered by DMA is very unlikely to work on current hardware.
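
As an illustration of where the memory type gets decided, here is a hypothetical mmap handler for the char device (`shared_page` stands in for the page from the question): leaving vm_page_prot alone keeps the mapping write-back, while the commented-out pgprot_noncached()/pgprot_writecombine() variants would make it UC/WC and, per the clause above, unsuitable for MONITOR:

#include <linux/fs.h>
#include <linux/mm.h>

extern struct page *shared_page;   /* hypothetical: the page from the question */

static int mon_test_mmap(struct file *filp, struct vm_area_struct *vma)
{
  unsigned long pfn = page_to_pfn(shared_page);

  /* vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);     -> UC  */
  /* vma->vm_page_prot = pgprot_writecombine(vma->vm_page_prot);  -> WC  */
  /* default vm_page_prot -> write-back (WB), what MONITOR requires      */

  return remap_pfn_range(vma, vma->vm_start, pfn,
                         vma->vm_end - vma->vm_start, vma->vm_page_prot);
}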


Considering your high-level goal - being able to tell when a device has written to a given memory range - I cannot think of any robust method to achieve it besides using virtualization for devices (VT-d, IOMMU, etc.). Basically, the classic approach for a peripheral device is to raise an interrupt when it is done writing to memory. Until the interrupt arrives, there is no way for the CPU to tell whether all DMA bytes have reached their destination in memory.
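
The classic interrupt-driven pattern can be sketched with standard Linux primitives; the ISR body and the device/IRQ names are hypothetical, only the kernel APIs are real:

#include <linux/interrupt.h>
#include <linux/completion.h>
#include <linux/errno.h>

static DECLARE_COMPLETION(dma_done);

static irqreturn_t dma_done_isr(int irq, void *dev_id)
{
  /* a real driver would read/acknowledge the device's status register here */
  complete(&dma_done);
  return IRQ_HANDLED;
}

/* during setup:  err = request_irq(irq, dma_done_isr, 0, "mon_test_dev", dev); */

/* consumer side, e.g. in the char device's read() path: */
static int wait_for_dma_write(void)
{
  if (wait_for_completion_interruptible(&dma_done))
    return -ERESTARTSYS;
  return 0;   /* the DMA'ed data is now visible to the CPU */
}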

Device virtualization makes it possible to abstract physical addresses away from a device in a transparent manner, and to get an equivalent of a page fault when the device attempts to write to or read from memory.

Grigory Rechistov
  • This doesn't sound right. Whether DMA uses physical addresses or not is really irrelevant. What matters is that the DMA request is cache coherent, which it can very well be. – Hadi Brais Aug 14 '19 at 01:20
  • @HadiBrais there may be cases when DMA triggers a MONITOR'ed range. It will correspond to spurious wakeups from MWAIT, allowed by the documentation, but not guaranteed to happen consistently between implementations. Linear/physical addresses was to underline the apparent hint in the documentation on the intended use of the instruction. "What matters is that the DMA request is cache coherent" — there is no documented link between the MWAIT operation and underlying cache infrastructure. And even if there is a connection, it may be very complex in cases of e.g. multi-socket systems. – Grigory Rechistov Aug 14 '19 at 07:36
  • That's all fine. My point was that maybe the MONITOR hardware does consistently work with coherent DMA stores, even if the documentation doesn't clearly specify that. The original question was "why is mwait not waiting?" This might have nothing to do with DMA. For example, the OP might not have disabled timer interrupts, which is one plausible explanation for why `printk` in the while loop is being executed multiple times. Another possible, but perhaps less likely, reason is that `location_p` does not point to memory of type WB, which is required on Intel processors. – Hadi Brais Aug 14 '19 at 15:03