I am working on a Cortex-A72 (Armv8) and I need to implement this pseudocode:

put addr1 into X9
put addr2 into X10
for i := 0 to N − 1 do
    STR X0, [X9]
    STR X0, [X10]
    DC CVAC, X9
    DC CVAC, X10

Here is my C code:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

int main(){

    unsigned char temp = 0xff;
    unsigned char *mem;
    mem = mmap(NULL, BUF_SIZE, PROT_READ | PROT_WRITE, MAP_ANON | MAP_PRIVATE, -1, 0);
    if (mem == MAP_FAILED){
        perror("mmap()");
        return 1;
    }
    memset(mem,0xff,BUF_SIZE);
    /* Select two random page-aligned addresses within the memory pool;
       cast before shifting so the shift cannot overflow int */
    size_t offset1 = ((size_t)rand()<<12)%BUF_SIZE;
    size_t offset2 = ((size_t)rand()<<12)%BUF_SIZE;
    unsigned char *addr1 = (unsigned char*) (mem+offset1);
    unsigned char *addr2 = (unsigned char*) (mem+offset2);

    for (int i = 0; i < 10000; i++){
        asm volatile("str %x1, %x0" : "=m"(*addr1) : "r"(temp));
        asm volatile("str %x1, %x0" : "=m"(*addr2) : "r"(temp));
        __asm__ __volatile__("dc cvac, %0\n\t" : : "r" (addr1));
        __asm__ __volatile__("dc cvac, %0\n\t" : : "r" (addr2));
    }
      
     .
     .
     .
    
}

I just want to know whether I am using the inline assembly correctly. The goal is to access data directly in physical memory, bypassing the cache.

Peter Cordes
fred_bd
  • You should try real assembly first then if you really really really have to then convert that to inline. – old_timer Jan 10 '21 at 22:53
  • `__NR_getrandom` syscall (384) will do for your random needs. If you need to allocate storage, `__NR_brk` syscall (45) will do -- but I would suggest using another storage pool (.bss), etc.. if it will fit. – David C. Rankin Jan 10 '21 at 23:21
  • Are you aware that you're doing 64-bit stores? The code in general looks like you want 8-bit stores (`strb`). Is this code just supposed to benchmark memory speed, or does it have some other purpose? – Nate Eldredge Jan 10 '21 at 23:25
  • GNU inline assembler can heavily optimize. It is at liberty to reuse/trash _any_ register _between_ `asm` blocks. To prevent that, I'd combine your _four_ separate blocks into a _single_ block. And, in general, you should ensure that you add a _clobber_ for each register that is used that is _not_ an input or output register – Craig Estey Jan 11 '21 at 00:08
  • I did a quick web search on `dc cvac`. It appears to be related to "cleaning the cache to the point of coherency": https://developer.arm.com/docs/ddi0595/latest/aarch64-system-instructions/dc-cvac I'm wondering _why_ you want to do that since `mmap` usually sets things up correctly [including flushing the cache if needed]. – Craig Estey Jan 11 '21 at 00:15
  • Don't forget a `"memory"` clobber on your `dc` instructions; you need previous (in source order) memory accesses to be done before the `dc` instruction runs. Or maybe a dummy memory input operand could tell the compiler this instruction depends on the contents of that address. – Peter Cordes Jan 11 '21 at 04:22
  • @CraigEstey: I don't think this code is depending on registers living across asm statements. It actually *is* 2 separate stores, and 2 separate `dc cvac` instructions, using C variables properly as data inputs and memory-operand output. The only risk is that the `dc` statements aren't seen as having a dependency on the stores (see my previous comment), but actually the use of `volatile` makes reordering or hoisting impossible. Not sure why the OP wants this sequence of instructions in a loop body, but it's basically safe. (Except for only telling the compiler about a byte output but doing 8) – Peter Cordes Jan 11 '21 at 06:40
  • @NateEldredge no other purpose, the goal is to access the underlying DRAM in each loop iteration – fred_bd Jan 11 '21 at 08:44
  • @CraigEstey So, how would you implement that? – fred_bd Jan 11 '21 at 08:45
  • @PeterCordes Are you sure the memory clobber on the `dc` instruction is necessary? – fred_bd Jan 11 '21 at 08:46
  • @CraigEstey I am not sure I got all that. What does "`mmap` usually sets things up correctly" mean? What would be the difference if, for example, I used an array? – fred_bd Jan 11 '21 at 08:55
  • A memory clobber isn't necessary if you keep the `volatile` on those statements and the stores; I wasn't accounting for that in my reasoning. If there's ever a case where you want dead-store elimination to be able to work, though, you'd need to remove `volatile` and use a `"+m"(*addr1)` dummy RMW operand in the `dc` statement, to tell the compiler it needs the value in memory. Telling the compiler it writes the memory is an overstatement, but the statement needs some output to not be implicitly `volatile`. And telling the compiler we write that mem will make sure it's before any read – Peter Cordes Jan 11 '21 at 09:00
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/227140/discussion-between-fred-bd-and-peter-cordes). – fred_bd Jan 11 '21 at 09:04
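
Pulling the comments above together, here is a minimal sketch of how the loop might look with the operand width and the compiler dependency made explicit. It is not a confirmed answer: the helper names (`store_byte`, `clean_line`, `write_and_clean_loop`) and the buffer size passed in are hypothetical, and the non-AArch64 branch is only a fallback so the file builds elsewhere (it performs no cache maintenance). It assumes a Linux-like system where `dc cvac` is permitted from user space. The store uses `strb` with `%w1` so that the instruction writes exactly the one byte described by the `"=m"` operand, and the `dc cvac` statement carries a `"memory"` clobber so the compiler keeps earlier memory accesses ahead of it.

```c
#define _DEFAULT_SOURCE   /* exposes MAP_ANON under strict -std= modes on glibc */
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

/* Hypothetical helper: store a single byte at p. */
static inline void store_byte(unsigned char *p, unsigned char v)
{
#if defined(__aarch64__)
    /* strb with a W register is an 8-bit store, matching the 1-byte "=m" operand. */
    __asm__ __volatile__("strb %w1, %0" : "=m"(*p) : "r"(v));
#else
    /* Fallback for other targets: a plain volatile store. */
    *(volatile unsigned char *)p = v;
#endif
}

/* Hypothetical helper: clean the cache line holding p to the point of coherency. */
static inline void clean_line(unsigned char *p)
{
#if defined(__aarch64__)
    /* The "memory" clobber stops the compiler from moving memory accesses across this. */
    __asm__ __volatile__("dc cvac, %0" : : "r"(p) : "memory");
#else
    (void)p;  /* no cache maintenance in the fallback build */
#endif
}

/* Runs the store/store/clean/clean loop from the question on two page-aligned
 * addresses, then reads both bytes back into out[]. Returns 0 on success,
 * -1 if the mapping fails. */
int write_and_clean_loop(size_t buf_size, int iterations, unsigned char out[2])
{
    assert(buf_size >= 4096 && buf_size % 4096 == 0);

    unsigned char *mem = mmap(NULL, buf_size, PROT_READ | PROT_WRITE,
                              MAP_ANON | MAP_PRIVATE, -1, 0);
    if (mem == MAP_FAILED)
        return -1;
    memset(mem, 0x00, buf_size);

    /* Cast before shifting so the shift is done in size_t, not int. */
    unsigned char *addr1 = mem + (((size_t)rand() << 12) % buf_size);
    unsigned char *addr2 = mem + (((size_t)rand() << 12) % buf_size);

    for (int i = 0; i < iterations; i++) {
        store_byte(addr1, 0xff);
        store_byte(addr2, 0xff);
        clean_line(addr1);
        clean_line(addr2);
    }

    out[0] = *addr1;
    out[1] = *addr2;
    munmap(mem, buf_size);
    return 0;
}
```

Calling `write_and_clean_loop(1u << 20, 10000, out)` from `main` would mirror the loop in the question. If the intent really is the 64-bit `STR X0` from the pseudocode, the same shape should work with a `uint64_t` operand and `str %x1, %0` instead, so that the operand type and the store width still agree.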
