
I'm running my application on a Genesys ZU dev board (Zynq UltraScale+ MPSoC). I have an accelerator in the PL that is connected to the PS through a simple AXI DMA. The DMA accesses DDR memory through a normal, non-coherent FPD slave port on the PS. The application runs on one of the A53 cores in the PS.

I've verified with an ILA that the data being written to the AXI slave port is correct. However, some of the data I read back in software was incorrect. At least part of the problem was the A53's cache. As a temporary workaround I now disable the D-cache at the start of the program, so the cache should no longer be a factor. Even so, the first time I print/read from the array of received data, I get an incorrect value. Subsequent reads return the correct value. What gives? How is this happening?

Using the Vitis debugger/memory viewer, I've verified that the correct data is present at the memory location I allocated and told the DMA to write to.

Below is a trimmed-down version of the program, with the parts that work correctly removed.

#define CACHE_LINE_SIZE 64

int main(void)
{
    Xil_DCacheDisable();

    //A bunch of DMA initialization
    ...

    //Send data to accelerator through DMA, no issues here
    ...
    
    float* outputCorrelation;
    const size_t outputCorrelationSizeBytes = sizeof(*outputCorrelation) * 80;
    outputCorrelation = aligned_alloc(CACHE_LINE_SIZE, outputCorrelationSizeBytes);
    if(outputCorrelation == NULL) {
        printf("Aligned Malloc failed\n");
        return XST_FAILURE;
    }

    //Initiate data receive transfer first
    int result = XAxiDma_SimpleTransfer(&axiDma,(UINTPTR) outputCorrelation, outputCorrelationSizeBytes, XAXIDMA_DEVICE_TO_DMA);
    if(result != XST_SUCCESS) {
        return result;
    }

    //Send data - assembledData allocation isn't shown as no problems here
    result = XAxiDma_SimpleTransfer(&axiDma,(UINTPTR) assembledData, sizeof(*assembledData) * inLen, XAXIDMA_DMA_TO_DEVICE);
    if(result != XST_SUCCESS) {
        return result;
    }

    //Wait for completion interrupts from DMA
    ...

    for(size_t x = 0; x < 80; x++) {
        printf("[%zu]\t%f\n", x, outputCorrelation[x]);
    }
}

The expected output is the value 4 for every element of the array.

Output:

[0] -nan
[1] 4.000000
[2] 4.000000
[3] 4.000000
[4] 4.000000
...
[79] 4.000000
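
For a check that doesn't depend on reading the printout, the comparison against the expected value can be done in code. This is only a sketch: `first_mismatch` is a hypothetical helper, not part of the original program, and it reads through a pointer to `volatile` so the compiler cannot drop or merge the loads.

```c
#include <stddef.h>

/* Hypothetical helper: returns the index of the first element that differs
 * from `expected`, or `len` if every element matches. NaN compares unequal
 * to every value, so a NaN element is always reported as a mismatch. */
static size_t first_mismatch(const volatile float *buf, size_t len, float expected)
{
    for (size_t i = 0; i < len; i++) {
        float v = buf[i];   /* exactly one load per element */
        if (v != expected) {
            return i;
        }
    }
    return len;
}
```

Calling `first_mismatch(outputCorrelation, 80, 4.0f)` right after the completion interrupts would return 0 for the output shown above and 80 when the whole buffer is correct.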

If I add a print of any value of the array prior to the for loop, the first value becomes correct and all values in the for loop are perfect. What's going on here and how can I solve it?

Edit: I thought the compiler might be optimizing away the read, since none of the functions write directly to the allocated array, so I tried marking the output buffer as volatile. This did not change the behavior.
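
For reference, this is the qualifier placement I mean, as a minimal sketch (`read_element` is a hypothetical helper, not part of the real program). Note that `volatile` only constrains the compiler. It has no effect on caches or on the ordering of DMA writes relative to CPU reads.

```c
#include <stddef.h>

/* `volatile` to the left of `*` qualifies the pointed-to data: every read
 * through `buf` must actually be performed and cannot be cached in a
 * register or removed by the optimizer.
 *
 *     volatile float *p;   // pointer to volatile float  (what is needed here)
 *     float *volatile q;   // volatile pointer to plain float (not sufficient)
 */
static float read_element(const volatile float *buf, size_t index)
{
    return buf[index];
}
```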


I did some more testing with my PL accelerator and tried connecting it to the LPD ports of the PS so I could use the RPU instead of the APU. Running the exact same code on the RPU instead of the APU yielded the expected result. I suspect there are still some cache coherency issues on the APU even though I disabled the D-cache.

Something I also didn't mention earlier is that when I single-step through my code, the issue does not exist. When still using the debugger but running through the critical sections, the issue does exist.

Christopher Moore
  • "If I add a print of any value of the array prior to the for loop, the first value becomes correct and all values in the for loop are perfect. What's going on here and how can I solve it?" Could you explain this a little more? Do you add a print like printf("%f",array[44]);? – Fra93 Jun 26 '22 at 19:36
  • When you say you marked the output buffer as volatile, did you write volatile float * outputCorrelation; or float* volatile outputCorrelation;? To indicate that the data pointed to by outputCorrelation is volatile, you need the first. – Fra93 Jun 26 '22 at 19:39
  • The last question I have for you is, did you try to isolate the assembly generated in case of the premature printf when everything works fine and the other code that prints the nan? Did you try to build with all the optimizations turned off? -O0 -g – Fra93 Jun 26 '22 at 19:42
  • @Fra93 For your first question, your assumption is correct. For the second, I used the correct volatile qualifier. For the third, I haven't played around with the generated assembly other than peeking at it a few times when single-stepping. Not sure what you mean by premature printf. I have tried running this in both the Debug and Release modes in Vitis. I believe Debug mode compiles with all optimizations disabled, but I would have to check to be sure. I also updated the question with some more discoveries I made. – Christopher Moore Jun 27 '22 at 18:25
  • My experience is that once you disable the dcache you should be fine. However, you reminded me of a solution I once found for a similar problem. Did you reserve, in your linker script, a memory area for the DDR zone you write with your accelerator/dma? – Fra93 Jun 28 '22 at 09:22
  • @Fra93 I dynamically allocated the memory with `aligned_alloc`. The heap is reserved in DDR memory in the linker script. So I think that's all good? – Christopher Moore Jun 28 '22 at 14:06
  • Oh, yes, sorry I missed that. I was using a static portion of the DDR defined as a global variable, that's why I had to insert the section in the linker script. Did you do any other test to debunk this mysterious behaviour? – Fra93 Jun 28 '22 at 15:59
  • I have another suggestion: you say that *"I've verified with an ILA that the data being written to the AXI slave port is correct. However, some of the data I'm reading back in software was incorrect"* so I am asking, what is coming out of the AXI4 master port of the PS **the first time you access the first element**? I want to see the AXI4 transaction of the "NaN" case. Can you add the waveforms to the question? This will definitely tell us whether it is a cache problem or something else. – Fra93 Jun 28 '22 at 16:55
  • @Fra93 The data being written by the DMA to the AXI4 port is correct. – Christopher Moore Jun 28 '22 at 17:59
  • I am sorry I can't be of more help, I keep thinking about this question. What happens if you write down the address of the first element of the array, then stop with the debugger before accessing it and use the xsct console with `mrd `? Does it print the correct value? – Fra93 Jun 28 '22 at 22:25
  • Another option would be to invalidate the cache before reading, by calling `Xil_DCacheInvalidate`, instead of disabling it. (https://github.com/Xilinx/embeddedsw/blob/master/lib/bsp/standalone/src/arm/cortexa9/xil_cache.c) – Fra93 Jun 28 '22 at 22:28
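
The invalidate suggested in the last comment is usually applied to just the receive buffer, not the whole cache. Below is a minimal sketch of the address arithmetic that involves, assuming a 64-byte line as in the question's `CACHE_LINE_SIZE`. `CacheRange` and `cache_line_span` are hypothetical names, and the range-based BSP call in the closing comment (`Xil_DCacheInvalidateRange`) does not appear in the thread, so check it against the BSP version in use.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_SIZE 64u   /* assumed line size, as in the question */

/* Hypothetical type: a cache-line-aligned address range. */
typedef struct {
    uintptr_t start;   /* rounded down to a line boundary */
    size_t    length;  /* whole number of cache lines, in bytes */
} CacheRange;

/* Expand [addr, addr + len) to whole cache lines so that an invalidate
 * covers every line the DMA may have written. */
static CacheRange cache_line_span(uintptr_t addr, size_t len)
{
    const uintptr_t mask = (uintptr_t)CACHE_LINE_SIZE - 1u;
    assert((CACHE_LINE_SIZE & (CACHE_LINE_SIZE - 1u)) == 0u); /* power of two */

    CacheRange r;
    r.start = addr & ~mask;
    const uintptr_t end = (addr + len + mask) & ~mask;
    r.length = (size_t)(end - r.start);
    return r;
}

/* Intended use on the target, after the DMA completion interrupt and with
 * the D-cache left enabled (assumption, not host-runnable):
 *
 *     CacheRange r = cache_line_span((uintptr_t)outputCorrelation,
 *                                    outputCorrelationSizeBytes);
 *     Xil_DCacheInvalidateRange((INTPTR)r.start, (INTPTR)r.length);
 */
```

With the buffer from the question (80 floats, 64-byte aligned) the span is exactly the 320 bytes allocated, so the invalidate would not touch any neighbouring heap data.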
