3

It's clearly mentioned that the Tegra TX1 has shared memory. My question: is that memory shared between the CPU and the GPU, or is it shared between different blocks in the GPU?

2 Answers

5

The CPU and GPU use the same memory system: the system DRAM is also the physical memory from which GPU global memory is allocated. Various techniques, such as zero-copy and Unified Memory, can mostly eliminate even the logical distinction between system-memory data and GPU global data.
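
As a rough illustration of how this shared physical memory can be used, here is a minimal Unified Memory sketch. It is not from the answer: the kernel, array size, and launch configuration are illustrative assumptions. A single `cudaMallocManaged()` allocation is touched by both CPU and GPU code, with no explicit `cudaMemcpy()`:

```cpp
// Minimal sketch (illustrative assumptions: kernel, size, launch configuration).
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;              // operate directly on the managed buffer
}

int main()
{
    const int n = 1 << 20;
    float *data = nullptr;

    // One allocation visible to both CPU and GPU code; no cudaMemcpy needed.
    cudaMallocManaged(&data, n * sizeof(float));

    for (int i = 0; i < n; ++i)       // CPU fills the input in place
        data[i] = (float)i;

    scale<<<(n + 255) / 256, 256>>>(data, n);
    cudaDeviceSynchronize();          // wait before the CPU touches the data again

    printf("data[42] = %f\n", data[42]);
    cudaFree(data);
    return 0;
}
```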

Furthermore, the GPU in a Tegra TX1, like all CUDA-capable GPUs, has CUDA shared memory. This is memory that is shared between the threads of a particular block, but it is not shared between different blocks in the GPU. The memory system that is shared between different blocks is global memory, which on the Tegra TX1 is (physically) the same as the system DRAM, as already mentioned.
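
To make the per-block nature of CUDA shared memory concrete, here is a short sketch (my own, not part of the answer; the tile size, kernel, and launch parameters are assumptions). Each block reduces its own tile in `__shared__` memory, and blocks exchange results only through global memory:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Each block reduces its own 256-element tile in __shared__ memory (visible
// only to that block); per-block results are exchanged through global memory.
__global__ void blockSum(const float *in, float *blockSums, int n)
{
    __shared__ float tile[256];              // one copy per block
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;   // assumes blockDim.x == 256

    tile[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                         // wait until the whole block has written

    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            tile[tid] += tile[tid + stride];
        __syncthreads();
    }

    if (tid == 0)
        blockSums[blockIdx.x] = tile[0];     // other blocks see this only via global memory
}

int main()
{
    const int n = 1 << 20, threads = 256, blocks = (n + threads - 1) / threads;
    float *in, *blockSums;

    // Managed allocations keep the example short; cudaMalloc + cudaMemcpy works too.
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&blockSums, blocks * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = 1.0f;

    blockSum<<<blocks, threads>>>(in, blockSums, n);
    cudaDeviceSynchronize();

    float total = 0.0f;
    for (int b = 0; b < blocks; ++b) total += blockSums[b];
    printf("sum = %f (expected %d)\n", total, n);

    cudaFree(in);
    cudaFree(blockSums);
    return 0;
}
```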

Robert Crovella
  • So basically, while writing CUDA programs, I don't have to worry about cudaMalloc, cudaMemcpy, and cudaFree statements? – kshitij srivastava Jun 30 '16 at 21:44
  • Not correct. There is still a *logical* distinction between host and device memory. You can work around or blur these distinctions, if you wish, using techniques such as zero-copy or unified memory. – Robert Crovella Jun 30 '16 at 21:46
  • So I have a very basic question. Suppose I am writing a CUDA program to square some numbers that sit in an array in CPU memory. The usual way to write CUDA code, where CPU and GPU memories are separate, is to copy the entire array from host memory into device memory and do the required computation there. My question is: on a Tegra TX1, if we follow this procedure, is it going to create a duplicate of the data in DRAM? – kshitij srivastava Jul 01 '16 at 13:42
  • Yes, and zero-copy and unified memory are both techniques that could prevent the duplication. – Robert Crovella Jul 01 '16 at 13:43
  • @kshitij srivastava, the logical distinction of the global (CPU + GPU main) memory means only this: you just need to call `cudaMalloc()` at the very beginning of the program and hand the pointer(s) [in CPU host code] to sensors or other *hardware* to fill in the data. Once the hardware signals that the buffer is full (if needed, *unlock* the buffer from the HW queue first), you can immediately process the data in CUDA, because it already *is* in the CUDA global memory you *marked* at the beginning. That's *all*. – Filip OvertoneSinger Rydlo Dec 19 '16 at 22:18
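
To make the duplication question from the comments above concrete, here is a hedged zero-copy sketch (one possible approach, not code from the thread; the kernel name, sizes, and flags are illustrative). A mapped, pinned host allocation gives the GPU a device pointer to the same physical buffer, so the array being squared is never duplicated in DRAM:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

__global__ void square(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= data[i];
}

int main()
{
    const int n = 1024;
    float *hostPtr = nullptr, *devPtr = nullptr;

    // Pinned, mapped host allocation: one physical buffer, addressable from
    // both sides (mapped host memory is supported on Tegra-class devices).
    cudaHostAlloc((void **)&hostPtr, n * sizeof(float), cudaHostAllocMapped);
    cudaHostGetDevicePointer((void **)&devPtr, hostPtr, 0);

    for (int i = 0; i < n; ++i)
        hostPtr[i] = (float)i;            // CPU writes the input in place

    square<<<(n + 255) / 256, 256>>>(devPtr, n);
    cudaDeviceSynchronize();              // results are now visible through hostPtr

    printf("hostPtr[3] = %f\n", hostPtr[3]);
    cudaFreeHost(hostPtr);
    return 0;
}
```
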
0

If you have allocated the memory block using `cudaMalloc()`, then yes: it automatically becomes global memory shared between the CPU and GPU.

Please do not confuse this with the on-chip CUDA memory called "shared memory", which is shared only between threads of the same block.

Remember: "shared memory" in CUDA is the very fast, programmable cache inside each of the GPU's SM units.

  • "If you have allocated the memory block using `cudaMalloc()`, then yes: it automatically becomes global memory shared between the CPU and GPU." No, it does not. Memory allocated via `cudaMalloc()` is directly accessible ***only*** to device code. It is not directly accessible from host (CPU) code. – Robert Crovella Apr 27 '23 at 21:02