8

I'm writing an algorithm in OpenCL in which I'd need every work unit to remember a fair portion of data, say something between a long[70] and a long[200] or so per kernel.

Recent AMD devices have 32 KiB __local memory, which is (for the given amount of data per kernel) enough to store the info for 20-58 work units. However, from what I understand from the architecture (and especially from this drawing), each shader core also has a dedicated amount of private memory. I however fail to find its size.

Can anyone tell me how to find out how much private memory each kernel has?

I'm particularly curious about the HD7970, since I plan to buy some of these soon.

Edit: Problem solved, the answer is here in appendix D.

Luís Cruz
  • 14,780
  • 16
  • 68
  • 100
user1111929
  • 6,050
  • 9
  • 43
  • 73
  • 2
    I don't believe private memory is dedicated per core - it maps to the register file, which is per compute unit resource. Each work item gets registers allocated from the compute unit register file, how many are required determines the number of wavefronts in flight at any given instant. – talonmies Feb 17 '12 at 17:00
  • From the famous everywhere-seen drawing http://www.codeproject.com/KB/showcase/Memory-Spaces/image001.jpg I concluded that the private memory is physically different from the __local memory, no? – user1111929 Feb 17 '12 at 23:29
  • 2
    Yes, they are physically different. Private memory maps to the compute unit register file, local memory to compute unit level shared memory in most modern AMD devices. A few early OpenCL compatible GPUs didn't have on die shared memory, and local memory was just SDRAM. Neither is per core, and how much you use per work item for private and per work group for local effects the number of concurrent wavefronts running per compute unit. – talonmies Feb 18 '12 at 05:12
  • Ok. Then I should re-word my question: how large is this register file? How to find out its size, either in general or for the HD7970 specifically. – user1111929 Feb 18 '12 at 08:08
  • 1
    You have not understood, I think - private memory is (like the name says) private to each work item. But it is allocated to each work item from the compute unit register file(s), which acts as a common resource pool for all the work items running on a given compute unit. And I am pretty sure AMD's compiler puts a hard limit of 256 registers per work unit, irrespective of the size of the register file(s) on the GPU. – talonmies Feb 18 '12 at 15:02
  • And what is the size of 1 register? 64 bits? If so, that's a hard limit of 2 KB per work item, which is quite huge, no? I assume it should be much smaller (otherwise my problem is trivially solved as it can contain the long[200] purely in the register). – user1111929 Feb 18 '12 at 17:24
  • 1
    I think each register is a 32 bit word. But remember that all of the other variables in your code also consume registers. I think I remember typical AMD GPUs have a 64kb register file per compute unit which needs to be shared by a minimum either 4 or 8 wavefronts of 64 work items each. But I don't use their hardware much , so that might not be correct. Check the current release notes in their OpenCL SDK. – talonmies Feb 18 '12 at 17:39
  • Indeed, there it is, thanks! It's in Appendix D of the AMD APP OpenCL Programming Guide http://developer.amd.com/sdks/amdappsdk/assets/amd_accelerated_parallel_processing_opencl_programming_guide.pdf. Apparently a register is 128 bits (4x32) and there are 16384 for all modern high-end devices, so that's a remarkable 256KB per compute unit. Nice! If you can put this in a new answer, I can accept it and close the topic. – user1111929 Feb 18 '12 at 22:40

3 Answers3

4

The answer was given by user talonmies in the comments, so I'll write it in a new answer here to close the question.

These values can be found in Appendix D of the AMD APP OpenCL Programming Guide http://developer.amd.com/sdks/amdappsdk/assets/amd_accelerated_parallel_processing_opencl_programming_guide.pdf (a similar document exists for nVidia). Apparently a register is 128 bits (4x32) for AMD devices and there are 16384 registers for all modern high-end devices, so that's a remarkable 256KB per compute unit.

user1111929
  • 6,050
  • 9
  • 43
  • 73
0

I think you are looking for __local memory. That is what 32KB of local data storage is referring to. I don't think you can poll the device to get the private memory amount.

You can pass in a NULL long* cl_mem reference to allocate the memory. I think it is best to use a static amount of memory per WI. Assuming that long[200] will be required for each work item, you would use the code below. It would also be a good idea to divide the work into groups that have the same (or similar) memory requirements, in order to get the most out of the LDS memory.

void __kernel(__local long* localMem, const int localMemPerItem
       //more args...
       )
{
  //host has 'passed' localMemPerItem*get_local_size() long values in as locamMem
  //this work item has access to all of it, but can choose to restrict
  //itself to only the portion it needs.
  //work group size will be limited to CL_DEVICE_LOCAL_MEM_SIZE/(8*localMemPerItem)
  int startIndex=localMemPerItem*get_local_id(0);
  //use localMem[startIndex+ ... ]
}
mfa
  • 5,017
  • 2
  • 23
  • 28
  • 1
    You cannot poll it, but does it exist? From the famous everywhere-seen drawing http://www.codeproject.com/KB/showcase/Memory-Spaces/image001.jpg I assumed that there is a physically separate set of private registers on each work unit. No? I hoped to somehow do better than a CL_DEVICE_LOCAL_MEM_SIZE/(8*localMemPerItem) limitation, as it roughly leaves half of the cores unused. Accessing global memory would probably be way too slow, even though it is only incrementing a counter. – user1111929 Feb 17 '12 at 23:28
  • 1
    I found some more info about cypress, cayman, and fermi register sizes here: http://www.realworldtech.com/page.cfm?ArticleID=RWT121410213827&p=11 You should be able to tweak some decent sized private vars into that size. I think that the LDS will still be your best bet though. – mfa Feb 19 '12 at 15:08
0

To answer how large is register file in a 79xx series card, since its based on GCN architecture it is 64KB as per the image in anandtech : http://www.anandtech.com/print/5261

To answer your question how to find out how much memory each kernel uses.. you can look run AMD APP Profiler on your kernel, it tell you in the kernel occupancy section how much space is utilized by the kernel.

kiranputtur
  • 338
  • 1
  • 10
  • Oh really? That's weird. I thought to have found the answer, but it's a different one. In the AMD OpenCL programming guide http://developer.amd.com/sdks/amdappsdk/assets/amd_accelerated_parallel_processing_opencl_programming_guide.pdf in Appendix D, there is the total register file size, and it's listed as 256 KB for all modern devices. Which is correct now? :S – user1111929 Feb 20 '12 at 15:28
  • I believe both are correct. As I understand it, In the GCN architecture, one SIMD unit has a 64kb of register file, and there are 4 SIMD units per compute unit, ie. 4 * 64kb = 256kb of total register file per compute unit. – talonmies Feb 20 '12 at 15:45