Are there performance/storage differences between uint2 and uint64_t in CUDA 10+?

Question

I'm trying to optimize a piece of code for A100 GPUs (Ampere generation). Right now we use uint64_t, but I am seeing uint2 data types being used instead in some CUDA code. Does the uint2 offer advantages for register usage? I know there are a limited number of 64-bit registers; does uint2 split the x,y ints across 32-bit registers for better occupancy? I couldn't find any specific information about register storage with these data types, so any links to documentation would be appreciated.

Without a concrete example, I perceive this as a speculative question about implementation artifacts (so not documented, not guaranteed, subject to change at any time). By observation: GPU registers comprise 32 bits each. Any 64-bit data type therefore must occupy two registers. Where 64-bit operands are consumed or produced by machine instructions, they occupy a register pair (two *consecutive* registers, with the least significant 32 bits stored in an even-numbered register, e.g. R0, R2, etc.). In all other cases the compiler is free to store a 64-bit operand in any two registers. — njuffa, Mar 07 '22 at 21:59
No, there are not performance/storage differences between `uint2` and `uint64_t` in CUDA. — Robert Crovella, Mar 07 '22 at 23:09

talonmies · Accepted Answer · 2022-03-08T01:17:12.537

Does the uint2 offer advantages for register usage?

No.

I know there are a limited number of 64-bit registers

Indeed. Extremely limited, i.e. zero. There are no 64-bit registers in any CUDA-compatible GPU I am aware of. When the compiler encounters a 64-bit type, it composites it from two adjacent 32-bit registers.
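
As a rough illustration of why the two types cost the same storage, here is a minimal sketch showing that a `uint64_t` and a `uint2` are both 8 bytes and can be converted into each other by splitting or joining two 32-bit words. The helper names `split_u64` and `join_u64` are hypothetical and were made up for this example; they are not part of the CUDA toolkit.

```cuda
#include <cstdint>
#include <cuda_runtime.h>

// Both types occupy 8 bytes, i.e. two 32-bit registers' worth of data.
static_assert(sizeof(uint2) == sizeof(uint64_t), "uint2 and uint64_t are both 8 bytes");

// Hypothetical helper: split a 64-bit value into two 32-bit words
// (x = least significant word, y = most significant word).
__host__ __device__ inline uint2 split_u64(uint64_t v)
{
    return make_uint2(static_cast<unsigned int>(v),
                      static_cast<unsigned int>(v >> 32));
}

// Hypothetical helper: reassemble the 64-bit value from its two words.
__host__ __device__ inline uint64_t join_u64(uint2 h)
{
    return (static_cast<uint64_t>(h.y) << 32) | h.x;
}
```

To check register usage for a real kernel rather than rely on this sketch, compiling with `nvcc -arch=sm_80 -Xptxas -v` prints the per-kernel register count, and `cuobjdump -sass` shows the generated machine code.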

does uint2 split the x,y ints across 32-bit registers for better occupancy?

No. All the CUDA built-in vector types exist for memory bandwidth optimization (there are vector load/store instructions in PTX) and for compatibility with the texture/surface hardware, which can do filtering on some of those types and can therefore be better for performance.
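
The memory-bandwidth point can be sketched as follows: a kernel in which each thread reads one `uint2` element, so both 32-bit lanes can arrive in a single vector load. The names `lane_sum` and `sum_pairs` are hypothetical and chosen for this example, and the exact instruction emitted depends on the compiler version and target architecture.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Hypothetical helper: add the two 32-bit lanes of one uint2 element
// (unsigned arithmetic, so the sum wraps modulo 2^32).
__host__ __device__ inline unsigned int lane_sum(uint2 v)
{
    return v.x + v.y;
}

// Each thread reads one 8-byte uint2. Because uint2 is 8-byte aligned, the
// compiler can typically fetch both lanes with a single vector load
// (for example ld.global.v2.u32 in PTX) rather than two 32-bit loads.
__global__ void sum_pairs(const uint2* __restrict__ in,
                          unsigned int* __restrict__ out,
                          size_t n)
{
    size_t i = static_cast<size_t>(blockIdx.x) * blockDim.x + threadIdx.x;
    if (i < n) {
        uint2 v = in[i];      // one 8-byte read per thread
        out[i] = lane_sum(v);
    }
}
```

A launch such as `sum_pairs<<<(n + 255) / 256, 256>>>(d_in, d_out, n)` would cover `n` elements, assuming `d_in` and `d_out` are device allocations of the right size.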
