Use uint2 (but define the temporary variable as 64-bit value: unsigned long long int) instead of arrays to be sure of alignment.
Be careful about the order of l and h.
__host__ __device__ __forceinline__ unsigned long long int hiloint2uint64(unsigned int h, unsigned int l)
{
unsigned long long int result;
uint2& src = *reinterpret_cast<uint2*>(&result);
src.x = l;
src.y = h;
return result;
}
The CUDA registers have a size of 32 bits anyway. In the best case the compiler won't need any extra code. In the worst case it has to reorder the registers by moving a 32-bit value.
Godbolt example https://godbolt.org/z/3r9WYK9e7 of how optimized it gets.