I am profiling a very dumb sorting algorithm on a small input (512 elements). I am invoking a kernel that does coalesced reads from an array of structs.
The struct looks like this:
struct __align__(8) Elements
{
    float weight;
    int value;
};
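As a sanity check (assuming a C++11-capable compiler, so static_assert is available): the struct should be exactly 8 bytes and 8-byte aligned, so a single 64-bit load per element looks possible in principle.

static_assert(sizeof(Elements) == 8, "Elements should occupy exactly 8 bytes");
static_assert(alignof(Elements) == 8, "Elements should be 8-byte aligned");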
nvprof reports the following instruction counts for L1 hits/misses and gld instructions:
Invocations    Avg    Min    Max   Event Name
Kernel: sort(Elements*)
        500      0      0      0   gld_inst_8bit
        500      0      0      0   gld_inst_16bit
        500   1024   1024   1024   gld_inst_32bit
        500      0      0      0   gld_inst_64bit
        500      0      0      0   gld_inst_128bit
        500    120    120    120   l1_global_load_hit
        500    120    120    120   l1_global_load_miss
        500      0      0      0   uncached_global_load_tr.
If I change the layout of the struct as follows:
struct __align__(8) Elements
{
    float weight;
    float value;
};
The profiling output looks like this:
Invocations    Avg    Min    Max   Event Name
Device 0
Kernel: sort(Elements*)
        500      0      0      0   gld_inst_8bit
        500      0      0      0   gld_inst_16bit
        500      0      0      0   gld_inst_32bit
        500    512    512    512   gld_inst_64bit
        500      0      0      0   gld_inst_128bit
        500      0      0      0   l1_global_load_hit
        500    120    120    120   l1_global_load_miss
        500      0      0      0   uncached_global_load_tr.
There is no impact on the execution time at all, but I don't understand why the GPU performs 32-bit load instructions in the first variant of the code and 64-bit load instructions in the second.
The kernel is invoked with 1 block and 512 threads (so the l1_global_load_x counters may be incorrect). Everything runs on a GeForce 480 with CUDA 5.0.
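For completeness, the host side is essentially the following sketch (simplified, error checking omitted; h_elements and the assumption that MAX_ELEMENTS is 512 are just for illustration):

Elements* d_elements;
cudaMalloc((void**)&d_elements, MAX_ELEMENTS * sizeof(Elements));
cudaMemcpy(d_elements, h_elements, MAX_ELEMENTS * sizeof(Elements), cudaMemcpyHostToDevice);

for (int i = 0; i < 500; ++i)               // 500 invocations, matching the nvprof output
    sort<<<1, MAX_ELEMENTS>>>(d_elements);  // 1 block x 512 threads
cudaDeviceSynchronize();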
EDIT: The sort kernel (a little shortened):
__global__ void sort(Elements* nearest)
{
    ThreadIndex idx = index();

    __shared__ Elements temp[MAX_ELEMENTS];
    __shared__ int index_cache[MAX_ELEMENTS];

    // Coalesced load from global memory, staged through shared memory.
    temp[idx.x] = nearest[idx.x];
    Elements elem = temp[idx.x];

    __syncthreads();

    int c = 0;
    // some index crunching
    nearest[idx.x] = temp[c];
}
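For what it's worth, I would expect that explicitly loading each struct as one 64-bit word forces a single gld_inst_64bit per thread in both variants. A purely illustrative sketch (not my actual kernel; it assumes Elements stays exactly 8 bytes and 8-byte aligned, which __align__(8) should guarantee):

__global__ void sort_vectorized(Elements* nearest)
{
    ThreadIndex idx = index();

    __shared__ Elements temp[MAX_ELEMENTS];

    // Reinterpret the global array as 64-bit words so each thread issues
    // one 8-byte load instead of two 4-byte loads.
    const int2* src = reinterpret_cast<const int2*>(nearest);
    int2 word = src[idx.x];
    temp[idx.x] = *reinterpret_cast<Elements*>(&word);

    __syncthreads();

    // ... same index crunching and write-back as in the original kernel ...
}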