0

I'm having trouble using atomicMin to find the minimum value in a matrix in CUDA. I suspect the problem is in the arguments I'm passing to atomicMin. The findMin kernel is the one to focus on; popMatrix only populates the matrix.

#include <stdio.h>
#include <cuda.h>
#include <curand.h>
#include <curand_kernel.h>
#define SIZE 4

__global__ void popMatrix(unsigned *matrix) {
    unsigned id, num;
    curandState_t state;
    id = threadIdx.x * blockDim.x + threadIdx.y;
    // Populate matrix with random numbers
    curand_init(id, 0, 0, &state); 
    num = curand(&state)%100;
    matrix[id] = num;

}

__global__ void findMin(unsigned *matrix, unsigned *temp) {
    unsigned id;
    id = threadIdx.x * blockDim.y + threadIdx.y;
    atomicMin(temp, matrix[id]);
    printf("old: %d, new: %d", matrix[id], temp);


}

int main() {
    dim3 block(SIZE, SIZE, 1);
    unsigned *arr, *harr, *temp;
    cudaMalloc(&arr, SIZE*SIZE*sizeof(unsigned));
    popMatrix<<<1,block>>>(arr);

    // Print matrix of random numbers to see if min number was picked right
    cudaMemcpy(harr, arr, SIZE*SIZE*sizeof(unsigned), cudaMemcpyDeviceToHost);
    for (unsigned i = 0; i < SIZE; i++) {
        for (unsigned j = 0; j < SIZE; j++) {
            printf("%d ", harr[i*SIZE+j]);
        }
        printf("\n");
    }
    temp = harr[0];
    findMin<<<1, block>>>(harr);

    
    return 0;
}
Zlorpo123

1 Answer

3

harr is not allocated. You should allocate it on the host, for example with malloc, before calling cudaMemcpy. Because harr is uninitialized, the copy has no valid destination and the values you print are garbage. It is quite surprising that the program did not segfault on your machine.

Moreover, the findMin call at the end passes harr, which (judging by its name) is meant to be a host pointer. The kernel needs device pointers to perform the atomic operation correctly. The call also passes only one argument, while findMin takes two. As a result, the current kernel call is invalid.
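A minimal sketch of these two fixes, assuming the same 4x4 single-block layout as in the question: the host buffer is allocated with malloc, and the running minimum lives in device memory so that atomicMin receives a device pointer. The names d_matrix, d_min, and findMinOnDevice are illustrative and not from the original post, and the matrix is filled on the host with arbitrary values instead of curand to keep the example short. Error checking is omitted here for brevity.

```cuda
#include <cstdio>
#include <cstdlib>
#include <climits>
#include <cassert>
#include <cuda_runtime.h>

#define SIZE 4

// Each thread folds its element into the minimum stored in device memory.
__global__ void findMin(const unsigned *matrix, unsigned *minOut) {
    unsigned id = threadIdx.x * blockDim.y + threadIdx.y;
    atomicMin(minOut, matrix[id]);
}

// Hypothetical helper: copies SIZE*SIZE host values to the device
// and returns their minimum as computed by the kernel.
unsigned findMinOnDevice(const unsigned *h_matrix) {
    const size_t bytes = SIZE * SIZE * sizeof(unsigned);
    unsigned *d_matrix = nullptr, *d_min = nullptr;
    unsigned h_min = UINT_MAX;  // identity value for min

    cudaMalloc(&d_matrix, bytes);
    cudaMalloc(&d_min, sizeof(unsigned));
    cudaMemcpy(d_matrix, h_matrix, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_min, &h_min, sizeof(unsigned), cudaMemcpyHostToDevice);

    dim3 block(SIZE, SIZE, 1);
    findMin<<<1, block>>>(d_matrix, d_min);  // both arguments are device pointers

    // This blocking copy on the default stream waits for the kernel to finish.
    cudaMemcpy(&h_min, d_min, sizeof(unsigned), cudaMemcpyDeviceToHost);
    cudaFree(d_matrix);
    cudaFree(d_min);
    return h_min;
}

int main() {
    // The host buffer must be allocated before it is used.
    unsigned *harr = (unsigned *)malloc(SIZE * SIZE * sizeof(unsigned));
    if (!harr) return 1;

    unsigned expected = UINT_MAX;
    for (unsigned i = 0; i < SIZE * SIZE; i++) {
        harr[i] = (i * 37u + 11u) % 100u;  // arbitrary test values
        if (harr[i] < expected) expected = harr[i];
    }

    unsigned result = findMinOnDevice(harr);
    printf("GPU min: %u, CPU min: %u\n", result, expected);
    assert(result == expected);

    free(harr);
    return 0;
}
```

Initializing the device-side minimum to UINT_MAX means every element can only lower it, so no thread needs to seed the result with matrix[0].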

As pointed out by @RobertCrovella, a cudaDeviceSynchronize() call is missing at the end. Moreover, you need to free your memory using cudaFree.
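As a rough illustration of the synchronization point, the sketch below launches a kernel that prints from the device, checks for launch errors, and then calls cudaDeviceSynchronize() before reading the result and freeing memory. The kernel reportMin, the helper launchAndSync, and the candidate values are hypothetical and only serve the example. Note that it prints the value returned by atomicMin (the minimum before the update) with %u, instead of passing a pointer to printf.

```cuda
#include <cstdio>
#include <cassert>
#include <cuda_runtime.h>

// Each thread proposes a candidate and prints the minimum it replaced or kept.
__global__ void reportMin(unsigned *minOut) {
    unsigned candidate = 100u - threadIdx.x;
    unsigned old = atomicMin(minOut, candidate);  // returns the previous value
    printf("candidate: %u, previous min: %u\n", candidate, old);
}

// Hypothetical helper: runs reportMin with nThreads threads, waits for it,
// copies the result back into *h_min, and returns the first error seen.
cudaError_t launchAndSync(unsigned nThreads, unsigned *h_min) {
    unsigned *d_min = nullptr;
    cudaError_t err = cudaMalloc(&d_min, sizeof(unsigned));
    if (err != cudaSuccess) return err;

    err = cudaMemcpy(d_min, h_min, sizeof(unsigned), cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        reportMin<<<1, nThreads>>>(d_min);
        err = cudaGetLastError();  // catches invalid launch configurations
    }
    if (err == cudaSuccess) {
        // Waits for the kernel; device printf output is flushed at this point.
        err = cudaDeviceSynchronize();
    }
    if (err == cudaSuccess) {
        err = cudaMemcpy(h_min, d_min, sizeof(unsigned), cudaMemcpyDeviceToHost);
    }
    cudaFree(d_min);
    return err;
}

int main() {
    unsigned h_min = 1000u;
    cudaError_t err = launchAndSync(8, &h_min);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("final min: %u\n", h_min);
    return 0;
}
```

The order of the device-side lines is not deterministic, since the threads reach atomicMin in no particular order.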

Jérôme Richard
  • 1
    also [should have](https://stackoverflow.com/questions/19193468/why-do-we-need-cudadevicesynchronize-in-kernels-with-device-printf/19193537#19193537) a `cudaDeviceSynchronize()` after that final kernel call. – Robert Crovella Apr 20 '21 at 20:43
  • @RobertCrovella Yes indeed. This makes me realize that a `cudaFree` is also missing. Thank you. – Jérôme Richard Apr 20 '21 at 21:11
  • 2
    omitting the `cudaFree` statement might not be good practice, but it is not a functional issue. Just like in host code, all device side allocations should be automatically freed on the termination of the owning host process. But without the `cudaDeviceSynchronize()` (or something similar) you may not see the kernel printout even if everything else is correct. I do believe that `cudaFree` is a blocking call in this kind of usage, so it could probably be used in lieu of `cudaDeviceSynchronize()`. – Robert Crovella Apr 20 '21 at 21:16