
I'm having a very weird problem with my program. Essentially I'm doing a matrix multiplication on part of a matrix. The program apparently runs fine on most cards, but crashes on sm_35 Kepler (GK110) cards.

The initial program was written in PyCUDA, but I've since managed to boil it down to the following minimal example written in C:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <cuda.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main(int argc, char **argv)
{
    cudaError_t status;
    cublasStatus_t status_blas;
    CUresult status_drv;
    float *A = 0;
    float *B = 0;
    float *C = 0;
    float alpha = 1.0f;
    float beta = 0.0f;
    float *oldA, *oldB, *oldC;
    cublasHandle_t handle;
    int n = 131;
    int m = 2483;
    int k = 3;
    int i;
    CUcontext ctx;

    cuInit(0);
    status_drv = cuCtxCreate(&ctx, 0, 0);
    if (status_drv != CUDA_SUCCESS) {
        fprintf(stderr, "!!!! Context creation error: %d\n", status);
        return EXIT_FAILURE;
    }
    status_blas = cublasCreate(&handle);
    if (status_blas != CUBLAS_STATUS_SUCCESS) {
        fprintf(stderr, "!!!! CUBLAS initialization error\n");
        return EXIT_FAILURE;
    }

    for (i = 0; i < 5; ++i) {
        printf("Iteration %d\n", i);
        if (cudaMalloc((void **)&B, m * k * sizeof(B[0])) != cudaSuccess) {
            fprintf(stderr, "!!!! allocation error (allocate B)\n");
            return EXIT_FAILURE;
        }
        if (cudaMalloc((void **)&C, m * m * sizeof(C[0])) != cudaSuccess) {
            fprintf(stderr, "!!!! allocation error (allocate C)\n");
            return EXIT_FAILURE;
        }
        if (cudaMalloc((void **)&A, n * m * sizeof(A[0])) != cudaSuccess) {
            fprintf(stderr, "!!!! allocation error (allocate A)\n");
            return EXIT_FAILURE;
        }
        int s = 3;
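        /* operate only on a slice of A, starting 128*m floats into the allocation */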
        float * A_slice = A + 128*m;
        status_blas = cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, m, s,
            &alpha, A_slice, m, B, k, &beta, C, m);
        if (status_blas != CUBLAS_STATUS_SUCCESS) {
            fprintf(stderr, "!!!! kernel execution error.\n");
            return EXIT_FAILURE;
        }
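        /* keep the first iteration's pointers and free them one iteration
           later, mimicking the allocation pattern of the original Python
           program */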
        if (i == 0) {
            oldA = A;
            oldB = B;
            oldC = C;
        } else if (i == 1) {
            status = cudaFree(oldA);
            if (status != cudaSuccess) {
                fprintf(stderr, "!!!! allocation error (free A, %d)\n", status);
                return EXIT_FAILURE;
            }
            if (cudaFree(oldB) != cudaSuccess) {
                fprintf(stderr, "!!!! allocation error (free B)\n");
                return EXIT_FAILURE;
            }
            if (cudaFree(oldC) != cudaSuccess) {
                fprintf(stderr, "!!!! allocation error (free C)\n");
                return EXIT_FAILURE;
            }
        }
    }
    cublasDestroy(handle);
    cuCtxDestroy(ctx);
    return 0;
}

I only free memory in the 2nd iteration of the for loop to mimic the behavior of the original Python program. The program crashes in the 2nd iteration of the for loop when it tries to free A: cudaFree returns a cudaErrorIllegalAddress error.
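
To narrow down where the illegal address actually comes from, a check like the following can be dropped in directly after the cublasSgemm call (just a sketch reusing the existing status variable; cudaDeviceSynchronize forces any asynchronous error from the gemm kernel to be reported at this point instead of at the later cudaFree):

        /* sketch: placed right after the cublasSgemm call; a kernel-side
           illegal address would be reported here rather than at cudaFree */
        status = cudaDeviceSynchronize();
        if (status != cudaSuccess) {
            fprintf(stderr, "!!!! error after sgemm: %s\n",
                    cudaGetErrorString(status));
            return EXIT_FAILURE;
        }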

Concretely, this was tried on the following cards:

  • NVS 5400M -> no issues
  • GTX560Ti -> no issues
  • Tesla S2050 -> no issues
  • unknown sm_30 card (see comments to this post) -> no issues
  • K40c -> CRASH
  • GTX 780 -> CRASH
  • K20m -> CRASH

I used a number of Linux machines with different distributions, some of them using CUDA 5.5 and some using CUDA 6.0. At least on the machines I have direct control over, all cards were using the 331 nvidia driver series.

There are several things to note here:

  • the order of the malloc calls matters: if I allocate A before B, things run fine
  • the numerical constants matter somewhat: for some values (e.g. n=30) no crash occurs, for others there is
  • the order of the free/malloc calls matters: if I free the memory in the same iteration where I allocate it, everything works just fine (see the sketch after this list)
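
For completeness, this is roughly what the working variant of the loop body looks like (a sketch with error checks omitted for brevity; the only difference is that A, B and C are freed at the end of every iteration instead of one iteration later):

    for (i = 0; i < 5; ++i) {
        printf("Iteration %d\n", i);
        cudaMalloc((void **)&B, m * k * sizeof(B[0]));
        cudaMalloc((void **)&C, m * m * sizeof(C[0]));
        cudaMalloc((void **)&A, n * m * sizeof(A[0]));
        float *A_slice = A + 128*m;
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, m, 3,
            &alpha, A_slice, m, B, k, &beta, C, m);
        /* freeing in the same iteration as the allocation avoids the crash */
        cudaFree(A);
        cudaFree(B);
        cudaFree(C);
    }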

At this point I'm pretty desperate. I don't see why or where I'm doing anything wrong. If anyone could help me, I'd really appreciate it.

EDIT: as pointed out in the comments, apparently it only fails to run on sm_35 (i.e., GK110 cards), but runs fine on sm_30 Kepler cards.

Untom
  • The code you have posted won't actually compile. The enumerated types from the Driver API and CUBLAS are not interchangeable. – talonmies Jul 10 '14 at 08:42
  • It compiles fine on my machine, using `nvcc test.c -o test -lcublas -lcuda`. What error are you getting? – Untom Jul 10 '14 at 08:44
  • Ah, now I see, I checked the `cuCtxCreate` call for `CUBLAS_STATUS_SUCCESS`. Stupid mistake on my part. However in this case it's okay since both the Driver API and CUBLAS define "0" as their "call was successful" value. I'll fix the code above. – Untom Jul 10 '14 at 08:52
  • http://pastebin.com/Aqu4YMgh - As I said, cublasStatus_t, CUresult and cudaError_t are not the same type and are not interchangeable. After I fixed that, the code runs without error for me on an sm_30 Kepler system. – talonmies Jul 10 '14 at 08:52
  • Interesting! I haven't had the chance to try this on sm_30, only on sm_35! Thanks for your tests! Odd that I don't get any compile errors, though. What setup (CUDA, gcc versions) are you using? – Untom Jul 10 '14 at 08:55
  • http://pastebin.com/yyRag7Qi (EDIT: added nvidia-smi output) – Untom Jul 10 '14 at 08:58
  • I ran that on a linux x86_64 system with CUDA 6, gcc 4.6.3, 331.62 kernel driver – talonmies Jul 10 '14 at 09:48
  • Thanks for the information. If you don't mind, which card exactly did you use? Looks like it's just the GK110 chips that fail to run this :/ – Untom Jul 10 '14 at 10:46
  • I was able to reproduce the issue, on both CUDA 6.0 and CUDA 6.5RC, on CC 3.5 devices only (tested both GK208/GT640 and K40m). I have filed a bug with NVIDIA. Note that the problem does not seem to be dependent on the driver API. All driver API framework can be removed from this code, and the problem persists. I am now able to reproduce a similar observation using the code [here](http://stackoverflow.com/questions/24535247/cublas-call-to-gemm-fails-for-some-compatible-matrix-dimensions) as well, in spite of my (earlier) comments on that question. – Robert Crovella Jul 10 '14 at 12:16
  • Thanks for taking the time to do this! :) Is the bug report by any chance publicly accessible? – Untom Jul 10 '14 at 15:04
  • It is not. But if you wish to [file your own report](https://developer.nvidia.com/), feel free to refer to bug ID 1532781. I am paying attention to it, and if I learn anything useful, I will report it back here in this SO question. – Robert Crovella Jul 10 '14 at 16:06
  • It is confirmed that this is a bug in the CUBLAS library at this time, in particular when k<8 (your k is 3 in this problem). It is the same issue as described [here](http://stackoverflow.com/questions/24535247/cublas-call-to-gemm-fails-for-some-compatible-matrix-dimensions). It is not fixed in CUDA 6.5 RC but it is expected to be fixed in the CUDA 6.5 production release package. – Robert Crovella Jul 10 '14 at 16:41

1 Answer


This issue should be fixed in the CUDA 6.5 production release package, now available for download from http://www.nvidia.com/getcuda
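
To double-check which runtime and CUBLAS libraries a given machine actually picks up, a small version query along these lines can be used (a sketch; cudaRuntimeGetVersion encodes CUDA 6.5 as 6050):

#include <stdio.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main(void)
{
    int runtime_version = 0, cublas_version = 0;
    cublasHandle_t handle;

    /* query the CUDA runtime version the binary is linked against */
    cudaRuntimeGetVersion(&runtime_version);

    /* query the CUBLAS library version through an initialized handle */
    cublasCreate(&handle);
    cublasGetVersion(handle, &cublas_version);
    cublasDestroy(handle);

    printf("CUDA runtime: %d, CUBLAS: %d\n", runtime_version, cublas_version);
    return 0;
}

Compile it the same way as the repro in the question, e.g. nvcc version_check.c -o version_check -lcublas (the file name is just an example).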

Robert Crovella