OpenCL producing QNaN on NVidia hardware

Question

I'm programming in OpenCL using the C++ bindings. I have a problem where on NVidia hardware, my OpenCL code is spontaneously producing very large numbers, and then (on the next run) a "1.#QNaN". My code is pretty much a simple physics simulation using the equation x = vt * .5at^2. The only thing I've noticed odd about it is that the velocities of the particles suddenly shoot up to about 6e+34, which I'm guessing is the maximum floating point value on that machine. However, the velocities/forces before then are quite small, often with values less than 1.

The specific GPU I'm using is the Tesla M2050, with latest drivers. I prototype on my laptop using my AMD Fusion CPU as a platform (it does not have a dedicated GPU) and the problem does not occur there. I am not sure if this is an NVidia driver problem, a problem with my computation, or something else entirely.

Here is the kernel code (Note: I'm reasonably sure mass is always nonzero):

_kernel void update_atom(__global float4 *pos, __global float4 *vel, __global float4 *force,
                        __global const float *mass, __global const float *radius, const float timestep, const int wall)
{
    // Get the index of the current element to be processed
    int i = get_global_id(0);
    float constMult;
    float4 accel;
    float4 part;

    //Update the position, velocity and force
    accel = (float4)(force[i].x/mass[i],
            force[i].y/mass[i],
            force[i].z/mass[i],
            0.0f);
    constMult = .5*timestep*timestep;
    part = (float4)(constMult*accel.x,
            constMult*accel.y,
            constMult*accel.z, 0.0f);
    pos[i] = (float4)(pos[i].x + vel[i].x*timestep + part.x,
                pos[i].y + vel[i].y*timestep + part.y,
                pos[i].z + vel[i].z*timestep + part.z,
                0.0f);
    vel[i] = (float4)(vel[i].x + accel.x*timestep,
                vel[i].y + accel.y*timestep,
                vel[i].z + accel.z*timestep,
                0.0f);
    force[i] = (float4)(force[i].x,
                force[i].y,
                force[i].z,
                0.0f);


    //Do reflections off the wall
    //http://www.3dkingdoms.com/weekly/weekly.php?a=2
    float4 norm;
    float bouncePos = wall - radius[i];
    float bounceNeg = radius[i] - wall;
    norm = (float4)(0.0f, 0.0f, 0.0f, 0.0f);
    if(pos[i].x >= bouncePos)
    {
        //Normal is unit YZ vector
        pos[i].x = bouncePos;
        norm.x = 1.0f;
    }
    else if(pos[i].x <= bounceNeg)
    {
        pos[i].x = bounceNeg;
        norm.x = -1.0f;
    }
    if(pos[i].y >= bouncePos)
    {
        //Normal is unit XZ vector
        pos[i].y = bouncePos;
        norm.y = 1.0f;
    }
    else if(pos[i].y <= bounceNeg)
    {
        //etc
        pos[i].y = bounceNeg;
        norm.y = -1.0f;
    }
    if(pos[i].z >= bouncePos)
    {
        pos[i].z = bouncePos;
        norm.z = 1.0f;
    }
    else if(pos[i].z <= bounceNeg)
    {
        pos[i].z = bounceNeg;
        norm.z = -1.0f;
    }
    float dot = 2 * (vel[i].x * norm.x + vel[i].y * norm.y + vel[i].z * norm.z);
    vel[i].x = vel[i].x - dot * norm.x;
    vel[i].y = vel[i].y - dot * norm.y;
    vel[i].z = vel[i].z - dot * norm.z;
}

And here's how I store the information into the kernel. PutData just uses the std::vector::push_back function on the positions, velocities and forces of the atoms into the corresponding vectors, and kernel is just a wrapper class I wrote for the OpenCL libraries (you can trust me that I put the right parameters into the right places for the enqueue functions).

void LoadAtoms(int kernelNum, bool blocking)
{
    std::vector<cl_float4> atomPos;
    std::vector<cl_float4> atomVel;
    std::vector<cl_float4> atomForce;
    for(int i = 0; i < numParticles; i++)
        atomList[i].PutData(atomPos, atomVel, atomForce);
    kernel.EnqueueWriteBuffer(kernelNum, posBuf, blocking, 0, numParticles*sizeof(cl_float4), &atomPos[0]);
    kernel.EnqueueWriteBuffer(kernelNum, velBuf, blocking, 0, numParticles*sizeof(cl_float4), &atomVel[0]);
    kernel.EnqueueWriteBuffer(kernelNum, forceBuf, blocking, 0, numParticles*sizeof(cl_float4), &atomForce[0]);
}

void LoadAtomTypes(int kernelNum, bool blocking)
{
    std::vector<cl_float> mass;
    std::vector<cl_float> radius;
    int type;
    for(int i = 0; i < numParticles; i++)
    {
        type = atomList[i].GetType();
        mass.push_back(atomTypes[type].mass);
        radius.push_back(atomTypes[type].radius);
    }
    kernel.EnqueueWriteBuffer(kernelNum, massBuf, blocking, 0, numParticles*sizeof(cl_float), &mass[0]);
    kernel.EnqueueWriteBuffer(kernelNum, radiusBuf, blocking, 0, numParticles*sizeof(cl_float), &radius[0]);
}

There is more to my code, as always, but this is what's related to the kernel.

I saw this question, which is similar, but I use cl_float4 everywhere I can so I don't believe it's an issue of alignment. There aren't really any other related questions.

I realize this probably isn't a simple question, but I've run out of ideas until we can get new hardware in the office to test on. Can anyone help me out?

Actually, I think the problem is in another kernel I wrote. I haven't fixed the problem yet, and problem was occurring even before I ran this other kernel, so observations are appreciated. — Chaosed0, May 31 '12 at 19:48
Just as a general coding style observation: You could make that code a lot shorter by using vector operations. It'll be easier to read and may well run faster too. e.g. `accel = (float4)(force[i].x/mass[i], force[i].y/mass[i], force[i].z/mass[i], 0.0f);` becomes `accel = force[i] / mass[i];` — Paul S, Jun 01 '12 at 08:36
Ah, I didn't even know OpenCL supported such an operation. Thanks. — Chaosed0, Jun 01 '12 at 12:59

score 0 · Accepted Answer · answered Jun 03 '12 at 23:13

Since no one answered, I suppose I'll just contribute what I've learned so far.

I don't have a definitive conclusion, but at the very least, I figured out why it was doing it so often. Since I'm running this (and other similar kernels) to figure out an estimate for the order of time, I was clearing the lists, resizing them and then re-running the calculations. What I wasn't doing, however, was resizing the buffers. This resulted in some undefined numbers getting pulled by the threads.

However, this doesn't solve the QNaNs I get from running the program over long periods of time. Those simply appear spontaneously. Maybe it's a similar issue that I'm overlooking, but I can't say. If anyone has further input on the issue, it'd be appreciated.

OpenCL producing QNaN on NVidia hardware

1 Answers1