
I don't know if I'm just overlooking something obvious, but despite due googling around I see no way to simply add a scalar to a vector (or matrix) using BLAS operations. I'm trying to do this in cuBLAS/CUDA, so I'll take any way to accomplish it within that framework. BLAS has <t>scal for scalar multiplication (cublas<t>scal), but where is the analog for addition?! I.e., something analogous to GSL's gsl_vector_add_constant. What am I missing?
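For reference, the GSL call I mean does exactly this on the CPU (a minimal sketch):

#include <gsl/gsl_vector.h>

int main(void)
{
    gsl_vector *v = gsl_vector_calloc(10);   /* v = {0, 0, ..., 0} */
    gsl_vector_add_constant(v, 5.0);         /* v[i] += 5.0 for every i */
    gsl_vector_free(v);
    return 0;
}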

Matt Phillips
  • This is a simple enough operation, why not write your own kernel? Does it have to do with using your compiler of choice rather than nvcc? – Pavan Yalamanchili Dec 27 '12 at 09:29
  • @Pavan Ha! Indeed why not? I think this is what I will probably do. I'm just getting started with CUDA/cuBLAS, I may be stuck in a mental 'box'. – Matt Phillips Dec 27 '12 at 17:02

2 Answers


Probably the only way to do what you are asking is to apply axpy with a unit vector (all ones) of the same size, scaled by the constant you want to add.

So the operation becomes X <- X + alpha * I, where I is the vector of ones, which is equivalent to adding alpha to each entry of X.


EDIT:

From the comments, it seems that you foresee some difficulty in creating the unit vector for the SAXPY call. One way to do it is to use a memset call to set the values of the unit vector directly on the device, something like this:

#include "cuda.h"
#include "cuda_runtime_api.h"
#include "cublas_v2.h"
#include <iostream>

int main(void)
{

    const int N = 10;
    const size_t sz = sizeof(float) * size_t(N);
    float *A, *I;

    float Ah[N] = { 0., 1., 2., 3., 4., 5., 6., 7., 8., 9. };

    cudaMalloc((void **)&A, sz);
    cudaMemcpy(A, &Ah[0], sz, cudaMemcpyHostToDevice);

    // this creates a bit pattern for a single precision unity value
    // and uses 32-bit memset from the driver API to set the values in the
    // vector.
    const float one = 1.0f;
    const int* one_bits = reinterpret_cast<const int*>(&one);
    cudaMalloc((void **)&I, sz);
    cuMemsetD32(CUdeviceptr(I), *one_bits, N);

    cublasHandle_t h;
    cublasCreate(&h);

    const float alpha = 5.0f;
    cublasSaxpy(h, N, &alpha, I, 1, A, 1);

    cudaMemcpy(&Ah[0], A, sz, cudaMemcpyDeviceToHost);

    for(int i=0; i<N; i++) {
        std::cout << i << " " << Ah[i] << std::endl;
    }

    cublasDestroy(h);
    cudaDeviceReset();

    return 0;
}

Note that here I have allocated and copied memory for the CUBLAS vectors using the CUDA runtime API directly, rather than using the CUBLAS helper functions (which are only very thin wrappers around the runtime API calls anyway). The "tricky" part is making a bit pattern for 1.0f and using a driver API memset function to write it to each 32-bit word of the array.
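If you would rather avoid the driver API memset entirely, an alternative (an untested sketch using the CUBLAS helper functions mentioned above) is to fill a host array with ones and copy it across with cublasSetVector:

// Untested alternative: build the ones vector on the host and copy it to
// the device buffer I; cublasSetVector is a thin wrapper over cudaMemcpy.
// (requires #include <vector>)
std::vector<float> ones(N, 1.0f);
cublasSetVector(N, sizeof(float), &ones[0], 1, I, 1);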

You could equally do the whole thing with a couple of lines of template code from the thrust library (see the sketch after the kernel below), or just write your own kernel, which could be as simple as

template<typename T>
__global__
void vector_add_constant(T * vector, const T scalar, int N)
{
    // Grid-stride loop: each thread handles elements tidx, tidx + stride, ...
    // so any launch configuration will cover the whole vector.
    int tidx = threadIdx.x + blockIdx.x*blockDim.x;
    int stride = blockDim.x * gridDim.x;

    for(; tidx < N; tidx += stride) {
        vector[tidx] += scalar;
    }
}

[disclaimer: this kernel was written in the browser and is untested. Use at your own risk]
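A launch for the N = 10 example above might be as simple as vector_add_constant<<<4, 256>>>(A, alpha, N); because the kernel uses a grid-stride loop, any reasonable launch configuration will cover the whole vector.

The thrust version mentioned above might look roughly like this (also written in the browser and untested; it uses thrust's placeholder expressions from <thrust/functional.h>):

#include <thrust/device_vector.h>
#include <thrust/sequence.h>
#include <thrust/transform.h>
#include <thrust/functional.h>
#include <iostream>

int main(void)
{
    using namespace thrust::placeholders;

    const int N = 10;
    thrust::device_vector<float> A(N);
    thrust::sequence(A.begin(), A.end());   // A = {0, 1, 2, ..., 9}

    const float alpha = 5.0f;
    // In-place transform: A[i] = A[i] + alpha for every element
    thrust::transform(A.begin(), A.end(), A.begin(), _1 + alpha);

    for(int i=0; i<N; i++) {
        std::cout << i << " " << A[i] << std::endl;
    }

    return 0;
}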

talonmies
  • Right--but a) allocate an entire vector just to do this? I'm completely baffled as to why this should be necessary. b) How do you allocate a unit vector in cuBLAS? No random access there if I'm not mistaken. Allocating it on the CPU and then moving it to the GPU adds even more unnecessary work. – Matt Phillips Dec 27 '12 at 07:59
  • @MattPhillips: Technically, the operation you are asking about is mathematically undefined, which is why there is no BLAS operation for it. Your alternative would be to write a kernel to perform the operation (preferably "fuse" several operations together into one kernel for efficiency reasons). I'll edit my answer with a CUBLAS example, including unit vector allocation, if you can wait a couple of hours – talonmies Dec 27 '12 at 08:09
  • " the operation you are asking about is mathematically undefined" What? Scalar addition? I don't understand at all what you mean, but if so then why is it in the GSL? – Matt Phillips Dec 27 '12 at 08:12
  • I am not going to write a lecture about the properties of vector spaces in comments, but there really is no such thing in linear algebra as adding a scalar to a vector. Many computer implementations allow "broadcasting" to do this, but what they are really doing is the mathematical equivalent of the axpy operation I suggested, even if they don't explicitly form the unit vector or matrix when they do the operation. – talonmies Dec 27 '12 at 08:36
  • I'm aware of the definition of a vector space, and conceptually what I want can be thought of in terms of axpy as you describe, but as far as how to *implement* it, actually creating the unit vector etc. is frankly idiotic. x86 assembly has `add` as well as `mul`, hard for me to believe that nVidia assembly doesn't have `add` as well, in which case the implementation should be a matter of iterating and adding. But how can there possibly not be a high-level function to do this... – Matt Phillips Dec 27 '12 at 08:43
  • NB I mean, it's idiotic to force the programmer to go through axpy. Otherwise I'll have to do it on the CPU, neither option is good. – Matt Phillips Dec 27 '12 at 08:46
  • The BLAS API was developed by a bunch of very clever people during the late 1970s and early 1980s who were at the forefront of computer linear algebra research at the time. If you have a complaint about the design of BLAS, take it up with them. NVIDIA and everyone else who has written their own BLAS implementation are simply following the same design. – talonmies Dec 27 '12 at 09:16
  • @MattPhillips you said "why is it in the GSL?". I've looked through the GSL [documentation](http://www.gnu.org/software/gsl/manual/html_node/index.html#Top) and am not sure what you are referring to. Which operation are you referring to in the GSL that adds a scalar to a vector in a single function call? The reason I'm asking is that I wonder if there is a particular CBLAS type call that is missing from cublas (since some of what GSL offers is a CBLAS type implementation). – Robert Crovella Dec 27 '12 at 15:35
  • @RobertCrovella `gsl_vector_add_constant` – Matt Phillips Dec 27 '12 at 16:55

Four options, ranked from best to worst:

  • Find the function you need in a different library
  • Implement the function you need yourself
  • Allocate and initialize a constant vector, use that with *axpy.
  • Although strides of zero are formally not supported by the BLAS, some implementations treat a vector with stride 0 as a "scalar" in the sense that you want; maybe cuBLAS does (see the sketch after this list). However, depending on this is a really bad idea (so bad that I strongly considered not mentioning it): the behavior is unsupported by the BLAS, your code will not be portable, and it may even be broken by future versions of the library unless NVIDIA makes a stronger API guarantee than BLAS does.
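For illustration only, the zero-stride trick would look something like the following on an implementation that happens to honor it (an untested sketch; h, A, alpha, and N are as in the CUBLAS example in the other answer, and again, this is unsupported behavior):

// UNSUPPORTED zero-stride trick: with incx = 0, axpy reads the same device
// element for every i, so the effect is A[i] += alpha * 1.0f for all i.
float *d_one;
cudaMalloc((void **)&d_one, sizeof(float));
const float one = 1.0f;
cudaMemcpy(d_one, &one, sizeof(float), cudaMemcpyHostToDevice);

cublasSaxpy(h, N, &alpha, d_one, 0, A, 1);   // may simply fail on cuBLAS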
Stephen Canon
  • What implementations treat 0-stride vectors as scalar inputs? This sounds like a very useful feature – finnw Jan 23 '13 at 16:39
  • The BLAS that ships with macOS and iOS treats zero stride vectors as scalars. Depending on the size of your input vector or matrix, 0-stride vectors are also incredibly fast. – cutsoy May 04 '17 at 15:13
  • Not related to CUDA, but Intel MKL appears to support 0-stride vectors. – Anton Samsonov Jun 01 '17 at 18:52
  • CUBLAS did not support 0-stride in the past. I reported it as a bug but I do not know if it is fixed. If it helps, this is my code that exposes the bug: https://github.com/ParRes/Kernels/blob/default/Cxx11/transpose-cublas.cu#L175 – Jeff Hammond Jul 03 '20 at 18:32