Element-by-element vector multiplication with CUDA

Question

I have build a rudimentary kernel in CUDA to do an elementwise vector-vector multiplication of two complex vectors. The kernel code is inserted below (multiplyElementwise). It works fine, but since I noticed that other seemingly straightforward operations (like scaling a vector) are optimized in libraries like CUBLAS or CULA, I was wondering if it is possible to replace my code by a library call? To my surprise, neither CUBLAS nor CULA have this option, I tried to fake it by making one of the vectors the diagonal of a diagonal matrix-vector product, but the result was really slow.

As a matter of last resort I tried to optimize this code myself (see multiplyElementwiseFast below) by loading the two vectors in shared memory and then work from there, but that was slower than my original code.

So my questions:

Is there library that does elementwise vector-vector multiplications?
If not, can I accelerate my code (multiplyElementwise)?

Any help would be greatly appreciated!

__global__ void multiplyElementwise(cufftComplex* f0, cufftComplex* f1, int size)
{
    const int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < size)
    {
        float a, b, c, d;
        a = f0[i].x; 
        b = f0[i].y;
        c = f1[i].x; 
        d = f1[i].y;
        float k;
        k = a * (c + d);
        d =  d * (a + b);
        c =  c * (b - a);
        f0[i].x = k - d;
        f0[i].y = k + c;
    }
}

__global__ void multiplyElementwiseFast(cufftComplex* f0, cufftComplex* f1, int size)
{
    const int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < 4*size)
    {
        const int N = 256;
        const int thId = threadIdx.x / 4;
        const int rem4 = threadIdx.x % 4;
        const int i4 = i / 4;

        __shared__ float a[N];
        __shared__ float b[N];
        __shared__ float c[N];
        __shared__ float d[N];
        __shared__ float Re[N];
        __shared__ float Im[N];

        if (rem4 == 0)
        {
            a[thId] = f0[i4].x;
            Re[thId] = 0.f;
        }
        if (rem4 == 1)
        {
            b[thId] = f0[i4].y;
            Im[thId] = 0.f;
        }
        if (rem4 == 2)
            c[thId] = f1[i4].x;
        if (rem4 == 0)
            d[thId] = f1[i4].y;
        __syncthreads();

        if (rem4 == 0)
            atomicAdd(&(Re[thId]), a[thId]*c[thId]);        
        if (rem4 == 1)
            atomicAdd(&(Re[thId]), -b[thId]*d[thId]);
        if (rem4 == 2)
            atomicAdd(&(Im[thId]), b[thId]*c[thId]);
        if (rem4 == 3)
            atomicAdd(&(Im[thId]), a[thId]*d[thId]);
        __syncthreads();

        if (rem4 == 0)
            f0[i4].x = Re[thId];
        if (rem4 == 1)
            f0[i4].y = Im[thId];
    }
}

"elementwise vector-vector multiplication" Do you mean a dot product? — BenC, Jun 03 '13 at 15:12
@Benc... nope. For real vectors, dot product is the sum of element wise product. — sgarizvi, Jun 03 '13 at 15:45
@sgar91: If he is "multiplying" complex numbers, he may actually want to compute a sesquilinear form which could be called inner product/dot product in this case (see [this](http://en.wikipedia.org/wiki/Dot_product#Complex_vectors)). I just wanted to make sure of his intentions. — BenC, Jun 03 '13 at 15:50

BenC · Accepted Answer · 2013-06-04T02:41:17.537

If what you are trying to achieve is a simple element-wise product with complex numbers, you do seem to be doing some extra steps in your multiplyElementwise kernel that increase register usage. What you try to compute is:

f0[i].x = a*c - b*d;
f0[i].y = a*d + b*c;

since (a + ib)*(c + id) = (a*c - b*d) + i(a*d + b*c). By using your improved complex multiplication, you're trading 1 multiplication for 3 additions and some extra registers. Whether this can be justified or not might depend on the hardware you're using. For instance, if your hardware supports FMA (Fused Multiply-Add), that kind of optimization may not be efficient. You should consider reading this document: "Precision & Performance: Floating Point and IEEE 754 Compliance for NVIDIA GPUs" which also tackles the issue of floating-point precision.

Still, you should consider using Thrust. This library offers many high-level tools to operate on both host and device vectors. You can see a long list of examples here: https://github.com/thrust/thrust/tree/master/examples. This would make your life a lot easier.

UPDATED CODE

In your case, you could use this example and adapt it to something like this:

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <time.h>

struct ElementWiseProductBasic : public thrust::binary_function<float2,float2,float2>
{
    __host__ __device__
    float2 operator()(const float2& v1, const float2& v2) const
    {
        float2 res;
        res.x = v1.x * v2.x - v1.y * v2.y;
        res.y = v1.x * v2.y + v1.y * v2.x;
        return res;
    }
};

/**
 * See: http://www.embedded.com/design/embedded/4007256/Digital-Signal-Processing-Tricks--Fast-multiplication-of-complex-numbers%5D
 */
struct ElementWiseProductModified : public thrust::binary_function<float2,float2,float2>
{
    __host__ __device__
    float2 operator()(const float2& v1, const float2& v2) const
    {
        float2 res;
        float a, b, c, d, k;
        a = v1.x;
        b = v1.y;
        c = v2.x;
        d = v2.y;
        k = a * (c + d);
        d =  d * (a + b);
        c =  c * (b - a);
        res.x = k -d;
        res.y = k + c;
        return res;
    }
};

int get_random_int(int min, int max)
{
    return min + (rand() % (int)(max - min + 1));
}

thrust::host_vector<float2> init_vector(const size_t N)
{
    thrust::host_vector<float2> temp(N);
    for(size_t i = 0; i < N; i++)
    {
        temp[i].x = get_random_int(0, 10);
        temp[i].y = get_random_int(0, 10);
    }
    return temp;
}

int main(void)
{
    const size_t N = 100000;
    const bool compute_basic_product    = true;
    const bool compute_modified_product = true;

    srand(time(NULL));

    thrust::host_vector<float2>   h_A = init_vector(N);
    thrust::host_vector<float2>   h_B = init_vector(N);
    thrust::device_vector<float2> d_A = h_A;
    thrust::device_vector<float2> d_B = h_B;

    thrust::host_vector<float2> h_result(N);
    thrust::host_vector<float2> h_result_modified(N);

    if (compute_basic_product)
    {
        thrust::device_vector<float2> d_result(N);

        thrust::transform(d_A.begin(), d_A.end(),
                          d_B.begin(), d_result.begin(),
                          ElementWiseProductBasic());
        h_result = d_result;
    }

    if (compute_modified_product)
    {
        thrust::device_vector<float2> d_result_modified(N);

        thrust::transform(d_A.begin(), d_A.end(),
                          d_B.begin(), d_result_modified.begin(),
                          ElementWiseProductModified());
        h_result_modified = d_result_modified;
    }

    std::cout << std::fixed;
    for (size_t i = 0; i < 4; i++)
    {
        float2 a = h_A[i];
        float2 b = h_B[i];

        std::cout << "(" << a.x << "," << a.y << ")";
        std::cout << " * ";
        std::cout << "(" << b.x << "," << b.y << ")";

        if (compute_basic_product)
        {
            float2 prod = h_result[i];
            std::cout << " = ";
            std::cout << "(" << prod.x << "," << prod.y << ")";
        }

        if (compute_modified_product)
        {
            float2 prod_modified = h_result_modified[i];
            std::cout << " = ";
            std::cout << "(" << prod_modified.x << "," << prod_modified.y << ")";
        }
        std::cout << std::endl;
    }   

    return 0;
}

This returns:

(6.000000,5.000000)  * (0.000000,1.000000)  = (-5.000000,6.000000)
(3.000000,2.000000)  * (0.000000,4.000000)  = (-8.000000,12.000000)
(2.000000,10.000000) * (10.000000,4.000000) = (-20.000000,108.000000)
(4.000000,8.000000)  * (10.000000,9.000000) = (-32.000000,116.000000)

You can then compare the timings of the two different multiplication strategies and choose what's best with your hardware.

Thank you very much, I'll look into Thrust to see if I can make it work for me. The reason that the complex product is calculated in such a "backward" way is that I read here [http://www.embedded.com/design/embedded/4007256/Digital-Signal-Processing-Tricks--Fast-multiplication-of-complex-numbers] that a processor has less trouble with additions than with multiplications, so it might be a bit quicker. It is still a bit strange though that a scaling of a vector, or a sum of two vectors is readily found in many libraries, but an elementwise multiplication is not... Anyway, thanks for the answer! — WVDB, Jun 03 '13 at 20:50
@WVDB: I see, I guess it is worth comparing, but that kind of optimization is really hardware dependent, and you need to keep in mind that the compiler may reorder operations if you are using compiler optimization. You can compare the timings with `nvprof`/`nvvp` or even the PTX code generated to have a better idea of what's really going on. — BenC, Jun 04 '13 at 00:36
@WVDB: I updated my answer to make comparison possible with NVIDIA's profiling tools. — BenC, Jun 04 '13 at 02:22
@ BenC Currently I am trying to profile my code with Nsight. Thanks for the solid advice once more! — WVDB, Jun 04 '13 at 08:18

score 0 · Answer 2 · answered Jan 25 '15 at 16:02

You can use cublasZdgmm.

cublasStatus_t cublasZdgmm(cublasHandle_t handle, cublasSideMode_t mode,
                      int m, int n,
                      const cuDoubleComplex *A, int lda,
                      const cuDoubleComplex *x, int incx,
                      cuDoubleComplex *C, int ldc)

Element-by-element vector multiplication with CUDA

2 Answers2

UPDATED CODE

Linked