I have a Thrust device_vector divided into chunks of 100 elements (all contiguous in GPU memory), and I want to remove the last 5 elements of each chunk without having to allocate a new device_vector to copy the result into.

// Layout in memory before (number of elements in each contiguous subblock listed):
// [   95   | 5 ][   95   | 5 ][   95   | 5 ]........

// Layout in memory after cutting out the last 5 of each chunk (number of elements listed)
// [  95  ][  95  ][  95  ].........

thrust::device_vector v;
// call some function on v;

// so elements 95-99, 195-199, 295-299, etc. are removed (assuming 0-based indexing)

How can I correctly implement this? Preferably I would like to avoid allocating a new vector in GPU memory to save the transform into. I understand there are Thrust template functions for dealing with these kinds of operations, but I have trouble stringing them together. Is there something Thrust provides that can do this?

user1522407
1 Answer

Without a separate buffer, the copy has to preserve a specific order so that no element is overwritten before it has been read, and that ordering prevents the copy from being fully parallelized on the GPU.
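To make the ordering constraint concrete, here is a minimal host-side sketch of an in-place compaction. It uses std::vector rather than a device_vector, and compactInPlace is a hypothetical helper name, not something Thrust provides. It is only meant to show why the chunks have to be moved one after another.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of Thrust): compacts v in place on the host,
// keeping the first newChunk elements of every oldChunk-sized chunk.
void compactInPlace(std::vector<float>& v, std::size_t oldChunk, std::size_t newChunk)
{
    assert(oldChunk > 0 && newChunk <= oldChunk && v.size() % oldChunk == 0);
    const std::size_t chunks = v.size() / oldChunk;

    // Chunks are processed in ascending order: the destination of chunk k can
    // cover the original positions of earlier chunks, so those chunks must
    // already have been moved before chunk k is written.
    for (std::size_t k = 1; k < chunks; ++k) {
        // The destination starts before the source, so a forward copy is safe
        // even when the two ranges overlap.
        std::copy(v.begin() + k * oldChunk,
                  v.begin() + k * oldChunk + newChunk,
                  v.begin() + k * newChunk);
    }
    v.resize(chunks * newChunk);
}
```

Moving the chunks in a different order (for example chunk 20 before chunk 19) would overwrite data that has not been read yet, which is the dependency that a single parallel copy cannot respect.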

Here is a Thrust version that uses a separate buffer.

It requires Thrust 1.6.0 or later, because placeholder (lambda-style) expressions are used as the functor of the transform iterators.

#include "thrust/device_vector.h"
#include "thrust/iterator/counting_iterator.h"
#include "thrust/iterator/permutation_iterator.h"
#include "thrust/iterator/transform_iterator.h"
#include "thrust/copy.h"
#include "thrust/functional.h"

using namespace thrust::placeholders;

int main()
{
    const int oldChunk = 100, newChunk = 95;  // chunk length before and after trimming
    const int size = 10000;                   // number of chunks

    thrust::device_vector<float> v(
            thrust::counting_iterator<float>(0),
            thrust::counting_iterator<float>(0) + oldChunk * size);
    thrust::device_vector<float> buf(newChunk * size);

    // Gather element i / newChunk * oldChunk + i % newChunk of v into buf[i].
    thrust::copy(
            thrust::make_permutation_iterator(
                    v.begin(),
                    thrust::make_transform_iterator(
                            thrust::counting_iterator<int>(0),
                            _1 / newChunk * oldChunk + _1 % newChunk)),
            thrust::make_permutation_iterator(
                    v.begin(),
                    thrust::make_transform_iterator(
                            thrust::counting_iterator<int>(0),
                            _1 / newChunk * oldChunk + _1 % newChunk))
                    + buf.size(),
            buf.begin());

    return 0;
}
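
The index arithmetic in the transform iterator can be checked on the host. The following sketch mirrors the placeholder expression with a plain function. sourceIndex is a hypothetical name used only for illustration.

```cpp
#include <cassert>

// Hypothetical helper mirroring the placeholder expression
// _1 / newChunk * oldChunk + _1 % newChunk from the listing above.
int sourceIndex(int i, int oldChunk, int newChunk)
{
    assert(i >= 0 && newChunk > 0 && newChunk <= oldChunk);
    const int chunk  = i / newChunk;   // which chunk the output element belongs to
    const int offset = i % newChunk;   // position inside that chunk
    return chunk * oldChunk + offset;  // skip the dropped tail of earlier chunks
}
```

With oldChunk = 100 and newChunk = 95, output element 94 reads input element 94, output element 95 reads input element 100, and output element 190 reads input element 200.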

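I think the above version may not achieve the highest performance because of the modulo operator %. For higher performance you may consider the cuBLAS function cublasSgeam() (the single-precision variant of cublas<t>geam()). Viewing v as a column-major matrix with oldChunk rows and size columns, the call below copies the first newChunk rows of every column into buf.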

float alpha = 1;
float beta = 0;
cublasSgeam(handle, CUBLAS_OP_N, CUBLAS_OP_N,
            newChunk, size,
            &alpha,
            thrust::raw_pointer_cast(&v[0]), oldChunk,
            &beta,
            thrust::raw_pointer_cast(&v[0]), oldChunk,
            thrust::raw_pointer_cast(&buf[0]), newChunk);
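
For reference, the effect of that call with alpha = 1 and beta = 0 can be written as a plain host-side loop. This is only a sketch of the semantics, not a cuBLAS call, and copySubmatrix is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side reference: copies the top `rows` rows of a
// column-major matrix with leading dimension lda into a column-major
// matrix with leading dimension ldc.
std::vector<float> copySubmatrix(const std::vector<float>& a,
                                 std::size_t rows, std::size_t cols,
                                 std::size_t lda, std::size_t ldc)
{
    assert(rows <= lda && rows <= ldc && a.size() >= cols * lda);
    std::vector<float> c(cols * ldc);
    for (std::size_t col = 0; col < cols; ++col) {      // one column per chunk
        for (std::size_t row = 0; row < rows; ++row) {  // kept elements of the chunk
            c[col * ldc + row] = a[col * lda + row];
        }
    }
    return c;
}
```

Here rows corresponds to newChunk, cols to size, lda to oldChunk and ldc to newChunk, so every column holds one chunk and the last 5 rows of each column are never read.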
kangshiyin