
All the examples perform scans on arrays whose size is a multiple of 32. The quickest examples use 256 or more threads, with 4 or more elements assigned to each thread.

This means that if I had an array of size 450, then presumably I would have to pad it out to 512 and use 256 threads with 2 elements assigned to each.

However, in my particular case, it is not feasible to pad out every array.

Is there an alternative solution to handle multiple oddly sized arrays? Is there a way to somehow specify a width?


OK, let's be clearer. This is a simplified example. Say I have two arrays: the first is simply a list of integer offsets into the second, which contains the data. Each offset indicates the beginning of a separate set of data.

[figure: two arrays, one holding offsets into the second]

Each set of data is randomly sized. I get the data as a chunk from some other process, so there is no easy way to pad them. I want to run BlockScan on each segment, starting at each offset, from the same kernel.

jpreed00

1 Answer


Let your index (offset) array be idx[], let your data array be A[], and let the result of the scan be in B[].

  1. Scan the whole array A[], storing the output in B[].

  2. For each segment start idx[i] (with i > 0), take the value at index idx[i]-1 in B[], subtract from it the value at index idx[i-1]-1 in B[] (treating B[-1] as zero); then subtract that difference from the element at index idx[i] (not minus 1) in A[].

  3. Rescan A[] into B[].

As a simple example:

idx: 0 2 5

step 0 (original A):    1  1  1  1  1  1  1  1
step 1 (scan A -> B):   1  2  3  4  5  6  7  8
step 2 (fixed-up A):    1  1 -1  1  1 -2  1  1
step 3 (rescan -> B):   1  2  1  2  3  1  2  3

In the above example, the -1 in step 2 is computed as the step-1 scan value at index (2-1) minus the step-1 scan value at index (0-1) (taken to be zero), with that difference then subtracted from the original data value. Likewise, the -2 is computed as the step-1 scan value at index (5-1) minus the step-1 scan value at index (2-1), subtracted from the original data value.
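
For reference, here is a minimal host-side sketch of the three steps (plain C++, illustrative only), which reproduces the table above:

#include <vector>
#include <cstdio>

int main(){
  std::vector<int> A   = {1, 1, 1, 1, 1, 1, 1, 1};
  std::vector<int> idx = {0, 2, 5};               // segment start offsets
  std::vector<int> B(A.size());
  // step 1: full inclusive scan of A into B
  int run = 0;
  for (size_t i = 0; i < A.size(); i++) B[i] = (run += A[i]);
  // step 2: at each segment start, subtract the previous segment's sum
  for (size_t i = 1; i < idx.size(); i++){
    int prev = (i == 1) ? 0 : B[idx[i-1]-1];      // B[-1] treated as zero
    A[idx[i]] -= (B[idx[i]-1] - prev);
  }
  // step 3: rescan the fixed-up A into B
  run = 0;
  for (size_t i = 0; i < A.size(); i++) B[i] = (run += A[i]);
  for (int v : B) printf("%d ", v);               // prints: 1 2 1 2 3 1 2 3
  printf("\n");
}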

Here is a full CUDA example using BlockScan:

$ cat t453.cu
#include <cub/cub.cuh>
#include <iostream>

template <int TPB, int IPT, typename T>
__global__ void k(T *data, int *idx, int n){

    // Shared staging area: the lower TPB*IPT slots hold the working data,
    // the upper TPB*IPT slots hold the results of the first scan
    __shared__ T sdata[TPB*IPT*2];
    sdata[threadIdx.x*IPT] = 1;  // test data: all ones
    __syncthreads();
    // Specialize BlockScan for a 1D block of TPB threads on type T
    typedef cub::BlockScan<T, TPB> BlockScan;
    // Allocate shared memory for BlockScan
    __shared__ typename BlockScan::TempStorage temp_storage;
    // Obtain a segment of consecutive items that are blocked across threads
    int thread_data[IPT];
    thread_data[0] = sdata[threadIdx.x*IPT];
    // Collectively compute the block-wide inclusive prefix sum (step 1)
    BlockScan(temp_storage).InclusiveSum(thread_data, thread_data);
    __syncthreads();
    sdata[IPT*(threadIdx.x+TPB)] = thread_data[0];  // stash first-pass results in the upper half
    __syncthreads();  // make all first-pass results visible before the fixup
    // step 2: at each segment start idx[i] (i > 0), subtract the previous
    // segment's sum; idx[0] is assumed to point to element 0
    if ((threadIdx.x < n) && (threadIdx.x > 0))
      sdata[idx[threadIdx.x]*IPT] -= (sdata[((idx[threadIdx.x]-1)+TPB)*IPT] - ((threadIdx.x == 1)?0:sdata[((idx[threadIdx.x-1]-1)+TPB)*IPT]));
    __syncthreads();
    // step 3: rescan the fixed-up data
    thread_data[0] = sdata[threadIdx.x*IPT];
    BlockScan(temp_storage).InclusiveSum(thread_data, thread_data);
    __syncthreads();
    data[threadIdx.x] = thread_data[0];
}

typedef int dtype;
const int nTPB = 256;

int main(){
  int h_idx[] = {0, 4, 7, 32, 55, 99, 104, 200};
  int n = sizeof(h_idx)/sizeof(h_idx[0]);
  std::cout << "n = " << n << std::endl;
  int *d_idx;
  cudaMalloc(&d_idx, n*sizeof(d_idx[0]));
  cudaMemcpy(d_idx, h_idx, n*sizeof(h_idx[0]), cudaMemcpyHostToDevice);
  dtype *h_data, *d_data;
  h_data = new dtype[nTPB];
  cudaMalloc(&d_data, nTPB*sizeof(dtype));
  k<nTPB, 1><<<1,nTPB>>>(d_data, d_idx, n);
  cudaMemcpy(h_data, d_data, nTPB*sizeof(dtype), cudaMemcpyDeviceToHost);
  dtype sum;
  int idx = 0;
  for (int i = 0; i < nTPB; i++){
    if ((idx < n) && (i == h_idx[idx])) {sum = 0; idx++;}  // guard: don't read past h_idx
    sum++;
    std::cout << "gpu: " << h_data[i] << " cpu: " << sum << std::endl;
  }
}
$ nvcc -o t453 t453.cu
$ cuda-memcheck ./t453
========= CUDA-MEMCHECK
n = 8
gpu: 1 cpu: 1
gpu: 2 cpu: 2
gpu: 3 cpu: 3
gpu: 4 cpu: 4
gpu: 1 cpu: 1
gpu: 2 cpu: 2
gpu: 3 cpu: 3
gpu: 1 cpu: 1
gpu: 2 cpu: 2
gpu: 3 cpu: 3
gpu: 4 cpu: 4
gpu: 5 cpu: 5
gpu: 6 cpu: 6
gpu: 7 cpu: 7
gpu: 8 cpu: 8
gpu: 9 cpu: 9
gpu: 10 cpu: 10
gpu: 11 cpu: 11
gpu: 12 cpu: 12
gpu: 13 cpu: 13
gpu: 14 cpu: 14
gpu: 15 cpu: 15
gpu: 16 cpu: 16
gpu: 17 cpu: 17
gpu: 18 cpu: 18
gpu: 19 cpu: 19
gpu: 20 cpu: 20
gpu: 21 cpu: 21
gpu: 22 cpu: 22
gpu: 23 cpu: 23
gpu: 24 cpu: 24
gpu: 25 cpu: 25
gpu: 1 cpu: 1
gpu: 2 cpu: 2
gpu: 3 cpu: 3
gpu: 4 cpu: 4
gpu: 5 cpu: 5
gpu: 6 cpu: 6
gpu: 7 cpu: 7
gpu: 8 cpu: 8
gpu: 9 cpu: 9
gpu: 10 cpu: 10
gpu: 11 cpu: 11
gpu: 12 cpu: 12
gpu: 13 cpu: 13
gpu: 14 cpu: 14
gpu: 15 cpu: 15
gpu: 16 cpu: 16
gpu: 17 cpu: 17
gpu: 18 cpu: 18
gpu: 19 cpu: 19
gpu: 20 cpu: 20
gpu: 21 cpu: 21
gpu: 22 cpu: 22
gpu: 23 cpu: 23
gpu: 1 cpu: 1
gpu: 2 cpu: 2
gpu: 3 cpu: 3
gpu: 4 cpu: 4
gpu: 5 cpu: 5
gpu: 6 cpu: 6
gpu: 7 cpu: 7
gpu: 8 cpu: 8
gpu: 9 cpu: 9
gpu: 10 cpu: 10
gpu: 11 cpu: 11
gpu: 12 cpu: 12
gpu: 13 cpu: 13
gpu: 14 cpu: 14
gpu: 15 cpu: 15
gpu: 16 cpu: 16
gpu: 17 cpu: 17
gpu: 18 cpu: 18
gpu: 19 cpu: 19
gpu: 20 cpu: 20
gpu: 21 cpu: 21
gpu: 22 cpu: 22
gpu: 23 cpu: 23
gpu: 24 cpu: 24
gpu: 25 cpu: 25
gpu: 26 cpu: 26
gpu: 27 cpu: 27
gpu: 28 cpu: 28
gpu: 29 cpu: 29
gpu: 30 cpu: 30
gpu: 31 cpu: 31
gpu: 32 cpu: 32
gpu: 33 cpu: 33
gpu: 34 cpu: 34
gpu: 35 cpu: 35
gpu: 36 cpu: 36
gpu: 37 cpu: 37
gpu: 38 cpu: 38
gpu: 39 cpu: 39
gpu: 40 cpu: 40
gpu: 41 cpu: 41
gpu: 42 cpu: 42
gpu: 43 cpu: 43
gpu: 44 cpu: 44
gpu: 1 cpu: 1
gpu: 2 cpu: 2
gpu: 3 cpu: 3
gpu: 4 cpu: 4
gpu: 5 cpu: 5
gpu: 1 cpu: 1
gpu: 2 cpu: 2
gpu: 3 cpu: 3
gpu: 4 cpu: 4
gpu: 5 cpu: 5
gpu: 6 cpu: 6
gpu: 7 cpu: 7
gpu: 8 cpu: 8
gpu: 9 cpu: 9
gpu: 10 cpu: 10
gpu: 11 cpu: 11
gpu: 12 cpu: 12
gpu: 13 cpu: 13
gpu: 14 cpu: 14
gpu: 15 cpu: 15
gpu: 16 cpu: 16
gpu: 17 cpu: 17
gpu: 18 cpu: 18
gpu: 19 cpu: 19
gpu: 20 cpu: 20
gpu: 21 cpu: 21
gpu: 22 cpu: 22
gpu: 23 cpu: 23
gpu: 24 cpu: 24
gpu: 25 cpu: 25
gpu: 26 cpu: 26
gpu: 27 cpu: 27
gpu: 28 cpu: 28
gpu: 29 cpu: 29
gpu: 30 cpu: 30
gpu: 31 cpu: 31
gpu: 32 cpu: 32
gpu: 33 cpu: 33
gpu: 34 cpu: 34
gpu: 35 cpu: 35
gpu: 36 cpu: 36
gpu: 37 cpu: 37
gpu: 38 cpu: 38
gpu: 39 cpu: 39
gpu: 40 cpu: 40
gpu: 41 cpu: 41
gpu: 42 cpu: 42
gpu: 43 cpu: 43
gpu: 44 cpu: 44
gpu: 45 cpu: 45
gpu: 46 cpu: 46
gpu: 47 cpu: 47
gpu: 48 cpu: 48
gpu: 49 cpu: 49
gpu: 50 cpu: 50
gpu: 51 cpu: 51
gpu: 52 cpu: 52
gpu: 53 cpu: 53
gpu: 54 cpu: 54
gpu: 55 cpu: 55
gpu: 56 cpu: 56
gpu: 57 cpu: 57
gpu: 58 cpu: 58
gpu: 59 cpu: 59
gpu: 60 cpu: 60
gpu: 61 cpu: 61
gpu: 62 cpu: 62
gpu: 63 cpu: 63
gpu: 64 cpu: 64
gpu: 65 cpu: 65
gpu: 66 cpu: 66
gpu: 67 cpu: 67
gpu: 68 cpu: 68
gpu: 69 cpu: 69
gpu: 70 cpu: 70
gpu: 71 cpu: 71
gpu: 72 cpu: 72
gpu: 73 cpu: 73
gpu: 74 cpu: 74
gpu: 75 cpu: 75
gpu: 76 cpu: 76
gpu: 77 cpu: 77
gpu: 78 cpu: 78
gpu: 79 cpu: 79
gpu: 80 cpu: 80
gpu: 81 cpu: 81
gpu: 82 cpu: 82
gpu: 83 cpu: 83
gpu: 84 cpu: 84
gpu: 85 cpu: 85
gpu: 86 cpu: 86
gpu: 87 cpu: 87
gpu: 88 cpu: 88
gpu: 89 cpu: 89
gpu: 90 cpu: 90
gpu: 91 cpu: 91
gpu: 92 cpu: 92
gpu: 93 cpu: 93
gpu: 94 cpu: 94
gpu: 95 cpu: 95
gpu: 96 cpu: 96
gpu: 1 cpu: 1
gpu: 2 cpu: 2
gpu: 3 cpu: 3
gpu: 4 cpu: 4
gpu: 5 cpu: 5
gpu: 6 cpu: 6
gpu: 7 cpu: 7
gpu: 8 cpu: 8
gpu: 9 cpu: 9
gpu: 10 cpu: 10
gpu: 11 cpu: 11
gpu: 12 cpu: 12
gpu: 13 cpu: 13
gpu: 14 cpu: 14
gpu: 15 cpu: 15
gpu: 16 cpu: 16
gpu: 17 cpu: 17
gpu: 18 cpu: 18
gpu: 19 cpu: 19
gpu: 20 cpu: 20
gpu: 21 cpu: 21
gpu: 22 cpu: 22
gpu: 23 cpu: 23
gpu: 24 cpu: 24
gpu: 25 cpu: 25
gpu: 26 cpu: 26
gpu: 27 cpu: 27
gpu: 28 cpu: 28
gpu: 29 cpu: 29
gpu: 30 cpu: 30
gpu: 31 cpu: 31
gpu: 32 cpu: 32
gpu: 33 cpu: 33
gpu: 34 cpu: 34
gpu: 35 cpu: 35
gpu: 36 cpu: 36
gpu: 37 cpu: 37
gpu: 38 cpu: 38
gpu: 39 cpu: 39
gpu: 40 cpu: 40
gpu: 41 cpu: 41
gpu: 42 cpu: 42
gpu: 43 cpu: 43
gpu: 44 cpu: 44
gpu: 45 cpu: 45
gpu: 46 cpu: 46
gpu: 47 cpu: 47
gpu: 48 cpu: 48
gpu: 49 cpu: 49
gpu: 50 cpu: 50
gpu: 51 cpu: 51
gpu: 52 cpu: 52
gpu: 53 cpu: 53
gpu: 54 cpu: 54
gpu: 55 cpu: 55
gpu: 56 cpu: 56
========= ERROR SUMMARY: 0 errors
$

This still requires you to pad the "end" of your array to the threadblock size. I'm assuming that should be possible based on your description; it's basically necessary for cub anyway, since cub expects to use every thread in your threadblock.

For larger arrays, the above method could be extended in a straightforward fashion to use DeviceScan: step 1 becomes the first device-wide scan, step 2 becomes a separate fixup kernel launch, and step 3 is the second device-wide scan.
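
A minimal sketch of that DeviceScan variant might look like the following (the fixup kernel, its launch configuration, and the segmented_sum wrapper are illustrative assumptions, not tested code):

#include <cub/cub.cuh>

// step 2 as a standalone kernel: at each segment start idx[i] (i > 0),
// subtract the previous segment's sum, read from the first-pass scan in B
__global__ void fixup(int *A, const int *B, const int *idx, int n){
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if ((i > 0) && (i < n))
    A[idx[i]] -= (B[idx[i]-1] - ((i == 1) ? 0 : B[idx[i-1]-1]));
}

void segmented_sum(int *d_A, int *d_B, int *d_idx, int num_items, int n){
  // query the required temp storage size, then allocate it
  void *d_temp = NULL; size_t temp_bytes = 0;
  cub::DeviceScan::InclusiveSum(d_temp, temp_bytes, d_A, d_B, num_items);
  cudaMalloc(&d_temp, temp_bytes);
  // step 1: device-wide scan A -> B
  cub::DeviceScan::InclusiveSum(d_temp, temp_bytes, d_A, d_B, num_items);
  // step 2: fix up the segment starts in A using B
  fixup<<<(n+255)/256, 256>>>(d_A, d_B, d_idx, n);
  // step 3: rescan A -> B; B now holds the per-segment inclusive sums
  cub::DeviceScan::InclusiveSum(d_temp, temp_bytes, d_A, d_B, num_items);
  cudaFree(d_temp);
}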

If you want to have each threadblock perform a scan on a segment, you don't need to pad each segment. You only need to pad the "end" of the array so that the last scan will be OK, and even this "pad" operation can be accomplished with a conditional load, instead of an actual pad operation. Here's an example:

$ cat t455.cu
#include <cub/cub.cuh>
#include <iostream>

template <int TPB, int IPT, typename T>
__global__ void k(T *data, int *idx){
    int lidx = threadIdx.x;
    // Specialize BlockScan for a 1D block of TPB threads on type T
    typedef cub::BlockScan<T, TPB> BlockScan;
    // Allocate shared memory for BlockScan
    __shared__ typename BlockScan::TempStorage temp_storage;
    // Conditional load, one item per thread: threads past the end of this
    // block's segment load 0, which does not perturb the inclusive sum
    int thread_data[IPT];
    thread_data[0] = ((lidx+idx[blockIdx.x])>=idx[blockIdx.x+1])?0:data[lidx+idx[blockIdx.x]];
    // Collectively compute the block-wide inclusive prefix sum
    BlockScan(temp_storage).InclusiveSum(thread_data, thread_data);
    __syncthreads();
    // write back only within the segment; results in the padded tail are discarded
    if ((lidx+idx[blockIdx.x]) < idx[blockIdx.x+1])
      data[lidx+idx[blockIdx.x]] = thread_data[0];
}

typedef int dtype;
const int nTPB = 128; // sized with IPT to handle the largest segment
const int DS = 256;
int main(){
  int h_idx[] = {0, 4, 7, 32, 55, 99, 104, 200, 256}; // final entry marks the end of the data
  int n = sizeof(h_idx)/sizeof(h_idx[0]);
  std::cout << "n = " << n << std::endl;
  int *d_idx;
  cudaMalloc(&d_idx, n*sizeof(d_idx[0]));
  cudaMemcpy(d_idx, h_idx, n*sizeof(h_idx[0]), cudaMemcpyHostToDevice);
  dtype *h_data, *d_data;
  h_data = new dtype[DS];
  for (int i = 0; i < DS; i++) h_data[i] = 1;
  cudaMalloc(&d_data, DS*sizeof(dtype));
  cudaMemcpy(d_data, h_data, DS*sizeof(h_data[0]), cudaMemcpyHostToDevice);
  k<nTPB, 1><<<n-1,nTPB>>>(d_data, d_idx); // one threadblock per segment
  cudaMemcpy(h_data, d_data, DS*sizeof(dtype), cudaMemcpyDeviceToHost);
  dtype sum;
  int idx = 0;
  for (int i = 0; i < DS; i++){
    if (i == h_idx[idx]) {sum = 0; idx++;}
    sum++;
    std::cout << "gpu: " << h_data[i] << " cpu: " << sum << std::endl;
  }
}
$ nvcc -o t455 t455.cu
$ cuda-memcheck ./t455
========= CUDA-MEMCHECK
n = 9
gpu: 1 cpu: 1
gpu: 2 cpu: 2
gpu: 3 cpu: 3
gpu: 4 cpu: 4
gpu: 1 cpu: 1
gpu: 2 cpu: 2
gpu: 3 cpu: 3
gpu: 1 cpu: 1
gpu: 2 cpu: 2
gpu: 3 cpu: 3
gpu: 4 cpu: 4
gpu: 5 cpu: 5
gpu: 6 cpu: 6
gpu: 7 cpu: 7
gpu: 8 cpu: 8
gpu: 9 cpu: 9
gpu: 10 cpu: 10
gpu: 11 cpu: 11
gpu: 12 cpu: 12
gpu: 13 cpu: 13
gpu: 14 cpu: 14
gpu: 15 cpu: 15
gpu: 16 cpu: 16
gpu: 17 cpu: 17
gpu: 18 cpu: 18
gpu: 19 cpu: 19
gpu: 20 cpu: 20
gpu: 21 cpu: 21
gpu: 22 cpu: 22
gpu: 23 cpu: 23
gpu: 24 cpu: 24
gpu: 25 cpu: 25
gpu: 1 cpu: 1
gpu: 2 cpu: 2
gpu: 3 cpu: 3
gpu: 4 cpu: 4
gpu: 5 cpu: 5
gpu: 6 cpu: 6
gpu: 7 cpu: 7
gpu: 8 cpu: 8
gpu: 9 cpu: 9
gpu: 10 cpu: 10
gpu: 11 cpu: 11
gpu: 12 cpu: 12
gpu: 13 cpu: 13
gpu: 14 cpu: 14
gpu: 15 cpu: 15
gpu: 16 cpu: 16
gpu: 17 cpu: 17
gpu: 18 cpu: 18
gpu: 19 cpu: 19
gpu: 20 cpu: 20
gpu: 21 cpu: 21
gpu: 22 cpu: 22
gpu: 23 cpu: 23
gpu: 1 cpu: 1
gpu: 2 cpu: 2
gpu: 3 cpu: 3
gpu: 4 cpu: 4
gpu: 5 cpu: 5
gpu: 6 cpu: 6
gpu: 7 cpu: 7
gpu: 8 cpu: 8
gpu: 9 cpu: 9
gpu: 10 cpu: 10
gpu: 11 cpu: 11
gpu: 12 cpu: 12
gpu: 13 cpu: 13
gpu: 14 cpu: 14
gpu: 15 cpu: 15
gpu: 16 cpu: 16
gpu: 17 cpu: 17
gpu: 18 cpu: 18
gpu: 19 cpu: 19
gpu: 20 cpu: 20
gpu: 21 cpu: 21
gpu: 22 cpu: 22
gpu: 23 cpu: 23
gpu: 24 cpu: 24
gpu: 25 cpu: 25
gpu: 26 cpu: 26
gpu: 27 cpu: 27
gpu: 28 cpu: 28
gpu: 29 cpu: 29
gpu: 30 cpu: 30
gpu: 31 cpu: 31
gpu: 32 cpu: 32
gpu: 33 cpu: 33
gpu: 34 cpu: 34
gpu: 35 cpu: 35
gpu: 36 cpu: 36
gpu: 37 cpu: 37
gpu: 38 cpu: 38
gpu: 39 cpu: 39
gpu: 40 cpu: 40
gpu: 41 cpu: 41
gpu: 42 cpu: 42
gpu: 43 cpu: 43
gpu: 44 cpu: 44
gpu: 1 cpu: 1
gpu: 2 cpu: 2
gpu: 3 cpu: 3
gpu: 4 cpu: 4
gpu: 5 cpu: 5
gpu: 1 cpu: 1
gpu: 2 cpu: 2
gpu: 3 cpu: 3
gpu: 4 cpu: 4
gpu: 5 cpu: 5
gpu: 6 cpu: 6
gpu: 7 cpu: 7
gpu: 8 cpu: 8
gpu: 9 cpu: 9
gpu: 10 cpu: 10
gpu: 11 cpu: 11
gpu: 12 cpu: 12
gpu: 13 cpu: 13
gpu: 14 cpu: 14
gpu: 15 cpu: 15
gpu: 16 cpu: 16
gpu: 17 cpu: 17
gpu: 18 cpu: 18
gpu: 19 cpu: 19
gpu: 20 cpu: 20
gpu: 21 cpu: 21
gpu: 22 cpu: 22
gpu: 23 cpu: 23
gpu: 24 cpu: 24
gpu: 25 cpu: 25
gpu: 26 cpu: 26
gpu: 27 cpu: 27
gpu: 28 cpu: 28
gpu: 29 cpu: 29
gpu: 30 cpu: 30
gpu: 31 cpu: 31
gpu: 32 cpu: 32
gpu: 33 cpu: 33
gpu: 34 cpu: 34
gpu: 35 cpu: 35
gpu: 36 cpu: 36
gpu: 37 cpu: 37
gpu: 38 cpu: 38
gpu: 39 cpu: 39
gpu: 40 cpu: 40
gpu: 41 cpu: 41
gpu: 42 cpu: 42
gpu: 43 cpu: 43
gpu: 44 cpu: 44
gpu: 45 cpu: 45
gpu: 46 cpu: 46
gpu: 47 cpu: 47
gpu: 48 cpu: 48
gpu: 49 cpu: 49
gpu: 50 cpu: 50
gpu: 51 cpu: 51
gpu: 52 cpu: 52
gpu: 53 cpu: 53
gpu: 54 cpu: 54
gpu: 55 cpu: 55
gpu: 56 cpu: 56
gpu: 57 cpu: 57
gpu: 58 cpu: 58
gpu: 59 cpu: 59
gpu: 60 cpu: 60
gpu: 61 cpu: 61
gpu: 62 cpu: 62
gpu: 63 cpu: 63
gpu: 64 cpu: 64
gpu: 65 cpu: 65
gpu: 66 cpu: 66
gpu: 67 cpu: 67
gpu: 68 cpu: 68
gpu: 69 cpu: 69
gpu: 70 cpu: 70
gpu: 71 cpu: 71
gpu: 72 cpu: 72
gpu: 73 cpu: 73
gpu: 74 cpu: 74
gpu: 75 cpu: 75
gpu: 76 cpu: 76
gpu: 77 cpu: 77
gpu: 78 cpu: 78
gpu: 79 cpu: 79
gpu: 80 cpu: 80
gpu: 81 cpu: 81
gpu: 82 cpu: 82
gpu: 83 cpu: 83
gpu: 84 cpu: 84
gpu: 85 cpu: 85
gpu: 86 cpu: 86
gpu: 87 cpu: 87
gpu: 88 cpu: 88
gpu: 89 cpu: 89
gpu: 90 cpu: 90
gpu: 91 cpu: 91
gpu: 92 cpu: 92
gpu: 93 cpu: 93
gpu: 94 cpu: 94
gpu: 95 cpu: 95
gpu: 96 cpu: 96
gpu: 1 cpu: 1
gpu: 2 cpu: 2
gpu: 3 cpu: 3
gpu: 4 cpu: 4
gpu: 5 cpu: 5
gpu: 6 cpu: 6
gpu: 7 cpu: 7
gpu: 8 cpu: 8
gpu: 9 cpu: 9
gpu: 10 cpu: 10
gpu: 11 cpu: 11
gpu: 12 cpu: 12
gpu: 13 cpu: 13
gpu: 14 cpu: 14
gpu: 15 cpu: 15
gpu: 16 cpu: 16
gpu: 17 cpu: 17
gpu: 18 cpu: 18
gpu: 19 cpu: 19
gpu: 20 cpu: 20
gpu: 21 cpu: 21
gpu: 22 cpu: 22
gpu: 23 cpu: 23
gpu: 24 cpu: 24
gpu: 25 cpu: 25
gpu: 26 cpu: 26
gpu: 27 cpu: 27
gpu: 28 cpu: 28
gpu: 29 cpu: 29
gpu: 30 cpu: 30
gpu: 31 cpu: 31
gpu: 32 cpu: 32
gpu: 33 cpu: 33
gpu: 34 cpu: 34
gpu: 35 cpu: 35
gpu: 36 cpu: 36
gpu: 37 cpu: 37
gpu: 38 cpu: 38
gpu: 39 cpu: 39
gpu: 40 cpu: 40
gpu: 41 cpu: 41
gpu: 42 cpu: 42
gpu: 43 cpu: 43
gpu: 44 cpu: 44
gpu: 45 cpu: 45
gpu: 46 cpu: 46
gpu: 47 cpu: 47
gpu: 48 cpu: 48
gpu: 49 cpu: 49
gpu: 50 cpu: 50
gpu: 51 cpu: 51
gpu: 52 cpu: 52
gpu: 53 cpu: 53
gpu: 54 cpu: 54
gpu: 55 cpu: 55
gpu: 56 cpu: 56
========= ERROR SUMMARY: 0 errors
$
Robert Crovella
  • Thanks for posting a response. The data I am dealing with is very large. An approach like this might be feasible, but I would still have to break up the data in some fashion. That might be the only course of action, though -- perform a BlockScan and then fix the overage. – jpreed00 Apr 27 '19 at 14:12
  • You specifically asked about using `BlockScan`. That only works at the threadblock level. And the actual numbers you gave in your description were sensible for block-sized work (450/512). I can only work with the information you provide. An equivalent method can be done with [`DeviceScan`](https://nvlabs.github.io/cub/structcub_1_1_device_scan.html), where you would do a device wide scan, then fixup the result (via a kernel launch) using the exact same process as described here, then rescan. – Robert Crovella Apr 27 '19 at 14:16
  • Yeah, the individual arrays are appropriate for BlockScan, which is why I was hoping there was some manner in which to specify a width or limit. The whole set of data is probably too large. Performing a DeviceScan and then launching a kernel to fix them might be feasible, too. – jpreed00 Apr 27 '19 at 14:27
  • If you just want to have your threadblock walk through an array, doing an individual scan on each segment of your data, that is a trivial adaptation of the standard sample code. cub blockscan only works with a thread array that is local to each thread anyway, and you must load that thread array, so just do a conditional load based on the length of the segment. You don't really even have to do a conditional load anyway. Doing a segment scan on extra data is not going to corrupt your results. Just be careful when you write the data back to only write the length of the segment. – Robert Crovella Apr 27 '19 at 14:28
  • Oh! That's interesting, I had not considered that. But you are obviously correct. That will probably work. – jpreed00 Apr 27 '19 at 14:35
  • I've updated my answer showing how to do a block scan per segment, without any padding. – Robert Crovella Apr 27 '19 at 14:51