
I have two contiguous ranges (pointer + size), one on the GPU and one on the CPU, and I want to compare them for equality.

What is the canonical way to compare these ranges for equality?

my_cpu_type cpu;  // cpu.data() returns double*
my_gpu_type gpu;  // gpu.data() returns thrust::cuda::pointer<double>

thrust::equal(cpu.data(), cpu.data() + cpu.size(), gpu.data());

gives an illegal memory access. I also tried

thrust::equal(
   thrust::cuda::par // also thrust::host
   , cpu.data(), cpu.data() + cpu.size(), gpu.data()
);
alfC

2 Answers


In the general case, you can't do it the way you are imagining with thrust. Thrust does not execute algorithms across mixed backends: you must either use the device backend, in which case all data needs to be on the device (or accessible from device code; see below), or the host backend, in which case all data needs to be on the host.

Therefore you will be forced to copy the data from one side to the other. The cost should be similar either way (copying the host array to the device, or the device array to the host), so we prefer to copy to the device, since the device comparison can be faster.

If you have the luxury of having the host array be in a pinned buffer, then it will be possible to do something like what you are suggesting.
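
As a rough illustrative sketch of that pinned route (the helper name equal_pinned is made up here, and it assumes a 64-bit UVA system where memory pinned via cudaHostRegister is readable from device code through the same pointer):

#include <thrust/equal.h>
#include <thrust/execution_policy.h>
#include <thrust/device_ptr.h>
#include <cuda_runtime.h>

// Illustrative helper (assumption: pinned host memory is device-accessible
// under UVA): compare a host range against a device range using the device
// backend, without a staging copy.
bool equal_pinned(double* host, thrust::device_ptr<double> dev, size_t n) {
    // pin the existing host allocation so device code can read it directly
    cudaHostRegister(host, n * sizeof(double), cudaHostRegisterDefault);
    bool result = thrust::equal(thrust::device, host, host + n, dev);
    cudaHostUnregister(host);
    return result;
}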

For the general case, something like this should work:

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/equal.h>

thrust::host_vector<double>   cpu(size);   // size: element count of each range
thrust::device_vector<double> gpu(size);

// stage the host data on the device, then compare entirely on the device
thrust::device_vector<double> d_cpu = cpu;
bool are_equal = thrust::equal(d_cpu.begin(), d_cpu.end(), gpu.begin());
Robert Crovella
  • Yes, I understand. After posting the question, I started wondering if it is the philosophy of the library not to mix operations (e.g. compare) between different memory spaces, and `copy` is just a necessary exception (I wonder if there are others); otherwise it would be impossible to do anything useful with GPU and CPU. – alfC Mar 17 '22 at 03:04
  • On the other hand, for my particular data structure, I have good experience with `cudaHostRegister` and I am tempted to use it for mixed algorithms. Yes, it works for POD values only, but it covers many, many use cases. This is for multidimensional arrays that, because of striding, may have a low density of holes in otherwise contiguous memory blocks. I wonder if I am overusing them and if there is a cost to it beyond the obvious. Maybe pinning memory interferes with allocations (on the host or on the device), or something like that. – alfC Mar 17 '22 at 03:08
  • A remaining question, I guess, is why thrust lets me compile code that will later fail at runtime, like `equal` comparing ranges in different memory locations. I guess thrust doesn't want to make assumptions about the location of raw pointers... instead of assuming that they are in the CPU. Is there a way to specify that certain memory is in the CPU via the type system? – alfC Mar 17 '22 at 03:27
  • If you have raw pointers, then you would specify an execution policy of `thrust::host` to indicate the algorithm should be performed using the host backend, or `thrust::device` to indicate the algorithm should be performed using the device backend. I'm fairly certain that if you pass raw pointers associated with pinned host memory, a thrust algorithm should work even if you specify `thrust::device`. That would be a potential way to "mix" things. If you use raw pointers and omit any execution policy, thrust assumes the pointers refer to host memory and dispatches the op to the host backend. – Robert Crovella Mar 17 '22 at 03:58
  • So, isn't there a `thrust::host::pointer` type that can trigger (deduce) the `thrust::host` policy? That would be good for genericity. – alfC Mar 17 '22 at 04:02
  • Yes, pinned host memory (through allocation or registration) effectively allows "mixing". I wonder if `cudaHostRegister` could be made part of `thrust` to allow (dynamic) systematic mixing (I guess it would have to be recursive and see through non-POD types). – alfC Mar 17 '22 at 04:03
  • thrust has an [experimental pinned allocator](https://stackoverflow.com/questions/25064383/how-to-asynchronously-copy-memory-from-the-host-to-the-device-using-thrust-and-c/25068346#25068346), which can be used e.g. for creation of a "pinned host vector". – Robert Crovella Mar 17 '22 at 04:07
  • Thanks for the pointer, I am using it already for when copying a temporary buffer is the only remaining option to "mix" or even copy complicated data structures. – alfC Mar 17 '22 at 04:09
  • As I mentioned, raw pointers get automatically interpreted/"deduced" as host pointers, so they trigger [the thrust::host policy](https://thrust.github.io/doc/group__execution__policies.html). To do the opposite, there is an explicit `thrust::device_ptr` (both dispatch paths are sketched just below this thread). – Robert Crovella Mar 17 '22 at 04:11
  • Robert, your suggestion assumes there's enough memory on the GPU for a copy of the CPU-side vector. A safer assumption is that the GPU range can fit into main system memory. – einpoklum Mar 18 '22 at 13:28
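
To make the dispatch rules from this comment thread concrete, here is a small illustrative sketch: raw pointers with no policy dispatch to the host backend, while thrust::device_ptr tags a range as device memory and dispatches to the device backend:

#include <thrust/equal.h>
#include <thrust/copy.h>
#include <thrust/device_ptr.h>
#include <thrust/device_malloc.h>
#include <thrust/device_free.h>

int main() {
    const size_t n = 4;
    double a[] = {1, 2, 3, 4};
    double b[] = {1, 2, 3, 4};

    // raw pointers, no policy: thrust assumes host memory (host backend)
    bool host_equal = thrust::equal(a, a + n, b);

    // thrust::device_ptr tags the ranges as device memory (device backend)
    thrust::device_ptr<double> da = thrust::device_malloc<double>(n);
    thrust::device_ptr<double> db = thrust::device_malloc<double>(n);
    thrust::copy(a, a + n, da);  // copy is the cross-space exception
    thrust::copy(b, b + n, db);
    bool device_equal = thrust::equal(da, da + n, db);

    thrust::device_free(da);
    thrust::device_free(db);
    return (host_equal && device_equal) ? 0 : 1;
}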

In addition to Robert's valid answer, I would claim you are following the wrong path in trying to employ C++-STL-like code where GPU computation is involved.

The issue is not merely that of where the pointers point. Something like std::equal is inherently synchronous: even if its implementation involves parallelism, the assumption is still of a computation which starts ASAP, blocks the calling thread, and returns a result to that thread so it can continue its work. While it's possible this is what you want, I would guess that in most cases it probably isn't. I believe thrust's approach, of making developers feel as though they're writing "C++ STL code, but with the GPU", is (mostly) misguided.

If there had been some integration of GPU task graphs, the C++ future/async/promise mechanism, and perhaps something like taskflow or other frameworks, that might have somehow become more of a "canonical" way to do this.
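
As a taste of what that direction looks like with today's tools, here is a hedged sketch using thrust's asynchronous algorithms (available since Thrust 1.9.4); the helper equal_async is made up for illustration, and it phrases equality as a non-blocking logical-AND reduction that returns a future:

#include <thrust/async/reduce.h>
#include <thrust/device_vector.h>
#include <thrust/functional.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/tuple.h>

struct pair_equal {
    __host__ __device__
    bool operator()(thrust::tuple<double, double> t) const {
        return thrust::get<0>(t) == thrust::get<1>(t);
    }
};

// Illustrative: launch the comparison and return a future instead of
// blocking the calling thread.
auto equal_async(const thrust::device_vector<double>& a,
                 const thrust::device_vector<double>& b) {
    auto first = thrust::make_transform_iterator(
        thrust::make_zip_iterator(thrust::make_tuple(a.begin(), b.begin())),
        pair_equal{});
    auto last = first + a.size();
    // logical-AND reduction over element-wise "do these match?" flags
    return thrust::async::reduce(first, last, true,
                                 thrust::logical_and<bool>());
}

// usage: auto fut = equal_async(d_a, d_b);  /* ...do other work... */
//        bool are_equal = fut.get();        // block only when needed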

einpoklum
  • Thank you; the reason I wanted `equal` (involving the GPU) was for debugging. Your point is that you rarely need to compare for equality without a task to follow immediately, no? – alfC Mar 18 '22 at 16:24
  • You seem to have a completely different approach to GPU computation, but I don't understand what the building blocks of your approach are. Is it to write dependent tasks, usually defined as lambdas? Is there a chance that these tasks could be written in terms of thrust algorithms (or equivalent)? What would you do in this case? Do you think `thrust::equal` should really be an asynchronous (CPU) function? Or something beyond that? I wouldn't mind looking at how to implement these common tasks with your library. – alfC Mar 18 '22 at 16:35
  • I think the difference is between production code, which has to run as fast as possible (otherwise you would not use the GPU), and unit tests during development, which can run slowly and for which you are willing to block and wait for intermediate results, but which should be as easy to use as a scripting language. – Sebastian Mar 18 '22 at 18:46
  • @alfC: My point, or one of my points, is that when you're ordering things to happen on the GPU with a bunch of buffers, they haven't happened yet, so it doesn't make sense to say "Possibly interrupting other work in the middle, and ignoring the fact that the contents of these buffers are a work in progress, get them to me and compare them. Right now." – einpoklum Mar 18 '22 at 20:01