
I'm really not sure if this is the right place to ask. I'm interested in the programming models of different types of hardware.

It started like this: I was presenting some work I had done with NVIDIA CUDA. I was telling people that one of the main issues with using a GPU as a coprocessor is the fact that you have to transfer data between the host and the GPU. Several people then proceeded to question me about the AMD "APUs", where the graphics cores are on the same die as the regular CPU cores.

I dodged the questions by pointing out that the Intel/AMD CPU+GPU chips will never contain as many graphics cores as the dedicated NVIDIA cards.

The thing is, I don't really know what the programming models are for the AMD APUs or the Intel Sandy/Ivy Bridge chips.

My questions are:

  1. How are programs written to take advantage of graphics cores on the AMD/Intel chips?
  2. Can these graphics cores really access host memory directly?
  3. Is there any information about the performance of these chips, in single-precision (SP) and double-precision (DP) FLOPS?
  4. Coming from CUDA, what similarities can be found between programming for NVIDIA GPUs and the other chips in question?
  5. How did the Cell processor's SPEs access memory, and how does its programming model compare to that of these Intel/AMD chips today?
sj755
  • "The Intel/AMD CPU+GPU chips will never contain as many graphics cores as the dedicated NVIDIA cards." Never? Maybe not today, but there is no technical reason why a GPU integrated into a CPU could not have the same number of cores as a dedicated GPU. – vocaro Nov 28 '11 at 21:16

3 Answers


How are programs written to take advantage of graphics cores on the AMD/Intel chips?

OpenCL, but I don't think Intel has done the work needed to target its integrated graphics cores yet.

Can these graphics cores really access host memory directly?

Yes, but there are a couple of caveats.

  1. Whilst the bandwidth to host memory is better than over PCI-e, it's not as high as what a discrete GPU has to its graphics memory (a 3-4x difference).
  2. OpenCL might require its own copy of the data in some circumstances. For a discrete GPU this copy has to happen (host memory -> graphics memory); for an APU you want to make sure it doesn't. As I understand it, this pretty much comes down to how you allocate your buffers.
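To illustrate point 2, here is a sketch of the kind of buffer allocation that avoids the extra copy on an APU. This is OpenCL host-API C, assuming a valid `context` and `queue` already exist and `nbytes` and `err` are declared; whether it is truly zero-copy depends on the vendor's runtime, so treat it as one plausible approach rather than a guarantee.

```c
/* CL_MEM_ALLOC_HOST_PTR asks the runtime to place the buffer in
 * host-visible memory, which on an APU lets the GPU read it in place;
 * clEnqueueMapBuffer then gives the CPU a pointer without a copy. */
cl_mem buf = clCreateBuffer(context,
                            CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR,
                            nbytes, NULL, &err);
float *host_view = (float *)clEnqueueMapBuffer(queue, buf, CL_TRUE,
                                               CL_MAP_WRITE, 0, nbytes,
                                               0, NULL, NULL, &err);
/* ... fill host_view, then unmap before launching the kernel ... */
clEnqueueUnmapMemObject(queue, buf, host_view, 0, NULL, NULL);
```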

Basically, you've changed the terms of the compromise. It used to be that the start-up cost (copying data to graphics memory) was significant enough that work items had to be big to make sending something to the GPU worthwhile. That cost has now come down (no copy), but the performance of the cores is lower (there are fewer of them, and memory bandwidth is lower).

It's an interesting development which probably makes GPGPU techniques worthwhile in more situations, though without such huge gains. The gains can still be large, though.

Is there any information about the kind of performance of these chips, in SP and DP FLOPS?

I'm loath to repeat marketing numbers, but an AMD A8-3850 has a headline figure of 480 GFLOPS.

Coming from CUDA, what similarities can be found between programming for NVIDIA GPUs and the other chips in question?

I've not used CUDA, so someone else may want to answer, but my understanding is that CUDA and OpenCL share a lot of the same concepts (memory models, kernels, etc.), though CUDA does bring some things to the party that OpenCL doesn't (C++ features, for example).

Then there are architectural differences between Nvidia and AMD, the main one being that Nvidia's cores are scalar and AMD's are vector, so to get the best performance on AMD hardware you need to write vectorized code.

Paul S

I only have experience with CUDA, so this answer is based on that experience plus some quick searching (I wanted to know some of the answers too).

  1. I think they are written the same way. You can use OpenCL on all of them, and even though there are differences in the hardware implementations, they follow the same principles.

  2. I don't know how it is for AMD and Intel, but I would say yes. You can do it with CUDA: using mapped page-locked host memory, you can access host memory directly from a kernel. NVIDIA even recommends using memory this way on integrated NVIDIA systems (section 5.3.1 of the CUDA C Programming Guide).

  3. Yes. For Intel, check the Intel HD Graphics DirectX Developer's Guide (Sandy Bridge), page 11 (125 GFLOPS max for the Intel HD 3000). For AMD, a figure is listed on the specifications page of each card, for example the AMD Radeon HD 6990. You can probably find a comparison somewhere.

  4. As I said, I think the programming models are similar. OpenCL also has the notion of a kernel, host and device memory, and identifiers for threads and work-groups (just some examples). To maximize performance you need to know something about the specific architecture, but you can approach them all in similar ways.

  5. No idea...
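The mapped page-locked memory mentioned in point 2 looks roughly like this with the CUDA runtime API. This is a sketch, not a complete program: `nbytes` is assumed to be declared, error checking is omitted, and on a discrete card the "device pointer" still reads host RAM over PCI-e, so the no-copy benefit mainly applies to integrated parts.

```c
/* Mapped page-locked (zero-copy) host memory: the kernel gets a device
 * pointer that aliases host RAM, so no explicit cudaMemcpy is needed. */
float *h_data, *d_data;
cudaSetDeviceFlags(cudaDeviceMapHost);           /* before any allocation */
cudaHostAlloc((void **)&h_data, nbytes, cudaHostAllocMapped);
cudaHostGetDevicePointer((void **)&d_data, h_data, 0);
/* ... launch a kernel that dereferences d_data directly ... */
cudaFreeHost(h_data);
```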

jmsu

I have done work in OpenCL with big data.

How are programs written to take advantage of graphics cores on the AMD/Intel chips?

OpenCL is a low-level programming model that works in heterogeneous environments. It is built to use all the computational resources in a system: CPUs, GPUs, APUs, FPGAs, etc. OpenCL programs contain functions called kernels, which can run on GPUs as well as on CPU cores.
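For a concrete idea of what a kernel looks like, here is a minimal sketch in OpenCL C (the kernel name and arguments are invented for illustration). The same source can be compiled at runtime for a CPU, a discrete GPU, or an APU's graphics cores just by selecting a different device:

```c
/* Each work-item handles one array element, identified by its global ID. */
__kernel void scale(__global float *data, float factor) {
    size_t i = get_global_id(0);
    data[i] *= factor;
}
```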

Although Intel is mostly known for its processors rather than its GPUs, it has also been providing GPUs for a pretty long time now, like the Intel GMA and later the Intel HD Graphics.