Questions tagged [gpgpu]

GPGPU is an acronym for the field of computer science known as "General Purpose computing on the Graphics Processing Unit (GPU)".

The two biggest manufacturers of GPUs are NVIDIA and AMD, although Intel has recently been moving in this direction with its Haswell APUs. There are two popular frameworks for GPGPU: NVIDIA's CUDA, which is supported only on NVIDIA's own hardware, and OpenCL, developed by the Khronos Group, a consortium that includes AMD, NVIDIA, Intel, Apple, and others. The OpenCL standard, however, is only half-heartedly supported by NVIDIA, so the rivalry among GPU manufacturers is partly mirrored in a rivalry between the programming frameworks.

The attractiveness of using GPUs for other tasks largely stems from the parallel processing capabilities of many modern graphics cards: some cards contain thousands of stream processors operating on similar data at very high rates.

In the past, CPUs emulated multiple threads and data streams by interleaving processing tasks on a single core. Over time we gained multiple cores, each capable of running multiple threads. Modern video cards go much further: they host far more concurrent threads or streams than typical CPUs, tightly integrated with extremely fast memory. This huge increase in threads in execution is achieved through SIMD (Single Instruction, Multiple Data), in which one instruction is applied to many data elements at once. The result is an environment uniquely suited to heavy computational loads that can be parallelized. This design also marks one of the main differences between GPUs and CPUs: each does best what it was designed for.

More information at http://en.wikipedia.org/wiki/GPGPU

2243 questions
7
votes
1 answer

Allocating memory for data used by MTLBuffer in iOS Metal

As a follow-up question to this answer: I am trying to replace a for-loop running on the CPU with a kernel function in Metal to parallelize computation and speed up performance. My function is basically a convolution. Since I repeatedly receive new data…
Maxi Mus
  • 795
  • 1
  • 6
  • 20
7
votes
4 answers

Is it possible to write OpenCL kernels in C++ rather than C?

I understand there's an OpenCL C++ API, but I'm having trouble compiling my kernels... do the kernels have to be written in C? And is it just the host code that's allowed to be written in C++? Or is there some way to write the kernels in C++…
Elliot Gorokhovsky
  • 3,610
  • 2
  • 31
  • 56
7
votes
1 answer

Misaligned address in CUDA

Can anyone tell me what's wrong with the following code inside a CUDA kernel: __constant__ unsigned char MT[256] = { 0xde, 0x6f, 0x6f, 0xb1, 0xde, 0x6f, 0x6f, 0xb1, 0x91, 0xc5, 0xc5, 0x54, 0x91, 0xc5, 0xc5, 0x54,....}; typedef unsinged int…
Rezaeimh7
  • 1,467
  • 2
  • 23
  • 40
7
votes
2 answers

Use OpenCL on AMD APU but use discrete GPU for the X server

Is it possible to enable OpenCL on an A10-7800 without using it for the X server? I have a Linux box that I use for GPGPU programming. A discrete GeForce 740 card is used both for the X server and for running the OpenCL & CUDA programs I develop. I would…
Brad
  • 861
  • 5
  • 11
7
votes
4 answers

CUDA 7.5 installation: Unsupported compiler error

I just tried installing CUDA 7.5 on my laptop. I disabled lightdm and did sudo sh cuda7.5.run. The driver installation passed, but then I got an error Unsupported compiler ... and the installation failed. How can I resolve this issue?
Amir
  • 10,600
  • 9
  • 48
  • 75
7
votes
2 answers

What will happen to the allocated memory on GPU, after the application using it exits, if cudaFree() was not used?

If cudaFree() is not used at the end, will the memory in use automatically be freed after the application/kernel function using it exits?
user4785313
7
votes
1 answer

Is there an algorithm for sorting an array of strings on a GPU?

The array to sort has approximately one million strings, where every string can be up to one million characters long. I am looking for any implementation of a sorting algorithm for the GPU. I have a block of data with size approximately 1MB and I need to…
Kentzo
  • 3,881
  • 29
  • 54
7
votes
4 answers

Fast Fourier transforms on GPU on iOS

I am implementing compute-intensive applications for iOS (i.e., iPhone or iPad) that heavily use fast Fourier transforms (and some signal processing operations such as interpolation and resampling). What are the best libraries and APIs that allow…
JustDoIt
  • 409
  • 4
  • 14
7
votes
2 answers

Why use SIMD if we have GPGPU?

Now that we have GPGPUs with languages like CUDA and OpenCL, do the multimedia SIMD extensions (SSE/AVX/NEON) still serve a purpose? I read an article recently about how SSE instructions could be used to accelerate sorting networks. I thought this…
jonfrazen1
  • 87
  • 1
  • 2
7
votes
1 answer

How to generate, compile and run CUDA kernels at runtime

Well, I have quite a delicate question :) Let's start with what I have: Data, a large array of data, copied to the GPU; a Program, generated by the CPU (host), which needs to be evaluated for every element in that array. The program changes very frequently, can be…
teejay
  • 2,353
  • 2
  • 27
  • 36
7
votes
1 answer

Create a dynamic local array inside an OpenCL kernel

I have an OpenCL kernel that needs to process an array as multiple sub-arrays, where each sub-array's sum is saved in a local cache array. For example, imagine the following array: [[1, 2, 3, 4], [10, 30, 1, 23]] Each work-group gets an array (in the example…
jbatista
  • 964
  • 2
  • 11
  • 26
7
votes
5 answers

Does CUDA automatically load-balance for you?

I'm hoping for some general advice and clarification on best practices for load balancing in CUDA C, in particular: If one thread in a warp takes longer than the other 31, will it hold up the other 31 from completing? If so, will the spare processing…
mchen
  • 9,808
  • 17
  • 72
  • 125
7
votes
1 answer

PCI-e lane allocation on 2-GPU cards?

The data rate of cudaMemcpy operations is heavily influenced by the number of PCI-e 3.0 (or 2.0) lanes that are allocated to run from the CPU to GPU. I'm curious about how PCI-e lanes are used on Nvidia devices containing two GPUs. Nvidia has a few…
solvingPuzzles
  • 8,541
  • 16
  • 69
  • 112
7
votes
1 answer

PTX "bit bucket" registers

...are just mentioned in the PTX manual. There is no hint about what they are good for or how to use them. Does anyone know more? Am I just missing a common concept?
Dude
  • 583
  • 2
  • 9
7
votes
1 answer

Array of vectors using Thrust

Is it possible to create an array of device_vectors using Thrust? I know I can't create a device_vector of a device_vector, but how would I create an array of device_vectors?
Manolete
  • 3,431
  • 7
  • 54
  • 92