Questions tagged [gpgpu]

GPGPU is an acronym for the field of computer science known as "General Purpose computing on the Graphics Processing Unit (GPU)"

The two biggest manufacturers of discrete GPUs are NVIDIA and AMD, although Intel has also been moving in this direction, for example with its Haswell APUs. There are two popular frameworks for GPGPU: NVIDIA's CUDA, which is supported only on its own hardware, and OpenCL, developed by the Khronos Group, a consortium that includes AMD, NVIDIA, Intel, Apple and others. The OpenCL standard is, however, only half-heartedly supported by NVIDIA, so the rivalry among GPU manufacturers is partly mirrored in a rivalry between the programming frameworks.

The attractiveness of using GPUs for other tasks largely stems from the parallel processing capabilities of many modern graphics cards: some cards have thousands of stream processors working on similar data at very high rates.

In the past, CPUs emulated multiple threads and data streams by interleaving processing tasks on a single core. Over time, CPUs gained multiple cores, each running multiple hardware threads. Modern video cards go much further: a single GPU hosts far more concurrent threads or streams than a typical CPU, together with very fast on-board memory. This huge number of threads in flight is achieved through SIMD (Single Instruction, Multiple Data) execution, in which one instruction is applied to many data elements at once. The result is an environment uniquely suited to heavy computational loads that can be parallelized, and this execution model also marks one of the main differences between GPUs and CPUs: each does best what it was designed for.
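The SIMD idea can be sketched in a few lines of plain Python. This is only a toy model of the programming style, not of real GPU hardware or performance; the function name is made up for illustration:

```python
# A toy model of SIMD execution: one operation applied across many
# data lanes at once. Real GPUs run thousands of such lanes in
# hardware; this sketch only illustrates the programming model.

def simd_apply(op, lanes_a, lanes_b):
    """Apply the same instruction `op` to every pair of lanes."""
    return [op(a, b) for a, b in zip(lanes_a, lanes_b)]

a = [1, 2, 3, 4]
b = [10, 20, 30, 40]
result = simd_apply(lambda x, y: x + y, a, b)  # one "add", four lanes
print(result)  # [11, 22, 33, 44]
```

The key point is that the control flow (one `op`) is shared across all lanes, while the data differs per lane.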

More information at http://en.wikipedia.org/wiki/GPGPU

2243 questions
9
votes
3 answers

Generating random number within Cuda kernel in a varying range

I am trying to generate random numbers within a CUDA kernel. I wish to generate the random numbers from a uniform distribution and in integer form, starting from 1 up to 8. The random numbers would be different for each of the…
duttasankha
  • 717
  • 2
  • 10
  • 32
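The usual scaling trick for a question like the one above can be sketched in plain Python, with `random.random()` standing in for a per-thread device generator such as curand's uniform sampler (the function name here is made up):

```python
import random

def uniform_int(rng, lo, hi):
    """Map a uniform [0, 1) sample to an integer in [lo, hi] inclusive.

    In a CUDA kernel the sample would come from the per-thread RNG
    state instead; the scaling arithmetic is the same. Note that some
    device generators sample from (0, 1] rather than [0, 1), in which
    case the result should additionally be clamped to hi.
    """
    u = rng.random()                     # uniform in [0, 1)
    return lo + int(u * (hi - lo + 1))   # uniform over lo..hi

rng = random.Random(42)
samples = [uniform_int(rng, 1, 8) for _ in range(1000)]
assert all(1 <= s <= 8 for s in samples)
```

Each GPU thread would keep its own generator state so that threads produce independent streams.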
9
votes
1 answer

How to debug OpenCL on Nvidia GPUs?

Is there any way to debug OpenCL kernels on an Nvidia GPU, i.e. set breakpoints and inspect variables? My understanding is that Nvidia's tool does not allow OpenCL debugging, and AMD's and Intel's only allow it on their own devices.
1''
  • 26,823
  • 32
  • 143
  • 200
9
votes
2 answers

Is it possible to bind an OpenCV GpuMat as an OpenGL texture?

I haven't been able to find any reference except for: http://answers.opencv.org/question/9512/how-to-bind-gpumat-to-texture/ which discusses a CUDA approach. Ideally I'd like to update an OpenGL texture with the contents of a cv::gpu::GpuMat without…
Elliot Woods
  • 834
  • 11
  • 20
9
votes
3 answers

GPU Accelerated XML Parsing

I need to improve the performance of a piece of software that parses XML files and adds their contents to a large SQL Database. I have been trying to find information about whether or not it is possible to implement this on a GPU. My research…
Catachan
  • 190
  • 2
  • 12
9
votes
1 answer

How to avoid default construction of elements in thrust::device_vector?

It seems when creating a new Thrust vector all elements are 0 by default - I just want to confirm that this will always be the case. If so, is there also a way to bypass the constructor responsible for this behavior for additional speed (since for…
mchen
  • 9,808
  • 17
  • 72
  • 125
9
votes
2 answers

Persistent threads in OpenCL and CUDA

I have read some papers talking about "persistent threads" for GPGPU, but I don't really understand it. Can any one give me an example or show me the use of this programming fashion? What I keep in my mind after reading and googling "persistent…
AmineMs
  • 133
  • 1
  • 7
9
votes
1 answer

Large matrix multiplication on gpu

I need to implement a matrix multiplication on GPU with CUDA for large matrices. Size of each matrix alone is bigger than the GPU memory. So I think I need an algorithm to do that efficiently. I went around the internet but couldn't find any. Can…
Soroosh Khoram
  • 417
  • 1
  • 7
  • 11
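For matrices larger than device memory, the standard answer is block decomposition: stream tiles of A, B and C through the GPU so that only a few tiles are resident at once. A hedged CPU-side sketch in plain Python (names invented; on a real GPU each tile product would be copied to the device and computed there, e.g. with a BLAS routine):

```python
def tiled_matmul(A, B, n, tile):
    """Multiply two n x n matrices (nested lists) tile by tile.

    On a GPU, each (i0, j0, k0) tile triple would be transferred to
    device memory, multiplied there, and the C tile accumulated --
    only O(tile^2) elements per matrix are resident at once.
    """
    C = [[0] * n for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, n, tile):
            for k0 in range(0, n, tile):
                # multiply-accumulate one tile of C
                for i in range(i0, min(i0 + tile, n)):
                    for k in range(k0, min(k0 + tile, n)):
                        a = A[i][k]
                        for j in range(j0, min(j0 + tile, n)):
                            C[i][j] += a * B[k][j]
    return C
```

Overlapping the tile transfers with computation (e.g. using streams and pinned host memory) is what makes this efficient in practice.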
9
votes
2 answers

Shared memory bandwidth Fermi vs Kepler GPU

Has Kepler two times or four times the bandwidth of Fermi while accessing shared memory? The Programming Guide states: Each bank has a bandwidth of 32 bits per two clock cycles for 2.X, and Each bank has a bandwidth of 64 bits per clock…
P Marecki
  • 1,108
  • 15
  • 19
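Taking the two Programming Guide quotes in the question above at face value, the per-bank ratio works out to 4x. The sketch below just does that arithmetic; achieved shared-memory throughput on real hardware also depends on bank conflicts, access width, and clock rates:

```python
# Per-bank shared memory bandwidth, straight from the quoted figures.
fermi_bits_per_cycle = 32 / 2    # 32 bits per two clock cycles (CC 2.x)
kepler_bits_per_cycle = 64 / 1   # 64 bits per clock cycle

ratio = kepler_bits_per_cycle / fermi_bits_per_cycle
print(ratio)  # 4.0 per bank, per clock cycle
```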
9
votes
3 answers

OpenCL FFT on both Nvidia and AMD hardware?

I'm working on a project that needs to make use of FFTs on both Nvidia and AMD graphics cards. I initially looked for a library that would work on both (thinking this would be the OpenCL way) but I wasn't having any luck. Someone suggested to me…
Lorentz
  • 91
  • 1
  • 2
8
votes
3 answers

physical memory on AMD devices: local vs private

I'm writing an algorithm in OpenCL in which I'd need every work unit to remember a fair portion of data, say something between a long[70] and a long[200] or so per kernel. Recent AMD devices have 32 KiB __local memory, which is (for the given amount…
user1111929
  • 6,050
  • 9
  • 43
  • 73
8
votes
2 answers

OpenCL AMD vs NVIDIA performance

I implemented a simple kernel which is some sort of a convolution. I measured it on NVIDIA GT 240. It took 70 ms when written on CUDA and 100 ms when written on OpenCL. Ok, I thought, NVIDIA compiler is better optimized for CUDA (or I'm doing…
AdelNick
  • 982
  • 1
  • 8
  • 17
8
votes
3 answers

Fast rasterizing of text and vector art

Suppose there is a lot of vector shapes (Bezier curves which determine the boundary of a shape). For example a page full of tiny letters. What is the fastest way to create a bitmap out of it? I once saw a demo several years ago (can't find it now)…
Ecir Hana
  • 10,864
  • 13
  • 67
  • 117
8
votes
1 answer

What do work items execute when conditionals are used in GPU programming?

If you have work items executing in a wavefront and there is a conditional such as: if(x){ ... } else{ .... } What do the work-items execute? Is it the case whereby all work-items in the wavefront will execute the first branch…
Roger
  • 3,411
  • 5
  • 23
  • 22
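The short answer to the question above is that a wavefront steps through both sides of a divergent branch in lockstep, with inactive lanes masked off. A toy simulation of that masking idea in plain Python (not real GPU semantics, and the function name is invented):

```python
def run_wavefront(xs, then_fn, else_fn):
    """Simulate predicated execution of `if (x) { ... } else { ... }`.

    All lanes step through BOTH branches in lockstep; a per-lane mask
    decides whose results are kept. No lane truly skips a branch, which
    is why divergence costs time.
    """
    mask = [bool(x) for x in xs]
    then_results = [then_fn(x) for x in xs]   # every lane runs 'then'
    else_results = [else_fn(x) for x in xs]   # every lane runs 'else'
    return [t if m else e
            for m, t, e in zip(mask, then_results, else_results)]

out = run_wavefront([0, 1, 2, 0],
                    then_fn=lambda x: x * 10,
                    else_fn=lambda x: -1)
print(out)  # [-1, 10, 20, -1]
```

Hardware can skip a branch entirely only when every lane in the wavefront agrees on the condition.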
8
votes
1 answer

PyCUDA: Querying Device Status (Memory specifically)

PyCUDA's documentation mentions Driver Interface calls in passing, but I'm a bit thick and can't see how to get information such as 'SHARED_SIZE_BYTES' out of my code. Can anyone point me to any examples of querying the device in this way? Is it…
Bolster
  • 7,460
  • 13
  • 61
  • 96
8
votes
1 answer

Is there an alternative to OpenCL+PyOpenCL for multiplatform GPGPU compute?

Support for OpenCL on Macs is going to end in macOS 10.15, so people invested in PyOpenCL+OpenCL as a means for doing general-purpose GPU (+CPU) compute will soon start to lose a key platform. So my questions are: Are there any viable…
Colin Stark
  • 301
  • 1
  • 10