Questions tagged [gpgpu]

GPGPU is an acronym for the field of computer science known as "General Purpose computing on the Graphics Processing Unit (GPU)"

The two biggest manufacturers of GPUs are NVIDIA and AMD, although Intel has recently been moving in this direction, for example with the integrated graphics in its Haswell processors. There are two popular frameworks for GPGPU: NVIDIA's CUDA, which is supported only on NVIDIA hardware, and OpenCL, developed by the Khronos Group, a consortium that includes AMD, NVIDIA, Intel, Apple, and others. However, the OpenCL standard is only half-heartedly supported by NVIDIA, so the rivalry among GPU manufacturers is partly mirrored in a rivalry between the programming frameworks.

The attractiveness of using GPUs for other tasks stems largely from the parallel processing capabilities of modern graphics cards: a single card can contain thousands of stream processors, all applying the same operations to different data at very high rates.

In the past, CPUs emulated multiple threads and data streams by interleaving processing tasks on a single core. Over time, CPUs gained multiple cores, each running multiple threads. Modern video cards go much further: they contain many processing units hosting far more threads than any CPU, integrated with extremely fast memory. This huge number of concurrently executing threads is achieved through SIMD (Single Instruction, Multiple Data), in which one instruction stream operates on many data elements at once. The result is an environment uniquely suited to heavy computational loads that can be parallelized, and this design is also one of the main differences between GPUs and CPUs: each does best what it was designed for.

More information at http://en.wikipedia.org/wiki/GPGPU
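The SIMD, data-parallel programming model described above can be sketched in plain Python. This is a toy CPU simulation of the idea, not real GPU code: one "kernel" function is written from the point of view of a single index, and a launcher runs it over every element. On a real GPU these invocations would execute in parallel across thousands of threads; the names `saxpy_kernel` and `launch` are illustrative and not part of any actual framework.

```python
# Toy model of the SIMD / data-parallel style used by GPGPU frameworks:
# one kernel function, executed once per index, touching different data
# each time while performing the same instructions.

def saxpy_kernel(i, a, x, y, out):
    # The same instruction for every "thread"; only the data (index i) differs.
    out[i] = a * x[i] + y[i]

def launch(kernel, n, *args):
    # A real GPU would run these bodies in parallel across many threads;
    # here we simply loop to show the programming model.
    for i in range(n):
        kernel(i, *args)

n = 4
x = [1.0, 2.0, 3.0, 4.0]
y = [10.0, 20.0, 30.0, 40.0]
out = [0.0] * n
launch(saxpy_kernel, n, 2.0, x, y, out)
print(out)  # [12.0, 24.0, 36.0, 48.0]
```

In CUDA or OpenCL the loop disappears entirely: each hardware thread receives its own index and executes the kernel body once, which is why workloads that map cleanly onto independent elements parallelize so well.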

2243 questions
19
votes
2 answers

NVIDIA CUDA Video Encoder (NVCUVENC) input from device texture array

I am modifying the CUDA Video Encoder (NVCUVENC) encoding sample found in the SDK samples pack so that the data comes not from external yuv files (as is done in the sample) but from a cudaArray which is filled from a texture. So the key API method that encodes…
Michael IV
  • 11,016
  • 12
  • 92
  • 223
18
votes
1 answer

The variation of cache misses in GPU

I have been toying with an OpenCL kernel that accesses 7 global memory buffers, does something with the values, and stores the result back to an 8th global memory buffer. As I observed, as the input size increases, the L1 cache miss ratio (= misses/(misses + hits))…
Zk1001
  • 2,033
  • 4
  • 19
  • 36
18
votes
2 answers

Error using Tensorflow with GPU

I've tried a bunch of different Tensorflow examples, which work fine on the CPU but generate the same error when I try to run them on the GPU. One little example is this: import tensorflow as tf # Creates a graph. a = tf.constant([1.0, 2.0,…
user5654767
  • 271
  • 1
  • 3
  • 6
18
votes
2 answers

Continuous Integration Service for GPU package?

Continuous integration services are wonderful for continually testing updates to packages for various languages. These include services like Travis-CI, Jenkins, and Shippable among many others. However, as I have explored these different services…
cdeterman
  • 19,630
  • 7
  • 76
  • 100
18
votes
4 answers

Double precision floating point in CUDA

Does CUDA support double precision floating point numbers? Also, what are the reasons for this?
cuda-dev
  • 181
  • 1
  • 1
  • 3
18
votes
3 answers

Why does CUDA code run so much faster in NVIDIA Visual Profiler?

A piece of code that takes well over 1 minute on the command line was done in a matter of seconds in NVIDIA Visual Profiler (running the same .exe). So the natural question is: why? Is there something wrong with the command line, or does Visual Profiler…
mchen
  • 9,808
  • 17
  • 72
  • 125
18
votes
2 answers

Numpy, BLAS and CUBLAS

Numpy can be "linked/compiled" against different BLAS implementations (MKL, ACML, ATLAS, GotoBlas, etc). That's not always straightforward to configure but it is possible. Is it also possible to "link/compile" numpy against NVIDIA's CUBLAS…
Ümit
  • 17,379
  • 7
  • 55
  • 74
17
votes
5 answers

Is it worth offloading FFT computation to an embedded GPU?

We are considering porting an application from a dedicated digital signal processing chip to run on generic x86 hardware. The application does a lot of Fourier transforms, and from brief research, it appears that FFTs are fairly well suited to…
Ian Renton
  • 699
  • 2
  • 8
  • 21
17
votes
4 answers

Any Lisp extensions for CUDA?

I just noted that one of the first languages for the Connection-Machine of W.D. Hillis was *Lisp, an extension of Common Lisp with parallel constructs. The Connection-Machine was a massively parallel computer with SIMD architecture, much the same as…
Halberdier
  • 1,164
  • 11
  • 15
17
votes
3 answers

Basic GPU application, integer calculations

Long story short, I have done several prototypes of interactive software. I use pygame now (a python SDL wrapper) and everything is done on the CPU. I am starting to port it to C now, and at the same time I am searching for existing possibilities to use some…
Mikhail V
  • 1,416
  • 1
  • 14
  • 23
17
votes
1 answer

__forceinline__ effect at CUDA C __device__ functions

There is a lot of advice on when to use inline functions and when to avoid them in regular C coding. What is the effect of __forceinline__ on CUDA C __device__ functions? Where should it be used and where should it be avoided?
Farzad
  • 3,288
  • 2
  • 29
  • 53
17
votes
4 answers

printing from cuda kernels

I am writing a CUDA program and trying to print something inside the CUDA kernels using the printf function. But when I compile the program I get the error: calling a host function("printf") from a __device__/__global__…
duttasankha
  • 717
  • 2
  • 10
  • 32
17
votes
1 answer

Is there memory protection on GPUs

I don't have much experience with GPUs so please forgive my ignorance. Nowadays, GPUs are being used as GPGPUs for general purpose programming. But I was wondering if GPUs have memory protection and virtualization mechanisms. I mean, for example, you…
pythonic
  • 20,589
  • 43
  • 136
  • 219
16
votes
3 answers

In OpenCL, what does mem_fence() do, as opposed to barrier()?

Unlike barrier() (which I think I understand), mem_fence() does not affect all items in the work group. The OpenCL spec says (section 6.11.10), for mem_fence(): Orders loads and stores of a work-item executing a kernel. (so it applies to a single…
andrew cooke
  • 45,717
  • 10
  • 93
  • 143
16
votes
2 answers

Why does my OpenCL kernel fail on the nVidia driver, but not Intel (possible driver bug)?

I originally wrote an OpenCL program to calculate very large hermitian matrices, where the kernel calculates a single pair of entries in the matrix (the upper triangular portion, and its lower triangular complement). Very early on, I found a very…
stix
  • 1,140
  • 13
  • 36