Questions tagged [gpgpu]

GPGPU is an acronym for the field of computer science known as "General Purpose computing on the Graphics Processing Unit (GPU)"

The two biggest manufacturers of discrete GPUs are NVIDIA and AMD, although Intel has also been moving in this direction with the integrated GPUs in its Haswell-generation processors. There are two popular frameworks for GPGPU: NVIDIA's CUDA, which is supported only on its own hardware, and OpenCL, developed by the Khronos Group, a consortium that includes AMD, NVIDIA, Intel, Apple and others. NVIDIA's support for the OpenCL standard has been half-hearted, so the rivalry among GPU manufacturers is partially mirrored in a rivalry between the programming frameworks.

The attractiveness of using GPUs for other tasks largely stems from the parallel processing capabilities of modern graphics cards. Some cards contain thousands of stream processors executing the same operations on different data at very high rates.

In the past, CPUs emulated multiple threads or data streams by interleaving processing tasks on a single core. Over time we gained multiple cores, each running multiple threads. Modern video cards go further: they house processors hosting many more threads or streams than most CPUs, tightly integrated with extremely fast memory. This huge increase in concurrently executing threads is achieved through SIMD (Single Instruction, Multiple Data), in which one instruction is applied to many data elements at once. This makes the GPU uniquely suited to heavy computational loads that can be parallelized. It also marks one of the main differences between GPUs and CPUs: each does best what it was designed for.
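As a loose illustration of the SIMD idea (a sketch in NumPy, not actual GPU code): applying one operation to a whole array at once, rather than looping element by element, is the same single-instruction-multiple-data pattern that GPU hardware executes natively.

```python
import numpy as np

# Scalar view: one instruction per element, the way a plain CPU loop works.
def saxpy_scalar(a, x, y):
    out = [0.0] * len(x)
    for i in range(len(x)):     # one multiply-add per iteration
        out[i] = a * x[i] + y[i]
    return out

# SIMD view: a single vectorized operation applied to all elements at once.
# NumPy dispatches this to vectorized native code; a GPU kernel would map
# each element to its own thread in the same pattern.
def saxpy_simd(a, x, y):
    return a * x + y

x = np.arange(4, dtype=np.float32)   # [0, 1, 2, 3]
y = np.ones(4, dtype=np.float32)
print(saxpy_simd(2.0, x, y))         # [1. 3. 5. 7.]
```

The function names here are illustrative ("saxpy" is the classic single-precision a*x + y benchmark); on a real GPU the per-element work would be a CUDA or OpenCL kernel rather than a NumPy expression.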

More information at http://en.wikipedia.org/wiki/GPGPU

2243 questions
14
votes
1 answer

Branch predication on GPU

I have a question about branch predication in GPUs. As far as I know, in GPUs, they do predication with branches. For example I have a code like this: if (C) A else B so if A takes 40 cycles and B takes 50 cycles to finish execution, if assuming…
Zk1001
  • 2,033
  • 4
  • 19
  • 36
14
votes
2 answers

Are GPU/CUDA cores SIMD ones?

Let's take the nVidia Fermi Compute Architecture. It says: The first Fermi based GPU, implemented with 3.0 billion transistors, features up to 512 CUDA cores. A CUDA core executes a floating point or integer instruction per clock for a thread. The…
Marc Andreson
  • 3,405
  • 5
  • 35
  • 51
14
votes
1 answer

Why is the constant memory size limited in CUDA?

According to "CUDA C Programming Guide", a constant memory access benefits only if a multiprocessor constant cache is hit (Section 5.3.2.4)1. Otherwise there can be even more memory requests for a half-warp than in case of the coalesced global…
AdelNick
  • 982
  • 1
  • 8
  • 17
13
votes
2 answers

GPU utilization 0% during TensorFlow retraining for poets

I am following instructions for TensorFlow Retraining for Poets. GPU utilization seemed low so I instrumented the retrain.py script per the instructions in Using GPU. The log verifies that the TF graph is being built on GPU. I am retraining for a…
Lars Ericson
  • 1,952
  • 4
  • 32
  • 45
13
votes
4 answers

Import PGP public key by string

I want to import a PGP public key into my keychain in a script, but I don't want it to write the contents to a file. Right now my script does this: curl http://example.com/pgp-public-key -o /tmp/pgp && gpg --import /tmp/gpg How could I write this…
Paradoxis
  • 4,471
  • 7
  • 32
  • 66
13
votes
3 answers

Does Global Work Size Need to be Multiple of Work Group Size in OpenCL?

Hello: Does Global Work Size (Dimensions) Need to be Multiple of Work Group Size (Dimensions) in OpenCL? If so, is there a standard way of handling matrices not a multiple of the work group dimensions? I can think of two possibilities: Dynamically…
Junier
  • 1,622
  • 1
  • 15
  • 21
13
votes
1 answer

OpenMP 4.0 in GCC: offload to nVidia GPU

TL;DR - Does GCC (trunk) already support OpenMP 4.0 offloading to nVidia GPU? If so, what am I doing wrong? (description below). I'm running Ubuntu 14.04.2 LTS. I have checked out the most recent GCC trunk (dated 25 Mar 2015). I have installed the…
Marc Andreson
  • 3,405
  • 5
  • 35
  • 51
13
votes
8 answers

Feasibility of GPU as a CPU?

What do you think the future of GPU as a CPU initiatives like CUDA are? Do you think they are going to become mainstream and be the next adopted fad in the industry? Apple is building a new framework for using the GPU to do CPU tasks and there has…
AutomaticPixel
  • 261
  • 6
  • 11
13
votes
1 answer

Are GPU shaders Turing complete

I understand that complete GPUs are behemoths of computing - including every step of calculation, and memory. So obviously a GPU can compute whatever we want - it's Turing complete. My question is in regard to a single shader on various GPUs…
Trevor
  • 1,858
  • 4
  • 21
  • 28
13
votes
4 answers

Parallel GPU computing using OpenCV

I have an application that requires processing multiple images in parallel in order to maintain real-time speed. It is my understanding that I cannot call OpenCV's GPU functions in a multi-threaded fashion on a single CUDA device. I have tried an…
mmccullo
  • 173
  • 1
  • 1
  • 7
13
votes
5 answers

How to check for GPU on CentOS Linux

It is suggested that on Linux, GPU be found with the command lspci | grep VGA. It works fine on Ubuntu but when I try to use the same on CentOS, it says lspci command is not found. How can I check for the GPU card on CentOS. And note that I'm not…
pythonic
  • 20,589
  • 43
  • 136
  • 219
12
votes
6 answers

Disassemble an OpenCL kernel?

I'm not sure if it's possible. I want to study OpenCL in-depth, so I was wondering if there is a tool to disassemble an compiled OpenCL kernel. For normal x86 executable, I can use objdump to get a disassembly view. Is there a similar tool for…
Patrick
  • 4,186
  • 9
  • 32
  • 45
12
votes
1 answer

How to calculate pairwise distance matrix on the GPU

The bottleneck in my code is the area where I calculate a pairwise distance matrix. Since this is the slowest part by far, I have spent much time in speeding up my code. I have found many speedups using articles online, but the gains have been…
Paul Terwilliger
  • 1,596
  • 1
  • 20
  • 45
12
votes
2 answers

GPU Shared Memory Bank Conflict

I am trying to understand how bank conflicts take place. I have an array of size 256 in global memory and I have 256 threads in a single block, and I want to copy the array to shared memory. Therefore every thread copies one…
scatman
  • 14,109
  • 22
  • 70
  • 93
12
votes
1 answer

How do I use Nvidia Multi-process Service (MPS) to run multiple non-MPI CUDA applications?

Can I run non-MPI CUDA applications concurrently on NVIDIA Kepler GPUs with MPS? I'd like to do this because my applications cannot fully utilize the GPU, so I want them to co-run together. Is there any code example to do this?
dalibocai
  • 2,289
  • 5
  • 29
  • 45