Questions tagged [gpgpu]

GPGPU is an acronym for the field of computer science known as "General Purpose computing on the Graphics Processing Unit (GPU)"

The two biggest manufacturers of GPUs are NVIDIA and AMD, although Intel has also been moving in this direction with the integrated graphics in its Haswell processors. There are two popular frameworks for GPGPU: NVIDIA's CUDA, which is supported only on NVIDIA hardware, and OpenCL, developed by the Khronos Group, a consortium that includes AMD, NVIDIA, Intel, Apple and others. NVIDIA supports the OpenCL standard only half-heartedly, so the rivalry among GPU manufacturers is partly mirrored in the rivalry between programming frameworks.
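
As a quick illustration of the programming model the two frameworks share, the sketch below shows a CUDA kernel that maps one lightweight GPU thread to one element of the data (the kernel name vecAdd and its parameters are made up for illustration; an OpenCL version would use the __kernel and __global qualifiers and get_global_id(0) instead of the CUDA index arithmetic):

    // One GPU thread computes one output element.
    __global__ void vecAdd(const float *a, const float *b, float *c, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
        if (i < n)                                      // guard threads past the end of the data
            c[i] = a[i] + b[i];
    }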

The attractiveness of using GPUs for other tasks largely stems from the parallel processing capabilities of modern graphics cards: a single card can contain thousands of stream processors working on similar data at very high throughput.

In the past, CPUs emulated multiple threads and data streams by interleaving processing tasks on a single core. Over time we gained multiple cores, each capable of running multiple threads. Modern video cards go much further: a GPU hosts many more concurrent threads (or streams) than a typical CPU, tightly integrated with very fast memory. This huge number of threads in flight is made possible by SIMD (Single Instruction, Multiple Data), in which one instruction is applied to many data elements at once. The result is an environment uniquely suited to heavy computational loads that can be parallelized. This is also one of the main differences between GPUs and CPUs: each does best what it was designed for.
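
To make that thread-to-data mapping concrete, here is a small self-contained CUDA sketch (sizes and names are illustrative, not taken from any question below) that launches roughly a million threads, one per array element, grouped into blocks of 256:

    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    // One thread per element: the classic data-parallel (SIMD/SIMT) pattern.
    __global__ void scale(float *data, float factor, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] *= factor;
    }

    int main()
    {
        const int n = 1 << 20;                  // about a million elements
        const size_t bytes = n * sizeof(float);

        float *host = (float *)malloc(bytes);
        for (int i = 0; i < n; ++i) host[i] = 1.0f;

        float *dev = NULL;
        cudaMalloc((void **)&dev, bytes);
        cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);

        // 256 threads per block; enough blocks to cover all n elements.
        int threads = 256;
        int blocks = (n + threads - 1) / threads;
        scale<<<blocks, threads>>>(dev, 2.0f, n);

        cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost);
        printf("first element: %f\n", host[0]);  // expect 2.0

        cudaFree(dev);
        free(host);
        return 0;
    }

The same structure carries over to OpenCL, where the global work size plays the role of the grid and get_global_id(0) replaces the block/thread index arithmetic.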

More information at http://en.wikipedia.org/wiki/GPGPU

2243 questions
8
votes
4 answers

high precision math on GPU

I'm interested in implementing an algorithm on the GPU using HLSL, but one of my main concerns is that I would like a variable level of precision. Are there techniques out there to emulate 64-bit precision and higher that could be implemented on the…
Mark
  • 267
  • 5
  • 10
8
votes
4 answers

Overlapping transfers and device computation in OpenCL

I am a beginner with OpenCL and I am having difficulty understanding something. I want to improve the transfer of an image between host and device. I made a diagram to explain myself better. Top: what I have now | Bottom: what I want. HtD (Host to…
Alex Placet
  • 567
  • 7
  • 20
8
votes
3 answers

How is a CUDA kernel launched?

I have created a simple CUDA application to add two matrices. It compiles fine. I want to know how the kernel will be launched by all the threads and what the flow will be inside CUDA. I mean, in what fashion will every thread execute each…
ATG
  • 732
  • 3
  • 11
  • 21
7
votes
3 answers

CUDA, OpenCL, PGI, etc.... but what happened to GLSL and Cg?

CUDA, OpenCL, and the GPU options offered by the Portland Group are intriguing... Results are impressive (125-times speedup for some groups). It sounds like the next wave of GPGPU tools is poised to dominate the scientific computing world. …
Pete
  • 10,310
  • 7
  • 53
  • 59
7
votes
2 answers

efficient GPU random memory access with OpenGL

What is the best pattern to get a GPU to efficiently calculate 'anti-functional' routines that usually depend on positioned memory writes instead of reads? E.g. calculating a histogram, sorting, dividing a number by percentages, merging data of…
dronus
  • 10,774
  • 8
  • 54
  • 80
7
votes
2 answers

OpenCL scalar vs vector

I have a simple kernel: __kernel void vecadd(__global const float *A, __global const float *B, __global float *C) { int idx = get_global_id(0); C[idx] = A[idx] + B[idx]; } Why when I change float to float4, kernel…
ldanko
  • 557
  • 8
  • 20
7
votes
1 answer

initializer not allowed for __shared__ variable for cuda

I am doing the following: __shared__ int exForBlockLessThanP = totalElementLessThanPivotEntireBlock[blockIdx.x]; where totalElementLessThanPivotEntireBlock is an array on GPU. The compiler is throwing an error as stated in the title of the…
Programmer
  • 6,565
  • 25
  • 78
  • 125
7
votes
1 answer

Will C++ AMP run on a machine without a compatible GPU?

I understand that C++ AMP is accelerated by GPUs that support DirectX 11. However, my question is, if the compiled C++ AMP program is run on a machine without a DirectX 11 compatible GPU, what happens? Does it get emulated by some software…
Jonathan DeCarlo
  • 2,798
  • 1
  • 20
  • 24
7
votes
1 answer

Should I look into PTX to optimize my kernel? If so, how?

Do you recommend reading your kernel's PTX code to find out how to optimize your kernels further? One example: I read that one can tell from the PTX code whether automatic loop unrolling worked. If it did not, one would have to unroll the…
Framester
  • 33,341
  • 51
  • 130
  • 192
7
votes
0 answers

Kernel silently fails to execute

I am trying to write MergeSort in OpenCL (I know, BitonicSort is faster, but I want to compare them) and currently I have come across a strange problem: if I set the global size to 1 << 24 and the local size to 512, the kernel just fails to execute…
Radim Vansa
  • 5,686
  • 2
  • 25
  • 40
7
votes
3 answers

GPU-"Proof" Hash Function(s)?

I am thinking about designing a p2p network that requires a certain level of proof-of-work for vetting of users (similar to bitcoin) and regulation of spam/ddos. Due to the nature of p2p, the only feasible POW architecture I have seen is the…
user862319
7
votes
1 answer

Synchronizations in GPUs

I have some questions about how GPUs perform synchronization. As I understand it, when a warp encounters a barrier (in OpenCL, say) and the other warps of the same group haven't reached it yet, it has to wait. But what exactly…
Zk1001
  • 2,033
  • 4
  • 19
  • 36
7
votes
2 answers

How to execute parallel compute shaders across multiple compute queues in Vulkan?

Update: This has been solved, you can find further details here: https://stackoverflow.com/a/64405505/1889253 A similar question was asked previously, but that question was initially focused around using multiple command buffers, and triggering the…
axsauze
  • 343
  • 4
  • 14
7
votes
3 answers

How many 'CUDA cores' does each multiprocessor of a GPU have?

I know that devices before the Fermi architecture had 8 SPs in a single multiprocessor. Is the count same in Fermi architecture?
jwdmsd
  • 2,107
  • 2
  • 16
  • 30
7
votes
1 answer

Shouldn't a 3x3 convolution be much faster on GPU (OpenCL)?

I'm learning how to optimize code for GPU. I read about the importance of memory locality. I've also seen some tutorials and examples of GPU convolution. Based on that I wrote and tested several kernels of my own. Surprisingly, I found that the simplest naive…
Prokop Hapala
  • 2,424
  • 2
  • 30
  • 59