Questions tagged [gpgpu]

GPGPU is an acronym for the field of computer science known as "General Purpose computing on the Graphics Processing Unit (GPU)"

The two biggest manufacturers of discrete GPUs are NVIDIA and AMD, although Intel has also moved in this direction with the integrated GPUs in its Haswell processors. There are two popular frameworks for GPGPU: NVIDIA's CUDA, which is supported only on its own hardware, and OpenCL, developed by the Khronos Group, a consortium that includes AMD, NVIDIA, Intel, Apple and others. The OpenCL standard is, however, only half-heartedly supported by NVIDIA, so the rivalry among GPU manufacturers is partially mirrored in the rivalry between the programming frameworks.

The attractiveness of using GPUs for other tasks stems largely from the parallel processing capabilities of modern graphics cards: a single card can contain thousands of stream processors, all applying similar operations to different data at very high rates.

In the past, CPUs emulated multiple threads and data streams by interleaving processing tasks on a single core. Over time, CPUs gained multiple cores, each running multiple hardware threads. A modern video card goes further: it integrates many processing units with very fast memory, hosting far more concurrent threads than a typical CPU. This huge increase in threads in flight is achieved through SIMD (Single Instruction, Multiple Data), in which one instruction stream is applied to many data elements at once. The result is an environment uniquely suited to heavy computational loads that can be parallelized. This design also marks one of the main differences between GPUs and CPUs: each does best what it was designed for.
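The SIMD idea described above can be sketched in plain Python: every "thread" runs the same function body, differing only in the data index it operates on. The names (`vector_add`, `launch`) are purely illustrative stand-ins for a GPU kernel and its launch, not any real API:

```python
# Minimal sketch of the SIMD/SIMT model: one instruction stream,
# many data elements. On a GPU, vector_add would be the kernel and
# each index i a hardware thread running in parallel; here we
# emulate the launch with a simple loop.

def vector_add(a, b, out, i):
    # Every "thread" executes the same instruction on its own element.
    out[i] = a[i] + b[i]

def launch(kernel, n, *buffers):
    # Stand-in for launching the kernel over a grid of n threads.
    for i in range(n):
        kernel(*buffers, i)

a = [1.0, 2.0, 3.0, 4.0]
b = [10.0, 20.0, 30.0, 40.0]
out = [0.0] * 4
launch(vector_add, 4, a, b, out)
# out is now [11.0, 22.0, 33.0, 44.0]
```

On real hardware the loop body runs simultaneously across thousands of threads, which is why workloads with this element-wise structure map so well to GPUs.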

More information at http://en.wikipedia.org/wiki/GPGPU

2243 questions
7
votes
1 answer

How to Step-by-Step Debug OpenCL GPU Applications under Windows with an NVidia GPU

I would like to know whether you know of any way to step-by-step debug OpenCL kernels using Windows (my IDE is Visual Studio) and running OpenCL kernels on an NVidia GPU. What I found so far is: with NVIDIA's Nsight you can only profile OpenCL…
Michael
  • 848
  • 8
  • 13
7
votes
0 answers

Where is specified how OpenGL ES 2.0 represents float texture values in the fragment shader?

I am trying to do some GPGPU using OpenGL ES 2.0. It seems to me that the GL_NV_draw_buffers and the GL_OES_texture_float extensions are some of the essentials here. This question relates to the GL_OES_texture_float extension: From the desktop world…
rsp1984
  • 1,877
  • 21
  • 23
7
votes
2 answers

Proper way to inform OpenCL kernels of many memory objects?

In my OpenCL program, I am going to end up with 60+ global memory buffers that each kernel will need to access. What's the recommended way to let each kernel know the location of each of these buffers? The buffers…
int3h
  • 462
  • 4
  • 15
7
votes
2 answers

What is the difference: DRAM Throughput vs Global Memory Throughput

The actual throughput achieved by a kernel is reported by the CUDA profiler using four metrics: global memory load throughput, global memory store throughput, DRAM read throughput, and DRAM write throughput. The CUDA C Best Practices Guide describes Global memory…
user760944
  • 79
  • 1
  • 2
6
votes
2 answers

Poor performance for calculating eigenvalues and eigenvectors on GPU

In some code we need to compute eigenvectors and eigenvalues for the generalized eigenvalue problem with symmetric real matrices (Ax = lambda Bx). This code uses DSPGVX from LAPACK. We wanted to speed it up on the GPU using a MAGMA function. We asked on this…
Open the way
  • 26,225
  • 51
  • 142
  • 196
6
votes
1 answer

Can this OpenCL code be optimized?

I am working on a piece of OpenCL code for a specialized matrix function: for a Dx1 vector v, two DxD matrices A and B and a constant c, return the 1xD vector r where r[i] = c * sum_over_j (v[j] * A[i][j] * B[i][j]). Below is what I have so far, but it…
trolle3000
  • 1,067
  • 2
  • 14
  • 27
6
votes
1 answer

OpenCL: 32-bit and 64-bit popcnt instruction on GPU?

I want to write a program for the GPU (preferably OpenCL), and a large part of the computation consists of counting the number of 1's in a bit array (packed as long or int). So, on modern CPUs I would obviously just use the native __popcnt instruction.…
user1111929
  • 6,050
  • 9
  • 43
  • 73
6
votes
2 answers

Is cudaMemcpy from host to device executed in parallel?

I am curious whether cudaMemcpy is executed on the CPU or the GPU when we copy from host to device. In other words, is the copy a sequential process or is it done in parallel? Let me explain why I ask this: I have an array of 5 million elements. Now, I…
Programmer
  • 6,565
  • 25
  • 78
  • 125
6
votes
3 answers

How much of a modern graphics pipeline uses dedicated hardware?

To put the question another way, if one were to try and reimplement OpenGL or DirectX (or an analogue) using GPGPU (CUDA, OpenCL), where and why would it be slower than the stock implementations on NVIDIA and AMD cards? I can see how…
DaedalusFall
  • 8,335
  • 6
  • 30
  • 43
6
votes
1 answer

4000% Performance Decrease in SYCL when using Unified Shared Memory instead of Device Memory

In SYCL, there are three types of memory: host memory, device memory, and Unified Shared Memory (USM). For host and device memory, data exchange requires explicit copying. Meanwhile, data movement from and to USM is automatically managed by the SYCL…
比尔盖子
  • 2,693
  • 5
  • 37
  • 53
6
votes
2 answers

GPU L1 and L2 cache statistics

I have written some simple benchmarks that perform a series of global memory accesses. When I measure the L1 and L2 cache statistics, I've found out that (in GTX580 that has 16 SMs): total L1 cache misses * 16 != total L2 cache queries Indeed…
Zk1001
  • 2,033
  • 4
  • 19
  • 36
6
votes
3 answers

Quick sort in GLSL?

I'm considering porting a large chunk of processing to the GPU using a GLSL shader. One of the immediate problems I stumbled across is that in one of the steps, the algorithm needs to maintain a list of elements, sort them and take the few largest…
shoosh
  • 76,898
  • 55
  • 205
  • 325
6
votes
2 answers

How does the opencl command queue work, and what can I ask of it

I'm working on an algorithm that does pretty much the same operation a bunch of times. Since the operation consists of some linear algebra (BLAS), I thought I would try using the GPU for this. I've written my kernel and started pushing kernels on the…
Martin Kristiansen
  • 9,875
  • 10
  • 51
  • 83
6
votes
1 answer

How to measure the gflops of a matrix multiplication kernel?

In the book Programming Massively Parallel Processors the number of gflops is used to compare the efficiency of different matrix multiplication kernels. How would I compute this for my own kernels on my own machine? Somewhere in the NVIDIA Forums I…
Framester
  • 33,341
  • 51
  • 130
  • 192
6
votes
1 answer

What are the "long" and "short" scoreboards w.r.t. MIO/L1TEX?

With recent NVIDIA micro-architectures, there's a new (?) taxonomy of warp stall reasons / warp scheduler states. Two of the items in this taxonomy are: Short scoreboard - scoreboard dependency on an MIO queue operation. Long scoreboard -…
einpoklum
  • 118,144
  • 57
  • 340
  • 684