Questions tagged [bank-conflict]

A latency problem caused by multiple threads accessing the same bank of a shared memory system. At present, this latency issue is most common on NVIDIA and ATI graphics cards.

NVIDIA and ATI graphics cards with shared memory experience bank conflicts when multiple threads attempt out-of-order access to the same bank of memory. For more information, please see the Stack Overflow question:

What is a bank conflict? (Doing Cuda/OpenCL programming)

41 questions
2 votes, 1 answer

shared memory bank conflict with char array

I understand the bank conflict when dealing with 4-byte data types, but I wonder if we get any bank conflict (4-way/8-way?) with the following code __shared__ char shared[]; foo = shared[threadIdx.x]; The above code leads to 4 consecutive threads…
Karl
2 votes, 1 answer

CUDA bank conflict for L1 cache?

On NVIDIA's 2.x architecture, each warp has 64kb of memory that is by default partitioned into 48kb of Shared Memory and 16kb of L1 cache (servicing global and constant memory). We all know about the bank conflicts of accessing Shared Memory - the…
cmo
1 vote, 0 answers

Understanding the Reduction in Bank Conflicts in CUDA Kernels

I'm working with different CUDA kernels (gemm3, gemm4, and gemm5) for matrix multiplication: gemm3: baseline of shared memory GEMM gemm4: less thread blocks in x dimension gemm5: less blocks in both x and y dimension After profiling, I noticed…
1 vote, 0 answers

Still bank conflict after shared memory padding

Using the padding trick described here, I tested the following code and got the corresponding profiling result. Conflicts were notably diminished, but some still persist. // store conflict __global__ void setRowReadCol(int *out){ __shared__ int…
1 vote, 1 answer

Reading Shared/Local Memory Store/Load bank conflicts hardware counters for OpenCL executable under Nvidia

It is possible to use nvprof to access/read bank-conflict counters for a CUDA executable: nvprof --events shared_st_bank_conflict,shared_ld_bank_conflict my_cuda_exe However, it does not work for code that uses OpenCL rather than CUDA. Is there…
Artyom
1 vote, 1 answer

OpenCL bank conflict - dropping memory / corrupting data?

I apologize in advance for the vagueness of this question. Background: I am attempting to write a morphological image processing function in OpenCL. I have a __local buffer which I use to store data for every pixel (each pixel is represented by a…
Reefpoints
1 vote, 1 answer

CUDA: overloading of shared memory to implement reduction approach with multiple arrays

I have 5 large arrays A(N*5), B(N*5), C(N*5), D(N*5), E(N*2); the numbers 5 and 2 represent the components of these variables in different planes/axes. That's why I have structured the arrays in this fashion, so I can visualize the data when I am writing…
1 vote, 1 answer

GPU shared memory practical example

I have an array like this: data[16] = {10,1,8,-1,0,-2,3,5,-2,-3,2,7,0,11,0,2} I want to compute the reduction of this array using shared memory on a G80 GPU. The kernel as cited in the NVIDIA document is like that: __global__ void reduce1(int…
1 vote, 1 answer

Shared memory bank conflict in CUDA Fortran when loading 2D data from global memory

I am accessing global memory to load data to shared memory and would like to know if there is a bank conflict. Here is the setup: In global memory: g_array. A 2D matrix of size (256, 64) This is how I load the array data from global memory to…
Adjeiinfo
1 vote, 1 answer

Shared memory configuration for prefetching

In my program I use shared memory to prefetch data. A 2D block of threads, dimensions 8 by 4 (32), gets 8 * 4 * 8 * sizeof(float4) bytes of shared memory. Each thread copies 8 float4s in a loop: inline __device__ void pack(const float4…
Dori
1 vote, 1 answer

What's the mechanism of the warps and the banks in CUDA?

I'm a rookie learning CUDA parallel programming, and I'm confused about global memory access on the device. It's about the warp model and coalescing. There are some points: It's said that threads in one block are split into warps. In each warp…
Han
1 vote, 1 answer

Bank conflicts in 2.x devices

What is a bank conflict in devices with compute capability 2.x? As I understand the CUDA C Programming Guide, in 2.x devices, if two threads access the same 32-bit word in the same shared memory bank, it does not cause a bank conflict. Instead, the word is…
gmemon
1 vote, 1 answer

Does reading an int array from shared memory preclude bank conflicts?

I am designing a CUDA kernel that will be launched with 16 threads per thread block. I have an array of N ints in shared memory (i.e. per thread block) that I wish to process. If the access pattern of the threads is consecutive into the array then…
twerdster
0 votes, 2 answers

Bank-Conflict-Free Access in shared memory

I have to use shared memory that is 64 elements in size, twice the number of banks and twice the number of threads in a warp. How should I address them to yield bank-conflict-free access?
0 votes, 1 answer

CUDA shared memory bank conflict unexpected timing

I was trying to reproduce a bank conflict scenario (minimal working example here) and decided to perform a benchmark when a warp (32 threads) access 32 integers of size 32-bits each in the following 2 scenarios: When there is no bank conflict…
Ferdinand Mom