Questions tagged [memory-bandwidth]
79 questions
134
votes
13 answers
Any optimization for random access on a very big array when the value in 95% of cases is either 0 or 1?
Is there any possible optimization for random access on a very big array (I currently use uint8_t, and I'm asking about what's better)
uint8_t MyArray[10000000];
when the value at any position in the array is
0 or 1 for 95% of all cases,
2 in 4%…

JohnAl
- 1,064
- 2
- 10
- 18
86
votes
1 answer
memory bandwidth for many channels x86 systems
I'm testing the memory bandwidth on a desktop and a server.
Skylake desktop 4 cores/8 hardware threads
Skylake server Xeon 8168 dual socket 48 cores (24 per socket) / 96 hardware threads
The peak bandwidth of the system is
Peak bandwidth desktop =…

Z boson
- 32,619
- 11
- 123
- 226
52
votes
8 answers
How to increase performance of memcpy
Summary:
memcpy seems unable to transfer over 2GB/sec on my system in a real or test application. What can I do to get faster memory-to-memory copies?
Full details:
As part of a data capture application (using some specialized hardware), I need to…

leecbaker
- 3,611
- 2
- 35
- 51
40
votes
4 answers
Why vectorizing the loop over 64-bit elements does not have performance improvement over large buffers?
I am investigating the effect of vectorization on the performance of a program. To this end, I have written the following code:
#include
#include
#include
#define LEN 10000000
int main(){
struct timeval…

Pouya
- 1,871
- 3
- 20
- 25
14
votes
5 answers
Can the Intel performance monitor counters be used to measure memory bandwidth?
Can the Intel PMU be used to measure per-core read/write memory bandwidth usage? Here "memory" means to DRAM (i.e., not hitting in any cache level).

BeeOnRope
- 60,350
- 16
- 207
- 386
13
votes
3 answers
What specifically marks an x86 cache line as dirty - any write, or is an explicit change required?
This question is specifically aimed at modern x86-64 cache-coherent architectures - I appreciate the answer can be different on other CPUs.
If I write to memory, the MESI protocol requires that the cache line is first read into cache, then modified…

Tim
- 916
- 7
- 21
10
votes
1 answer
MOVSD performance depends on arguments
I just noticed that pieces of my code exhibit different performance when copying memory. A test showed that memory-copying performance degrades if the address of the destination buffer is greater than the address of the source. Sounds ridiculous, but the…

user4859735
- 103
- 6
10
votes
3 answers
How to get memory bandwidth from memory clock/memory speed
FYI, here are the specs I got from Nvidia:
http://www.geforce.com/hardware/desktop-gpus/geforce-gtx-680/specifications
http://www.geforce.com/hardware/desktop-gpus/geforce-gtx-titan/specifications
Note that the memory speed/memory clock are the same…

Blue_Black
- 307
- 1
- 3
- 11
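One common way to relate memory speed to bandwidth, sketched below, is peak bandwidth = effective per-pin data rate × bus width / 8. The function name `peak_bandwidth_gbs` and the sample numbers in the note (a 6.0 Gbps data rate on 256-bit and 384-bit buses) are illustrative assumptions, not values quoted in the question; check them against the linked spec pages.

```c
/* Theoretical peak memory bandwidth in GB/s (hypothetical helper).
 * data_rate_gbps: effective data rate per pin, in Gbit/s
 * bus_width_bits: memory interface width, in bits
 * Dividing by 8 converts bits to bytes. */
static double peak_bandwidth_gbs(double data_rate_gbps, unsigned bus_width_bits)
{
    return data_rate_gbps * (double)bus_width_bits / 8.0;
}
```

Under those assumed inputs, `peak_bandwidth_gbs(6.0, 256)` evaluates to 192 GB/s and `peak_bandwidth_gbs(6.0, 384)` to 288 GB/s, which shows how two cards with the same memory speed can differ in bandwidth when their bus widths differ.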
9
votes
3 answers
what does STREAM memory bandwidth benchmark really measure?
I have a few questions about the STREAM benchmark (http://www.cs.virginia.edu/stream/ref.html#runrules).
Below is the comment from stream.c. What is the rationale for the requirement that arrays should be 4 times the size of cache?
* (a) Each…

yeeha
- 139
- 2
- 8
9
votes
1 answer
Roofline model: calculating operational intensity
Say I have a toy loop like this
float x[N];
float y[N];
for (int i = 1; i < N-1; i++)
    y[i] = a*(x[i-1] - x[i] + x[i+1]);
I assume my cache line is 64 bytes (i.e. big enough). Then I will have (per frame) basically 2 accesses to RAM and 3…

Armen Avetisyan
- 1,140
- 10
- 29
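As a rough worked example for the loop in the excerpt above: operational intensity is the number of floating-point operations divided by the bytes moved to and from DRAM. The sketch below assumes 4-byte floats, assumes that neighbouring `x` values are served from cache so each iteration streams one new `x` element and one `y` element, and counts three floating-point operations per iteration (one subtraction, one addition, one multiplication). The names `operational_intensity` and `stencil_bytes_per_iter`, and the write-allocate variant, are illustrative assumptions rather than anything taken from the question.

```c
#include <stddef.h>

/* Operational intensity in FLOP/byte: floating-point operations
 * divided by bytes moved between DRAM and the core. */
static double operational_intensity(double flops, double dram_bytes)
{
    return flops / dram_bytes;
}

/* Hypothetical helper: DRAM bytes per iteration of
 *   y[i] = a * (x[i-1] - x[i] + x[i+1]);
 * Assumes neighbouring x values hit in cache, so only one new
 * x element is read and one y element is written per iteration.
 * If write_allocate is nonzero, the y element is also read from
 * DRAM before being written (i.e. no streaming stores). */
static double stencil_bytes_per_iter(size_t elem_size, int write_allocate)
{
    double reads  = (double)elem_size;                       /* new x element */
    double writes = (double)elem_size;                       /* y[i]          */
    double rfo    = write_allocate ? (double)elem_size : 0.0; /* read-for-ownership */
    return reads + writes + rfo;
}
```

With a 4-byte element, `operational_intensity(3.0, stencil_bytes_per_iter(4, 0))` gives 3/8 = 0.375 FLOP/byte, and passing `1` for `write_allocate` gives 3/12 = 0.25 FLOP/byte.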
9
votes
2 answers
Why is memset slow?
The spec for my CPU says it should get 5.336 GB/s bandwidth to memory. To test this, I wrote a simple program that runs memset (or memcpy) on a big array and reports the timing. I'm seeing 3.8 GB/s on memset and 1.9 GB/s on memcpy. …

Jeff Guy
- 157
- 1
- 9
7
votes
2 answers
OpenMP and cores/threads
My CPU is a Core i3 330M with 2 cores and 4 threads. When I execute the command cat /proc/cpuinfo in my terminal, it looks as if I have 4 CPUs. When I use the OpenMP function omp_get_num_procs() I also get 4.
Now I have a standard C++ vector class, I…

Benjamin
- 366
- 1
- 3
- 8
6
votes
5 answers
Efficient memory bandwidth use for streaming
I have an application that streams through 250 MB of data, applying a simple and fast neural-net threshold function to the data chunks (which are just 2 32-bit words each). Based on the result of the (very simple) compute, the chunk is unpredictably…

SPWorley
- 11,550
- 9
- 43
- 63
5
votes
1 answer
Erroneous single thread memory bandwidth benchmark
In an attempt to measure the bandwidth of the main memory, I have come up with the following approach.
Code (for the Intel compiler)
#include
#include // std::cout
#include // std::numeric_limits
#include //…

Nitin Malapally
- 534
- 2
- 10
5
votes
0 answers
Can x86's lock prefix on uncacheable memory cause a Denial of Service on memory bandwidth?
Can an instruction with a lock prefix starve the rest of the CPUs (virtual machines) of memory bandwidth in a virtualized environment?
For example, consider the following piece of code
loop:
lock inc dword [rax]
jmp loop
Now assume that rax…

joz
- 319
- 1
- 9