Questions tagged [memory-bandwidth]

79 questions
134
votes
13 answers

Any optimization for random access on a very big array when the value in 95% of cases is either 0 or 1?

Is there any possible optimization for random access on a very big array (I currently use uint8_t, and I'm asking about what's better) uint8_t MyArray[10000000]; when the value at any position in the array is 0 or 1 for 95% of all cases, 2 in 4%…
JohnAl
  • 1,064
  • 2
  • 10
  • 18
86
votes
1 answer

memory bandwidth for many channels x86 systems

I'm testing the memory bandwidth on a desktop and a server. Skylake desktop: 4 cores / 8 hardware threads. Skylake server: Xeon 8168, dual socket, 48 cores (24 per socket) / 96 hardware threads. The peak bandwidth of the system is: Peak bandwidth desktop =…
Z boson
  • 32,619
  • 11
  • 123
  • 226
52
votes
8 answers

How to increase performance of memcpy

Summary: memcpy seems unable to transfer over 2GB/sec on my system in a real or test application. What can I do to get faster memory-to-memory copies? Full details: As part of a data capture application (using some specialized hardware), I need to…
leecbaker
  • 3,611
  • 2
  • 35
  • 51
40
votes
4 answers

Why vectorizing the loop over 64-bit elements does not have performance improvement over large buffers?

I am investigating the effect of vectorization on the performance of a program. To that end, I have written the following code: #include #include #include #define LEN 10000000 int main(){ struct timeval…
Pouya
  • 1,871
  • 3
  • 20
  • 25
14
votes
5 answers

Can the Intel performance monitor counters be used to measure memory bandwidth?

Can the Intel PMU be used to measure per-core read/write memory bandwidth usage? Here "memory" means to DRAM (i.e., not hitting in any cache level).
BeeOnRope
  • 60,350
  • 16
  • 207
  • 386
13
votes
3 answers

What specifically marks an x86 cache line as dirty - any write, or is an explicit change required?

This question is specifically aimed at modern x86-64 cache coherent architectures - I appreciate the answer can be different on other CPUs. If I write to memory, the MESI protocol requires that the cache line is first read into cache, then modified…
Tim
  • 916
  • 7
  • 21
10
votes
1 answer

MOVSD performance depends on arguments

I just noticed that a piece of my code exhibits different performance when copying memory. A test showed that memory copying performance degrades if the address of the destination buffer is greater than the address of the source. Sounds ridiculous, but the…
user4859735
  • 103
  • 6
10
votes
3 answers

How to get memory bandwidth from memory clock/memory speed

FYI, Here are the specs I got from Nvidia http://www.geforce.com/hardware/desktop-gpus/geforce-gtx-680/specifications http://www.geforce.com/hardware/desktop-gpus/geforce-gtx-titan/specifications Note that the memory speed/memory clock are the same…
Blue_Black
  • 307
  • 1
  • 3
  • 11
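For readers skimming this entry, the usual back-of-the-envelope relation is peak bandwidth = per-pin data rate × bus width / 8. The sketch below is an illustration only: `peak_bandwidth_gbs` is a hypothetical helper name, and the bus widths in the usage note are assumptions, not values taken from the truncated excerpt above.

```c
#include <assert.h>

/* Theoretical peak bandwidth in GB/s.
 * data_rate_gbps: effective per-pin data rate in Gbit/s (the "memory speed").
 * bus_width_bits: width of the memory interface in bits.
 * Dividing by 8 converts bits to bytes. */
double peak_bandwidth_gbs(double data_rate_gbps, int bus_width_bits)
{
    assert(data_rate_gbps > 0.0 && bus_width_bits > 0);
    return data_rate_gbps * bus_width_bits / 8.0;
}
```

With an assumed 6.0 Gbit/s data rate, `peak_bandwidth_gbs(6.0, 256)` gives 192 GB/s and `peak_bandwidth_gbs(6.0, 384)` gives 288 GB/s, so two cards with the same memory speed can still differ in bandwidth if their bus widths differ.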
9
votes
3 answers

what does STREAM memory bandwidth benchmark really measure?

I have a few questions about the STREAM (http://www.cs.virginia.edu/stream/ref.html#runrules) benchmark. Below is the comment from stream.c. What is the rationale for the requirement that arrays should be 4 times the size of cache? * (a) Each…
9
votes
1 answer

Roofline model: calculating operational intensity

Say I have a toy loop like this: float x[N]; float y[N]; for (int i = 1; i < N-1; i++) y[i] = a*(x[i-1] - x[i] + x[i+1]); And I assume my cache line is 64 bytes (i.e. big enough). Then I will have (per frame) basically 2 accesses to the RAM and 3…
Armen Avetisyan
  • 1,140
  • 10
  • 29
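As a rough illustration of the arithmetic this question is asking about, operational intensity is FLOPs divided by bytes moved to or from DRAM. The sketch below is a sketch under stated assumptions, not the asker's or any answer's method: the function names are hypothetical, neighbouring `x` elements are assumed to stay in cache, and the optional write-allocate traffic for `y` is an extra assumption.

```c
#include <assert.h>

/* Operational intensity in FLOP/byte: work done per byte of DRAM traffic. */
double operational_intensity(double flops_per_iter, double bytes_per_iter)
{
    assert(bytes_per_iter > 0.0);
    return flops_per_iter / bytes_per_iter;
}

/* Stencil y[i] = a*(x[i-1] - x[i] + x[i+1]) on float arrays.
 * Assumption: x[i-1] and x[i] are still cached from earlier iterations,
 * so each iteration streams one new x element in and one y element out.
 * write_allocate != 0 additionally counts the read of y's cache line
 * that a write-allocate cache performs before the store. */
double stencil_intensity(int write_allocate)
{
    const double flops = 3.0;                /* one subtract, one add, one multiply */
    double bytes = 2.0 * sizeof(float);      /* load x[i+1], store y[i] */
    if (write_allocate)
        bytes += sizeof(float);              /* y line read before being overwritten */
    return operational_intensity(flops, bytes);
}
```

Under these assumptions `stencil_intensity(0)` is 3 FLOPs over 8 bytes, and counting write-allocate traffic lowers it to 3 FLOPs over 12 bytes; either way the loop sits well on the memory-bound side of a typical roofline.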
9
votes
2 answers

Why is memset slow?

The spec for my CPU says it should get 5.336GB/s bandwidth to memory. To test this, I wrote a simple program that runs memset (or memcpy) on a big array and reports the timing. I'm showing 3.8GB/s on memset and 1.9GB/s on memcpy. …
Jeff Guy
  • 157
  • 1
  • 9
7
votes
2 answers

OpenMP and cores/threads

My CPU is a Core i3 330M with 2 cores and 4 threads. When I execute the command cat /proc/cpuinfo in my terminal, it looks like I have 4 CPUs. When I use the OpenMP function omp_get_num_procs() I also get 4. Now I have a standard C++ vector class, I…
Benjamin
  • 366
  • 1
  • 3
  • 8
6
votes
5 answers

Efficient memory bandwidth use for streaming

I have an application that streams through 250 MB of data, applying a simple and fast neural-net threshold function to the data chunks (which are just 2 32-bit words each). Based on the result of the (very simple) compute, the chunk is unpredictably…
SPWorley
  • 11,550
  • 9
  • 43
  • 63
5
votes
1 answer

Erroneous single thread memory bandwidth benchmark

In an attempt to measure the bandwidth of the main memory, I have come up with the following approach. Code (for the Intel compiler) #include #include // std::cout #include // std::numeric_limits #include //…
5
votes
0 answers

Can x86's lock prefix on uncacheable memory cause a Denial of Service on memory bandwidth?

Can an instruction with a lock prefix starve the rest of the CPUs (virtual machines) of memory bandwidth in a virtualized environment? For example, consider the following piece of code: loop: lock inc dword [rax] jmp loop Now assume that rax…
joz
  • 319
  • 1
  • 9