Should I prefer stride one memory access for either reading or writing?

Question

It's well known that accessing memory in a stride one fashion is best for performance.

In situations where

I must access one region of memory for reading,
I must access another region for writing, and
I may only access one of the two regions in a stride one fashion,

should I prefer reading stride one or writing stride one?

One simple, concrete example is a BLAS-like copy-and-permute operation like y := P x. The permutation matrix P is defined entirely by some permutation vector q(i). It has a corresponding inverse permutation vector qinv(i). One could code the required loop as y[qinv(i)] = x[i] or as y[i]=x[q(i)] where the former reads from x stride one and the latter writes to y stride one.
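For readers who want the two candidate loops spelled out, here is a minimal C++ sketch. The function names `permute_scatter`, `permute_gather`, and `invert_permutation` are made up for illustration, and the use of `std::vector` with zero-based indices is an assumption; the question itself only specifies the two loop bodies.

```cpp
#include <cstddef>
#include <vector>

// Scatter form from the question: y[qinv(i)] = x[i].
// Reads x with stride one, writes y at permuted positions.
void permute_scatter(const std::vector<double>& x,
                     const std::vector<std::size_t>& qinv,
                     std::vector<double>& y)
{
    for (std::size_t i = 0; i < x.size(); ++i)
        y[qinv[i]] = x[i];
}

// Gather form from the question: y[i] = x[q(i)].
// Writes y with stride one, reads x at permuted positions.
void permute_gather(const std::vector<double>& x,
                    const std::vector<std::size_t>& q,
                    std::vector<double>& y)
{
    for (std::size_t i = 0; i < y.size(); ++i)
        y[i] = x[q[i]];
}

// Hypothetical helper: builds qinv so that qinv[q[i]] == i.
std::vector<std::size_t> invert_permutation(const std::vector<std::size_t>& q)
{
    std::vector<std::size_t> qinv(q.size());
    for (std::size_t i = 0; i < q.size(); ++i)
        qinv[q[i]] = i;
    return qinv;
}
```

Both functions produce the same y when `qinv` is the inverse of `q`; they differ only in which array is traversed contiguously.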

Ideally one could always code both possibilities, profile them under representative conditions, and choose the faster version. Pretend you could only code one version. Which access pattern would you expect to be faster, based on the behavior of modern memory architectures? Does working in a threaded environment change your response?

Not well known by me. :-) I don't get it. If you could give an example in C maybe I could give an answer, but on the OTOH you might not be interested in an answer from me if I don't get what a BLAS is... ;-) — Prof. Falken, Jan 26 '12 at 15:56
The BLAS are the Basic Linear Algebra Subprograms (http://netlib.org/blas/), a set of high performance numerical building blocks commonly implemented by vendors. — Rhys Ulerich, Jan 26 '12 at 21:43

Evgeny Kluev · Accepted Answer · 2012-01-26T18:05:33.617

The access pattern that you call "writes stride one" (y[i]=x[q(i)]) is usually faster.

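If memory is cached and your data elements are smaller than a cache line, this access pattern requires less memory bandwidth. On a typical write-allocate cache, a store to a line that is not already cached first fetches the whole line and later writes it back, so a scattered store moves the line twice, while a scattered load moves it only once.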

Modern processors usually have more load execution units than store units. Also, the next Intel architecture, Haswell, supports only a GATHER instruction; SCATTER is not yet in their plans. All of this also favors the "writes stride one" pattern.

Working in a threaded environment does not change this.

score 2 · Answer 2 · answered Jun 02 '19 at 08:24

I'd like to share results of my simple benchmarks. Suppose we have two square NxN matrices A and B of doubles, and we want to perform a copy with a transposition:

A = transpose(B)

Algorithms:

Two nested loops such that reads are contiguous and writes are strided.
Two nested loops such that reads are strided and writes are contiguous.
MKL's mkl_domatcopy, run sequentially.

Copy without transposition is used as a baseline. Values of N are taken to be 2^K + 1 to mitigate cache associativity effects.
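The two hand-written variants can be sketched roughly as follows. This is not the benchmark's actual source; the function names are hypothetical, and row-major storage in a flat `std::vector<double>` is an assumption. The MKL variant is not shown.

```cpp
#include <cstddef>
#include <vector>

// Computes A = transpose(B) for n x n matrices stored row-major
// in flat vectors (storage layout is an assumption of this sketch).

// Variant 1: reads from B are contiguous, writes to A are strided by n.
void transpose_read_contiguous(std::vector<double>& A,
                               const std::vector<double>& B,
                               std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            A[j * n + i] = B[i * n + j];
}

// Variant 2: reads from B are strided by n, writes to A are contiguous.
void transpose_write_contiguous(std::vector<double>& A,
                                const std::vector<double>& B,
                                std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            A[i * n + j] = B[j * n + i];
}
```

The only difference between the two is which side of the assignment uses the inner-loop index j as the fast-moving (stride one) index.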

Intel Core i7-4770 with GCC 8.3.0 (-O3 -m64 -march=native) and Intel MKL 2019.0.1:

[Figure: benchmark results on Intel Core i7-4770]

Intel Xeon E5-2650 v3 with GCC 7.3.0 (-O3 -m64 -march=native) and Intel MKL 2017.0.1:

[Figure: benchmark results on Intel Xeon E5-2650 v3]

Numbers and C++ source code

That's fun. Thank you for quantifying. — Rhys Ulerich, Jun 02 '19 at 12:28
