
I'm writing a simple N-body simulation with particle-particle interaction. I noticed that when I calculate the relative positions among particles, my code actually runs slower when it performs fewer calculations.

At first I tried the straightforward implementation (assume 1D for simplicity):

for(int i=0; i<N; i++)
{
    for(int j=0; j<N; j++)
    {
        r_rel[i][j] = r[i]-r[j];
    }
}

This is just like filling an NxN matrix. These loops compute every entry of r_rel twice since, in fact, r_rel[i][j] = -r_rel[j][i]. Therefore I tried to spare some calculations by implementing the following solution:

for(int i=1; i<N; i++)
{
    for(int j=0; j<i; j++)
    {
        r_rel[i][j] = r[i]-r[j];
        r_rel[j][i] = -r_rel[i][j];
    }
}

This way I only actually calculate the terms below the diagonal of my NxN matrix of relative positions. I expected the code to be faster as it performs fewer calculations, but when I execute it, it runs noticeably slower. How is this even possible? Thanks!!

Isaac
  • Why not look at the generated code? – melpomene Apr 20 '19 at 08:38
  • Do I understand correctly that the second version is processing only half of the matrix? In that case I am not surprised that it still is slower, because what is done for each element looks like it could take about twice the time and could have worse cache behaviour. How much slower is it? A factor of 2, 20, 100? – Yunnosch Apr 20 '19 at 08:46
  • Yes, it only processes half of the matrix and it's about twice as slow. – Isaac Apr 20 '19 at 08:51
  • are you sure you compare performance of the two versions using same data and configuration? – Nikos M. Apr 20 '19 at 08:52
  • to be sure, the computational complexity of the two versions is still `O(n^2)` for both; the second version does not reduce that, it only affects the constant factor – Nikos M. Apr 20 '19 at 08:53
  • Yes same N and initial particle configuration. I agree that they are both `O(N^2)` but the first performs N^2 calculations while the second only N(N-1)/2. I expected it to be a little faster in actual computation time and don't understand why it is not. – Isaac Apr 20 '19 at 09:03
  • Maybe there is some relation to CPU cache and memory alignment (https://stackoverflow.com/questions/8547778/why-are-elementwise-additions-much-faster-in-separate-loops-than-in-a-combined-l) – stenliis Apr 20 '19 at 10:30

1 Answer


The first loop traverses r_rel in consecutive memory order, completing each row before moving on to the next: it accesses r_rel[i][j] while iterating through each value of j before incrementing i.

The second loop traverses r_rel with two moving points of access, one proceeding in consecutive memory order and the other proceeding through columns of the matrix, jumping across rows. This latter behavior is bad for cache: traversing a row-major array along its columns is notoriously bad for cache performance.
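To see that effect in isolation, one could time a plain row-order sweep against a column-order sweep of the same array. The following is a minimal sketch, not code from the question; the array size and the use of clock() are arbitrary choices, and the exact ratio will vary by machine and compiler settings:

#include <stdio.h>
#include <time.h>

#define N 4096  /* arbitrary size, large enough to exceed cache */

static float a[N][N];

int main(void)
{
    volatile float sink = 0;  /* keeps the loops from being optimized away */
    clock_t t;

    /* Touch the array once so page faults do not bias the first timing. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] = 1.0f;

    t = clock();
    for (int i = 0; i < N; i++)        /* row order: consecutive memory */
        for (int j = 0; j < N; j++)
            sink += a[i][j];
    printf("row order:    %.3f s\n", (double) (clock() - t) / CLOCKS_PER_SEC);

    t = clock();
    for (int j = 0; j < N; j++)        /* column order: strides of N floats */
        for (int i = 0; i < N; i++)
            sink += a[i][j];
    printf("column order: %.3f s\n", (double) (clock() - t) / CLOCKS_PER_SEC);

    return 0;
}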

Cache is expensive high-performance memory that is used to hold copies of recently accessed data or data that has been loaded from memory in anticipation of use in the near future. When a program uses memory in a way that often accesses data that is in cache, it may benefit from the high performance of cache memory. When a program often accesses data that is not in cache, the processor must access data in general memory, which is much slower than cache.

Typical features of cache design include:

  • Cache is organized into lines, which are units of contiguous memory. 64 bytes is a typical line size.
  • Cache lines are organized into sets. Each set is associated with certain bits from the memory address. Cache might have, for example, two or eight lines in each set.
  • Within each set, a line may be a copy of any portion of memory that has the bits assigned to its set. For example, consider an address with bits aa…aaabbbcccccc. The six c bits tell us which byte this is within a cache line. (2^6 = 64.) The three b bits tell us which cache set this byte must go into. The a bits are recorded with the cache line, remembering where in memory it belongs. (The sketch after this list shows the decomposition.)
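As a concrete illustration of that decomposition, here is a minimal sketch that splits an address into the c, b, and a fields, assuming the example geometry above (64-byte lines and three set bits); the address value is arbitrary:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint64_t address = 0x12345;                    /* arbitrary example */
    uint64_t byte_in_line = address & 0x3F;        /* low 6 bits: c */
    uint64_t set          = (address >> 6) & 0x7;  /* next 3 bits: b */
    uint64_t tag          = address >> 9;          /* remaining bits: a */
    printf("byte %llu, set %llu, tag 0x%llx\n",
           (unsigned long long) byte_in_line,
           (unsigned long long) set,
           (unsigned long long) tag);
    return 0;
}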

When a process is working through r_rel[i][j] in consecutive memory order, then, each time it accesses a member of r_rel, the one it accesses is either part of the same cache line just accessed in the previous iteration or it is in the very next cache line. In the former case, the data is already in cache and is available to the processor quickly. In the latter case, it has to be fetched from memory. (Some processors will have already initiated this fetch, as they pre-fetch data that is ahead of recent accesses to memory. They are designed to do this because such memory access is a common pattern.)

From the above, we can see that the first set of code will have to perform one load of a cache line for each cache line in r_rel. Below, we will compare this number to the similar number for the second set of code.

In the second set of code, one of the uses of r_rel proceeds the same way as in the first set of code, although it traverses only half the array. For r_rel[i][j], it performs about half the cache loads of the first code. It performs a few extra loads because of some inefficient use along the diagonal, but we can neglect that.

However, the other use of r_rel, the access to r_rel[j][i], is troublesome. As j is incremented, it proceeds down a column of r_rel, jumping from row to row.

The question does not give us many details, so I will make up some values for illustration. Suppose the elements of r_rel are four bytes each, and N, the number of elements in a row or column, is a multiple of 128. Also suppose the cache is 4,096 bytes organized into 8 sets of 8 lines of 64 bytes each, matching the three b bits above. With this geometry, the residue (remainder when divided) of the address modulo 512 determines which cache set the memory must be assigned to.

So, what happens when r_rel[j][i] is accessed is that the 64 bytes of memory around that address are brought into cache and assigned to a particular cache set. Then, when j is incremented, the memory around the new address is brought into cache and assigned to a particular cache set. These are the same cache set. Because the rows are 128 elements, and each element is four bytes, the distance between two elements that are exactly one row apart is 128•4 = 512 bytes, which is the same as the modulus used to determine which cache set a line goes into. So these two elements get assigned to the same cache set.
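A small sketch can make this concrete. Assuming four-byte elements, 512-byte rows, and an aligned array base (assumptions made only for this illustration), the set each element's line maps to can be computed from its byte offset:

#include <stdio.h>

#define N 128   /* assumed: a multiple of 128, as in the example above */

int main(void)
{
    int i = 5;  /* arbitrary fixed column */
    for (int j = 0; j < 4; j++)
    {
        /* Byte offset of r_rel[j][i] from the start of the array. */
        unsigned long byte = ((unsigned long) j * N + i) * 4;
        unsigned long set  = (byte % 512) / 64;  /* the three b bits */
        printf("r_rel[%d][%d] -> cache set %lu\n", j, i, set);
    }
    /* With 512-byte rows, every j prints the same set number. */
    return 0;
}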

That is fine at first. The cache set has eight lines. Unfortunately, the code continues iterating j. Once j has been incremented eight times, it accesses a ninth element of r_rel. Since a cache set has only eight lines, the processor must remove one of the previous lines from the set. As the code continues to iterate j, more lines are removed. Eventually, all the previous lines are removed. When the code finishes its iteration of j and increments i, it returns to near the beginning of the array.

Recall that, in the first set of code, when r_rel[0][2] was accessed, it was still in cache from when r_rel[0][1] had been accessed. However, in the second set of code, once i is large, the line holding r_rel[0][i] has long been evicted by the time r_rel[0][i+1] is needed. The processor must load it again.

For the accesses to r_rel[j][i], the second set of code gets effectively no benefit from cache. It has to load from memory for each access. Since, in this example, there are 16 elements in each cache line (four-byte elements, 64-byte lines), it performs approximately 16 times as many cache-line loads for its half of the matrix.

In total, if there are x cache lines in the entire array, the first set of code loads x cache lines, and the second set of code loads about x/2 cache lines for the r_rel[i][j] accesses and about x/2•16 = 8•x cache lines for the r_rel[j][i] accesses, a total of 8.5•x cache-line loads.

Traversing an array in column order is terrible for cache performance.

The above used example numbers. One that is most flexible is the array size, N. I assumed it was a multiple of 128. We can consider some other values. If it is a multiple of 64 instead, then r_rel[j][i] and r_rel[j+1][i] will map to different cache sets. However, r_rel[j][i] and r_rel[j+2][i] map to the same set. This means that, after eight iterations of j, only four lines in each set will have been used, so old lines will not yet need to be evicted. Unfortunately, this helps very little, because, once i exceeds 16, the code is iterating j through enough values that the cache set is again emptied of earlier lines, so each loop on j must load every cache line it encounters.

On the other hand, setting N to a value such as 73 might mitigate some of this effect. Of course, you do not want to change the size of your array just to suit the computer hardware. However, one thing you can do is make the dimensions of the array in memory N by NP even though only N by N elements are used. NP (standing for “N Padded”) is chosen to make the rows an odd size relative to cache geometry. The extra elements are simply wasted.
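Here is a minimal sketch of that padding, assuming four-byte elements; the value of NP is hypothetical and merely needs to keep the row stride from being a multiple of the 512-byte modulus in the example:

#include <stdlib.h>

#define N  128
#define NP 131   /* hypothetical padding: 131*4 = 524 bytes per row, not a
                    multiple of 512, so a long column walk is spread over
                    all the cache sets instead of funneled into one */

int main(void)
{
    /* One contiguous block with padded rows, indexed manually; only the
       first N elements of each row are used, the rest are wasted. */
    float *r_rel = malloc((size_t) N * NP * sizeof *r_rel);
    float r[N];
    if (r_rel == NULL) return 1;

    for (int i = 0; i < N; i++) r[i] = (float) i;  /* stand-in positions */

    for (int i = 1; i < N; i++)
        for (int j = 0; j < i; j++)
        {
            r_rel[i * NP + j] = r[i] - r[j];        /* note NP, not N */
            r_rel[j * NP + i] = -r_rel[i * NP + j];
        }

    free(r_rel);
    return 0;
}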

That provides a quick way to change the program to demonstrate that cache effects are making it slow, but it is usually not a preferred solution. Another approach is to tile access to the array. Instead of iterating i and j through the entire array, the array is partitioned into tiles of some size, A rows by B columns. Two outer loops iterate through all the tiles, and two inner loops iterate through the array elements within each tile.

A and B are chosen so that all of the elements of one tile will remain in cache while the inner loops proceed. For the sample numbers above, A and B would have to be eight or less, because only eight rows of the array can be held in one cache set. (There may be other considerations that would make the optimal tile size somewhat smaller. Or, for different element sizes or values of N, the optimal tile size might be larger.)

Note that tiling raises some issues in writing the code. When processing a tile on the diagonal, the code will be handling elements from two points within the same tile. When processing a tile off the diagonal, the code will be handling elements from one point within one tile and a transposed point within another tile. This may affect both the code manipulating the array indices and the bounds of the inner loops. For on-diagonal tiles, the inner loops will look similar to your j < i condition, processing a triangle. For off-diagonal tiles, the inner loops will process a full square (or rectangle if A and B differ).
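As a sketch of what such tiling might look like (not a definitive implementation; the tile size T and the assumption that N is a multiple of T are choices made only for illustration):

#include <stdio.h>

#define N 128
#define T 8     /* assumed tile size; must satisfy the cache limits above */

void fill_r_rel(float r_rel[N][N], const float r[N])
{
    /* Outer loops walk T-by-T tiles of the lower triangle; inner loops
       walk elements within one tile, so both the normal and the
       transposed writes stay within a small band of rows. */
    for (int ii = 0; ii < N; ii += T)
        for (int jj = 0; jj <= ii; jj += T)
        {
            if (ii == jj)
            {
                /* On-diagonal tile: a triangle, like the j < i loops. */
                for (int i = ii; i < ii + T; i++)
                    for (int j = jj; j < i; j++)
                    {
                        r_rel[i][j] = r[i] - r[j];
                        r_rel[j][i] = -r_rel[i][j];
                    }
            }
            else
            {
                /* Off-diagonal tile: a full square. */
                for (int i = ii; i < ii + T; i++)
                    for (int j = jj; j < jj + T; j++)
                    {
                        r_rel[i][j] = r[i] - r[j];
                        r_rel[j][i] = -r_rel[i][j];
                    }
            }
        }
}

int main(void)
{
    static float r_rel[N][N];
    float r[N];
    for (int i = 0; i < N; i++) r[i] = (float) i;
    fill_r_rel(r_rel, r);
    printf("r_rel[2][1] = %g\n", r_rel[2][1]);  /* expect 1 */
    return 0;
}

With the sample numbers above, T = 8 keeps a tile's rows within one cache set's eight lines; other geometries may allow larger tiles.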

Eric Postpischil
  • Thorough analysis of cache and how it operates. Is it possible though to prove that this is indeed the problem of the OP? A possible problem, yes indeed. But is it the actual problem? – Nikos M. Apr 21 '19 at 11:53
  • @NikosM.: These behaviors are well known to practitioners of optimization. There is no question that cache properties cause these performance issues if the array exceeds cache size and `N` is a multiple of a modest power of two. If it were necessary to demonstrate it, there are ways to proceed, by using performance counters built into the processor that will reveal cache transactions and other events or by modifying the program in various ways to show how changes are related to cache properties. OP would have to supply more information, including the value of `N` and the type of `r_rel`. – Eric Postpischil Apr 21 '19 at 12:18