
I am new to profiling code with valgrind and cachegrind, and I recently started using these tools to see how my code was doing with regard to cache utilization. I found that a simple if-statement appears to cause a cache miss almost every single time it is executed. As an example, I make use of the following derived types in my Fortran program:

type :: particle
    real(pr), dimension(3) :: r = 0.0_pr ! position
    real(pr), dimension(3) :: p = 0.0_pr ! momentum
end type particle

type :: rcell ! position cell
    integer, dimension(6) :: bpoints = 0 ! cell grid points
    integer :: np = 0 ! number of particles in cell
    type(particle), dimension(50) :: parts ! particles in cell
end type rcell

type(rcell), dimension(:), allocatable :: rbin ! position space bins
allocate(rbin(100))

This example represents 100 cells in position space, each of which can contain up to 50 particles described by their position and momentum. The code uses a simple particle mover to update the position and momentum of the particles at a given time step. To implement this, I use a loop like the following:

do i = 1, numcells
    if (rbin(i)%np == 0) cycle ! skip cells with no particles
    ...
end do

By including the if-statement I figured that I would be speeding the code up by cycling the loop when there are no particles in a given cell. However, I did some profiling of my code using valgrind with the cachegrind tool and found that this simple if-statement almost always results in a cache miss. The following is an example of the results for this if-statement using cg_annotate with the --auto=yes option enabled:

Ir: 21,600,000
I1mr: 0
ILmr: 0
Dr: 4,320,000
D1mr: 4,319,057
DLmr: 4,318,979
Dw: 0
D1mw: 0
DLmw: 0

This appears to be a cache miss almost every time it is executed. I do this a lot in my code when looping over cells, and I think it is causing a major slowdown. Is this a consequence of using derived types? Is there a way to improve the cache utilization here, or with derived types in general?

For completeness, I am compiling with gfortran 4.8.3 and using the following flags: -g -O3 -ffast-math -mcmodel=medium -fdefault-real-8
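
To put a number on how far apart consecutive `rbin(i)%np` values sit in memory, the following is a minimal sketch that prints the storage size of one cell. It assumes `pr` is a 64-bit real kind (the definition of `pr` is not shown above; this guess is based on the `-fdefault-real-8` flag) and uses the Fortran 2008 intrinsic `storage_size`. Under that assumption each cell holds at least 50 × 6 × 8 = 2400 bytes of particle data, so every `np` lands on a different cache line; the exact figure printed depends on the compiler's padding.

```fortran
program cell_stride
    use, intrinsic :: iso_fortran_env, only: real64
    implicit none
    ! Assumption: pr is a 64-bit real kind; the question does not show its definition.
    integer, parameter :: pr = real64

    type :: particle
        real(pr), dimension(3) :: r = 0.0_pr ! position
        real(pr), dimension(3) :: p = 0.0_pr ! momentum
    end type particle

    type :: rcell ! position cell
        integer, dimension(6) :: bpoints = 0 ! cell grid points
        integer :: np = 0 ! number of particles in cell
        type(particle), dimension(50) :: parts ! particles in cell
    end type rcell

    type(rcell), dimension(:), allocatable :: rbin
    integer :: cell_bytes

    allocate(rbin(100))

    ! storage_size returns the size in bits of one element of rbin
    cell_bytes = storage_size(rbin) / 8

    print '(a,i0)', 'bytes per cell: ', cell_bytes
    print '(a,i0)', 'bytes for all cells: ', cell_bytes * size(rbin)
end program cell_stride
```

If the total is larger than the simulated cache, work done on the particles of one pass can evict the lines holding `np` before the next pass reads them.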

cstraats
  • If you would first create a list of cells with no elements and one with elements, you could use two loops instead and save the `if` statement. Does that speed up your code (significantly)? – Alexander Vogt May 12 '15 at 18:26
  • Basically, you mean use one loop to create an indexing array such that I know which cells contain particles, and then loop over the indexing array? This could certainly provide some speed up, and is worth a try. I still wonder if there is a way to improve the usage of the if-statement with the derived type. Maybe with a pointer? – cstraats May 12 '15 at 22:50
  • The cache miss may come from what you are doing inside the loop using rbin(i). If you do a lot of work, you may need more than the cache can hold for this. In that case, when you need rbin(i+1), you will get a cache miss. Try doing this: ``icount = 0 ; do i = 1, numcells ; if (rbin(i)%np == 0) cycle ; icount = icount +1 ; end do ; print *, icount ;`` to see if you still have the cache miss. – Anthony Scemama May 13 '15 at 08:00
  • @AnthonyScemama The cache miss goes away on the if-statement if I replace the loop contents with your suggestions. However, the `icount = icount+1` generates a cache miss every time. If the cache miss is based on the work I am doing in the loop, is there a way to improve cache utilization here? – cstraats May 13 '15 at 15:03
  • OK, good. Cachegrind is a cache simulator. In this particular case, I wouldn't trust this result because on a physical CPU icount would stay in a register until the end of the loop. I don't see any reason why there can be a cache miss. Could you look at the hardware counters on a physical CPU using ``perf`` or ``Likwid``? – Anthony Scemama May 13 '15 at 19:17
  • To improve the cache utilization, if the loop does many different things you can split it into multiple loops. When you split the loop, do it such that the number of different memory locations is minimal in each loop. For example, if you use 12 different arrays in your main loop, you can split it into 4 loops where each loop accesses only 3 arrays. However, this may not always be possible. Also, try to find predictable (contiguous if possible) memory access patterns and avoid indirections like ``A(B(i))``. Good luck! – Anthony Scemama May 13 '15 at 19:21
  • If you include the SEQUENCE keyword in fortran types, it forces the contents to be contiguous. This may help with your aim of "improving cache utilization ... with derived types in general" by making it more predictable. – Ed Smith May 14 '15 at 10:24
  • Thank you for the suggestions. I will play around with this and see what I can learn. @AnthonyScemama Why is it bad to do things like `A(B(i))` ? I do this frequently to avoid using an excess of dummy variables. – cstraats May 14 '15 at 20:17
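
The indexing-array idea raised in the first two comments could be sketched roughly as follows. This is only an illustration, not a tested fix for the code in the question: the names `active` and `nactive` are hypothetical, `pr` is assumed to be a 64-bit real kind, and the test data (one particle in every tenth cell) is made up.

```fortran
program active_cells
    use, intrinsic :: iso_fortran_env, only: real64
    implicit none
    ! Assumption: pr is a 64-bit real kind; the question does not show its definition.
    integer, parameter :: pr = real64
    integer, parameter :: numcells = 100

    type :: particle
        real(pr), dimension(3) :: r = 0.0_pr ! position
        real(pr), dimension(3) :: p = 0.0_pr ! momentum
    end type particle

    type :: rcell ! position cell
        integer, dimension(6) :: bpoints = 0 ! cell grid points
        integer :: np = 0 ! number of particles in cell
        type(particle), dimension(50) :: parts ! particles in cell
    end type rcell

    type(rcell), dimension(:), allocatable :: rbin
    integer, dimension(:), allocatable :: active ! hypothetical: indices of non-empty cells
    integer :: i, k, nactive

    allocate(rbin(numcells))
    allocate(active(numcells))

    ! Made-up test data: one particle in every tenth cell
    do i = 10, numcells, 10
        rbin(i)%np = 1
    end do

    ! Pass 1: record the indices of the cells that contain particles
    nactive = 0
    do i = 1, numcells
        if (rbin(i)%np > 0) then
            nactive = nactive + 1
            active(nactive) = i
        end if
    end do

    ! Pass 2: visit only the non-empty cells, with no if-statement in the loop
    do k = 1, nactive
        i = active(k)
        ! the particle mover for rbin(i) would go here
    end do

    print '(a,i0,a,i0)', 'active cells: ', nactive, ' of ', numcells
end program active_cells
```

With the made-up data this prints `active cells: 10 of 100`. Note that `rbin(active(k))` is itself an indirection of the `A(B(i))` kind mentioned above, and the list has to be rebuilt whenever particles move between cells, so whether this helps would need to be measured.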

0 Answers