
I am trying to run a vortex simulation code in parallel using OpenMP. It is similar to a particle simulation: at each time step, the position of a vortex at the next time step has to be computed from its velocity, which is determined by the positions of all the other vortices at the current time step. Vortices are deleted once they leave the domain. I compare the number of vortices at each time step between the parallel and serial versions of the code, and run each version multiple times.

For the serial runs, the vortex counts match exactly at every time step. For the parallel case, every run matches the serial case for a few tens of time steps, after which each parallel run shows a difference but stays within a 7-10% error bound of the serial case (as can be seen in the result linked below). I know this may be due to round-off error in the parallel case, arising from the different order of computational steps when the work is distributed among threads, but should the error really be as high as 10%?

I have only used the reduction clause in a parallel do construct. The only parallel region in the whole code is inside a function vblob(), which sits in a module that I call from the main code. The functions called within vblob(), ixi() and fxi(), are defined outside this module.

function vblob(blobs, xj, gj)
    ! Velocity induced at the point xj by the vortex blobs.
    ! del, j and pi are host-associated from the enclosing module;
    ! ixi() and fxi() are defined outside this module.
    complex(8), intent(in) :: blobs(:,:), xj
    complex(8) :: delxi, delxic, di, gvic, xi
    real(8), intent(in) :: gj
    real(8) :: vblob(2)
    integer :: p

    gvic = 0.0; delxi = 0.0; delxic = 0.0; di = 0.0; xi = 0.0
    ! p is the loop index and therefore implicitly private;
    ! gvic is accumulated through the reduction clause.
    !$omp parallel do private(xi,delxic,delxi,di) shared(xj) reduction(+:gvic)
    do p = 1, size(blobs,1)
      xi = ixi(blobs(p,1))              ! mapped position of blob p
      delxic = xj - conjg(xi)           ! displacement from the conjugate (image) of xi
      delxi = xj - xi                   ! displacement from xi itself
      di = del*fxi(xi)
      gvic = gvic + real(blobs(p,2))*1/delxic
      if (abs(delxi) .gt. 1.E-4) then   ! skip nearly singular contributions
        gvic = gvic + (-1)*real(blobs(p,2))*1/delxi
      end if
    end do
    !$omp end parallel do
    gvic = j*gvic*fxi(xj)/(2*pi)
    vblob(1) = real(gvic)
    vblob(2) = -imag(gvic)

  end function vblob

If the way I have constructed the parallel code were wrong, errors should show up within the first few time steps, right?

(As can be seen in the linked result, 'blobs' and 'sheets' are just types of vortex elements, and the blue line is the total number of elements. P and S stand for parallel and serial respectively, and R stands for run. The solid plot markers are the serial code and the hollow ones are the three runs of the parallel code.)

EDIT: When I change the numerical precision of my variables to real(4) instead, the divergence in the results happens at an earlier time step than in the real(8) case above. So it is clearly a round-off error issue.

TL;DR: Has anyone else seen such a result over a range of time steps, where the parallel code matches the serial code for the first few time steps and then diverges?

  • *I want to clarify this with anyone else who has seen such a result over a range of time steps, where the parallel code matches for the first few time steps and then diverges?* That's kind of the definition of *divergence*, it starts the same then diverges after a while. I don't immediately see anything wrong with the code, and without a lot more knowledge of your application it's very difficult to state whether or not it is 'wrong' in any sense. But its behaviour is within my experience (a decade or two of computational EM on HPC). – High Performance Mark Jul 11 '20 at 11:58
  • ... you might want to try some sensitivity analysis of the codes, and also consider that the serial code is just one sample in the entire space of potential paths for your program's execution, it is not (necessarily) the *ground truth*. – High Performance Mark Jul 11 '20 at 12:00
  • It's great to correspond with an experienced person like you, Mark. I'm quite new to the parallel computing field. I know that even the serial code contains round-off errors, and if it does not match experimental results with reasonable accuracy, I will know the solution is wrong. But in your experience, have you seen such a solution in the parallel case, where errors build up over time (over many iterations, where the variables in each iteration are derived from the previous one)? Is my hypothesis that this could be a round-off error issue correct? Also, looking at the image, what are your thoughts? – Joel Vasanth Jul 11 '20 at 12:20
  • *But in your experience, have you seen such a solution in the parallel case, where errors build up over time?* Yes. *Is my hypothesis that this could be a round-off error issue correct?* It certainly could be. *Also, looking at the image, what are your thoughts?* I never follow off-site links, there are some shocking images on the Internet. – High Performance Mark Jul 11 '20 at 13:50
  • You could make gvic an array, and then sort it before summing it from low to large values. That would at least give a way to understand if it is a precision problem... but it obviously will not speed things up... it just gives clues as to whether the parallel gvic is the same as the serial version, and whether a sorted sum is equal to an unsorted sum. Or make gvic a quad precision value to sum into... which is an easier place to start. – Holmz Jul 12 '20 at 01:30
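
As a rough sketch of the diagnostic suggested in the last comment above (the routine name compare_sums is made up, and the complex(16) quad-precision kind is compiler-dependent, though gfortran supports it): collect the individual terms that go into gvic into an array, then compare a plain sum, a smallest-to-largest sum and a quad-precision sum. If the three disagree by roughly as much as the serial and parallel runs do, the gap is round-off rather than a logic error.

subroutine compare_sums(terms)
  implicit none
  complex(8), intent(in) :: terms(:)     ! the individual contributions to gvic
  complex(8)             :: plain, sorted
  complex(16)            :: quad         ! quad precision (compiler-dependent kind)
  integer, allocatable   :: idx(:)
  integer                :: i, k, tmp, n

  n   = size(terms)
  idx = [(i, i = 1, n)]                  ! 1, 2, ..., n

  ! simple insertion sort of the indices by term magnitude, smallest first
  do i = 2, n
    k = i
    do while (k > 1)
      if (abs(terms(idx(k-1))) <= abs(terms(idx(k)))) exit
      tmp = idx(k-1); idx(k-1) = idx(k); idx(k) = tmp
      k = k - 1
    end do
  end do

  plain = 0; sorted = 0; quad = 0
  do i = 1, n
    plain  = plain  + terms(i)           ! original order
    sorted = sorted + terms(idx(i))      ! ascending magnitude
    quad   = quad   + terms(i)           ! higher working precision
  end do

  print *, 'plain  sum :', plain
  print *, 'sorted sum :', sorted
  print *, 'quad   sum :', cmplx(quad, kind=8)
end subroutine compare_sums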

1 Answer


Your code essentially sums up a lot of terms in gvic. Floating-point arithmetic is not associative, that is, (a+b)+c is not the same as a+(b+c) due to rounding. Also, depending on the values and the signs of the terms, there may be a serious loss of precision in each operation. See "What Every Computer Scientist Should Know About Floating-Point Arithmetic" (https://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html) for a really mandatory read on the subject.

While the sequential loop computes (given no clever compiler optimisations):

gvic = (...((((g_1 + g_2) + g_3) + g_4) + g_5) + ...)

where g_i is the value added to gvic by iteration i, the parallel version computes:

gvic = t_0 + t_1 + t_2 + ... t_(#threads-1)

where t_i is the accumulated private value of gvic in thread i (threads in OpenMP are 0-numbered even in Fortran). The order in which the different t_is are reduced is unspecified. The OpenMP implementation is free to choose whatever it deems fine. Even if all t_is are summed in order, the result will still differ from the one computed by the sequential loop. Unstable numerical algorithms are exceptionally prone to producing different results when parallelised.
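
To make this concrete, here is a minimal, self-contained sketch (not taken from your code) that sums the same double-precision terms forwards, backwards and through an OpenMP reduction. Built with OpenMP enabled (e.g. gfortran -fopenmp), the three results typically differ in the last digits, and the reduction result can change with the number of threads:

program sum_order_demo
  implicit none
  integer, parameter :: n = 100000
  real(8) :: terms(n), s_forward, s_reverse, s_omp
  integer :: i

  ! terms of steadily decreasing magnitude make the rounding differences visible
  do i = 1, n
    terms(i) = 1.0d0 / real(i, 8)**2
  end do

  s_forward = 0.0d0                  ! (...((g_1 + g_2) + g_3) + ...)
  do i = 1, n
    s_forward = s_forward + terms(i)
  end do

  s_reverse = 0.0d0                  ! same terms, opposite order
  do i = n, 1, -1
    s_reverse = s_reverse + terms(i)
  end do

  s_omp = 0.0d0                      ! per-thread partial sums, then t_0 + t_1 + ...
  !$omp parallel do reduction(+:s_omp)
  do i = 1, n
    s_omp = s_omp + terms(i)
  end do
  !$omp end parallel do

  print '(a,es25.17)', 'forward : ', s_forward
  print '(a,es25.17)', 'reverse : ', s_reverse
  print '(a,es25.17)', 'openmp  : ', s_omp
end program sum_order_demo

In exact arithmetic all three sums are equal; in finite precision they generally are not, and that is the same mechanism that makes your parallel vortex counts drift away from the serial ones.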

This is something you can hardly avoid completely, but instead learn to control or simply live with its consequences. In many cases, the numerical solution to a problem is an approximation anyway. You should focus on conserved or statistical properties. For example, an ergodic molecular dynamics simulation may produce a completely different phase trajectory in parallel, but values such as the total energy or the thermodynamic averages will be pretty close (unless there is some serious algorithmic error or really bad numerical instability).
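
Applied to your vortex code, one such check could be the total circulation carried by the blobs. A hypothetical helper (assuming, as in vblob() above, that the strength of blob p is stored in real(blobs(p,2))) might look like the sketch below; comparing such an integral quantity between serial and parallel runs says more than comparing the raw element counts.

function total_circulation(blobs) result(gamma_tot)
  implicit none
  complex(8), intent(in) :: blobs(:,:)
  real(8) :: gamma_tot
  integer :: p

  gamma_tot = 0.0d0
  do p = 1, size(blobs, 1)
    gamma_tot = gamma_tot + real(blobs(p, 2))   ! strength of blob p, as used in vblob()
  end do
end function total_circulation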

A side note - you are actually lucky to enter this field now, when most CPUs use standard 32- and 64-bit floating-point arithmetic. Years ago, when x87 was a thing, floating-point operations were done with 80-bit internal precision and the end result would depend on how many times a value left and re-entered the FPU registers.

Hristo Iliev
  • A great answer, which is just missing the (mandatory in this context :-)) reference to "What Every Computer Scientist Should Know About Floating-Point Arithmetic" [https://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html] which has sections on summation. – Jim Cownie Jul 13 '20 at 07:42
  • 1
    Indeed, I was simply too lazy to seek out a copy of the article that is still online :) Thanks for the link, will integrate it in the answer. – Hristo Iliev Jul 13 '20 at 08:19
  • @JimCownie, sadly your link is bust too. Edit: nvm, there was an extra character screwing the URL on my iPad. – Hristo Iliev Jul 13 '20 at 08:22
  • Ah, sorry, yes, the link has pulled in the closing ']'. If you take that off you can find the document at https://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html – Jim Cownie Jul 14 '20 at 07:32