0

I have a (for me) very weird segmentation error. At first, I thought it was interference between my 4 cores due to openmp, but removing openmp from the equation is not what I want. It turns out that when I do, the segfault still occurs.

What's weird is that if I add a print or write anywhere within the inner-do, it works.

subroutine histogrambins(rMatrix, N, L,  dr, maxBins, bins)
    implicit none;

    double precision, dimension(N,3), intent(in):: rMatrix;
    integer, intent(in) :: maxBins, N;
    double precision, intent(in) :: L, dr;

    integer, dimension(maxBins, 1), intent(out)  :: bins;

    integer :: i, j, b;  
    double precision, dimension(N,3) :: cacheParticle, cacheOther;
    double precision :: r;
    do b= 1, maxBins
        bins(b,1) = 0;
    end do 
    !$omp parallel do &
    !$omp default(none) &
    !$omp firstprivate(N, L, dr, rMatrix, maxBins) &
    !$omp private(cacheParticle, cacheOther, r, b) &
    !$omp shared(bins) 
    do i = 1,N 
        do j = 1,N
            !Check the pair distance between this one (i) and its (j) closest image 
            if (i /= j) then 
                !should be faster, because it doesn't have to look for matrix indices 
                cacheParticle(1, :) = rMatrix(i,:); 
                cacheOther(1, :) = rMatrix(j, :);  

                call inbox(cacheParticle, L);
                call inbox(cacheOther, L);  
                call closestImage(cacheParticle, cacheOther, L);    
                r = sum( (cacheParticle - cacheOther) * (cacheParticle - cacheOther) ) ** .5; 
                if (r /= r) then
                    ! r is NaN 
                     bins(maxBins,1) = bins(maxBins,1) + 1;
                else   
                     b = floor(r/dr);
                     if (b > maxBins) then
                         b = maxBins;
                     end if     

                     bins(b,1) = bins(b,1) + 1;
                end if
            end if
        end do
    end do
    !$omp end parallel do
end subroutine histogramBins 

I enabled -debug-capi in the f2py command:

f2py --fcompiler=gfortran --f90flags="-fopenmp -fcheck=all" -lgomp --debug-capi --debug -m -c modulename module.f90; 

Which gives me this:

debug-capi:Fortran subroutine histogrambins(rmatrix,&n,&l,&dr,&maxbins,bins)'
At line 320 of file mol-dy.f90
Fortran runtime error: Aborted

It also does a load of other checking, listing arguments given and other subroutines called and so on.

Anyway, the two subroutines called in are both non-parallel subroutines. I use them in several other subroutines and I thought it best not to call a parallel subroutine with the parallel code of another subroutine. So, at the time of processing this function, no other function should be active.

What's going on here? How can adding "print *, ;"" cause a segfault to go away?

Thank you for your time.

Daimonie
  • 83
  • 11
  • Weirdly, it seems that the segmentation fault goes away if I add a "if r > L" maxBins. Really odd; it's already capped at maxBins isn't it? – Daimonie Mar 01 '15 at 19:58
  • 4
    *What's weird is that if I add a print or write anywhere within the inner-do, it works.* This is frequently a symptom of either failure to match procedure dummy and actual arguments or of accessing an element of an array outside its declared bounds. Re-compile and re-run with compile-time checking for the former and run-time checking for the latter. Oh, and while I'm commenting, you don't need ';' at the end of a statement when it is the only statement on a line. Have you programmed in C before ? – High Performance Mark Mar 01 '15 at 20:00
  • 5
    Actually, the fact that a segfault is affected by a print statement is *extremely common*. It is a sign of stack corruption. Double check that you are calling all your procedures right. – Vladimir F Героям слава Mar 01 '15 at 20:01
  • Huh. After outputting all input arguments, it somehow has rMatrix as a list or something? Instead of outputting the matrix I expected, it just outputs a load of tab-seperated numbers. That'd be the dummy/actual arguments, I guess. I did program C++ and PHP before, so I decided to stick to my habit of semicolon's. – Daimonie Mar 01 '15 at 20:12
  • I seem to be unable to find any errors in the calling of the subroutine. I've printed shape(rMatrix) and similar, but I did not find anything that was off. rMatrix is 864x3 and bins is 100x1, as expected. – Daimonie Mar 01 '15 at 20:21
  • You should also check the *kind* of the arguments. I once had issues with some old code which tried to pass a `real` variable to a `double precision` dummy argument - that was fun! [Am I glad that we have ported everything to modules now;-)] – Alexander Vogt Mar 01 '15 at 20:30
  • all of the arguments are used in other functions as well (exact same variable), so they should have the right types. In fact, I used an earlier skeleton of another openmp-using function that does most of the main work for it. – Daimonie Mar 01 '15 at 20:47
  • I removed the sum( .... * ....) ** 0.5 function and replaced it with a do k =1, 3 (component wise calculation). It no longer gives a segfault. – Daimonie Mar 01 '15 at 21:24

1 Answers1

2

It's not unusual for print statements to impact - and either create or remove the segfault. The reason is that they change the way memory is laid out to make room for the string being printed, or you will be making room for temporary strings if you're doing some formatting. That change can be sufficient to cause a bug to either appear as a crash for the first time, or to disappear.

I see you're calling this from Python. If you're using Linux - you could try following a guide to using a debugger with Fortran called from Python and find the line and the data values that cause the crash. This method also works for OpenMP. You can also try using GDB as the debugger.

Without the source code to your problem, I don't think you're likely to get an "answer" to the question - but hopefully the above ideas will help you to solve this yourself.

Using a debugger is (in my experience) considerably less likely to have this now-you-see-it-now-you-don't behaviour than with print statements (almost certainly so if only using one thread).

David
  • 756
  • 5
  • 10
  • Thanks! I will look into it some more in the next assigment. The function miraculously started working when I rewrote it. If I remember correctly, it was calling some weird things (b=0, for instance). – Daimonie Mar 03 '15 at 13:58