0

I am working on a large fortran code and before to compile with fast options (in order to perform test on large database), I usually compile with "warnings" options in order to detect and backtrace all the problems.

So with the gfortran -fbacktrace -ffpe-trap=invalid,zero,overflow,underflow -Wall -fcheck=all -ftrapv -g2 compilation, I get the following error:

Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.
Backtrace for this error:
 #0  0x7fec64cdfef7 in ???
 #1  0x7fec64cdf12d in ???
 #2  0x7fec6440e4af in ???
 #3  0x7fec64a200b4 in ???
 #4  0x7fec649dc5ce in ???
 #5  0x4cf93a in __f_mod_MOD
    at /f_mod.f90:132
 #6  0x407d55 in main_loop_
    at main.f90:419
 #7  0x40cf5c in main_prog
    at main.f90:180
 #8  0x40d5d3 in main
    at main.f90:68

And the portion of the code f_mod.f90:132 is containing a where loop:

! Compute s parameter
do i = 1, Imax
   where (dprim .ne. 1.0)
      s(:,:,:, :) = s(:,:,:, :) +vprim(:,:,:, i,:)*dprim(:,:,:, :)*dprim(:,:,:, :)/(1.0 -dprim(:,:,:, :))
   endwhere
enddo

But I do not see any mistake here. All the other locations are the calls of the subroutine leading to this part. And of course, since it is a SIGFPE error, I have to problem at the execution when I compile gfortran -g1. (I use gfortran 6.4.0 on linux)

Moreover, this error appears and disappears with the modifications of completely different part of the code. Thus, the problem comes from this where loop ? Or from somewhere else and the backtrace is wrong ? If it is the case how can I find this mistake?

EDIT:

Since, I can not reproduce this error in a minimal example (they are working), I think that the problem comes for somewhere else. But how to find the problem in a large code ?

R. N
  • 707
  • 11
  • 31
  • 1
    Please post the actual messages and stacktraces and try to isolate a [mcve]. – Vladimir F Героям слава May 23 '18 at 15:52
  • 1
    *"Moreover, this error appears and disappears with the modifications of completely different part of the code."* This looks suspicious and calls for a real MCVE even more. – Vladimir F Героям слава May 23 '18 at 15:53
  • I edited my question with the complete message and backtraces. But I am not able to produce a MCVE nor to isolate the problem. This is why I came for help. I know it is not easy to help when there is no MCVE but I am running out of ideas. – R. N May 23 '18 at 16:26
  • I would expand the WHERE statement. The masking could hide your issue. At least to debbug the code. – cauchi May 23 '18 at 16:27
  • Print the actual sizes of the different arrays used in the where statement (i.e. print the shape and preferably also the lbounds and ubounds)? – albert May 23 '18 at 16:44
  • When constructing your MCVE, be mindful that floating point exceptions may be raised during the next floating point operation executed after the problematic operation. Have a look at what your code is doing in the statements preceding the where. – IanH May 23 '18 at 20:43
  • If your syntax causes your compiler to create temporary arrays and this sometimes overflows stack you might fix that. – tim18 May 24 '18 at 00:07
  • All bounds, shape, array seems ok. And the previous operations does not bring me warnings. – R. N May 24 '18 at 10:12
  • Though a wild guess, isn't "where (dprim .ne. 1.0)" an unstable mask operation because of comparison of floating-point numbers? Does "where ( abs( dprim - 1.0 ) > eps )" change something, possibly? – roygvib May 24 '18 at 11:37
  • @roygvib I do not think that the comparison is really unstable. But "where ( abs( dprim - 1.0 ) > eps )" changes something. It removes the error for epsilon greater than 0.035. – R. N May 24 '18 at 12:42

1 Answers1

5

As the code is dying with a SIGFPE, use each of the individual possible traps to learn if it is a FE_DIVBYZERO, FE_INVALID, FE_OVERFLOW, or FE_UNDERFLOW. If it is an underflow, change your mask to '1 - dprim .ne. 0'.

PS: Don't use array section notation when a whole array reference can be used instead.

PPS: You may want to compute dprim*drpim / (1 - dprim) outside of the do-loop as it is loop invariant.

Steve
  • 301
  • 1
  • 3
  • I am not sure this answers the question. What does the OP do wrong? – Vladimir F Героям слава May 23 '18 at 18:52
  • @Vladimir, with out access to the code, one can only guess at the cause of the SIGFPE. If OP does -ffpe-trap=invalid, this traps for an NaN in the expression. OP can now look for how one of s, dprim, vprim got a NaN value. Likewise, what I suspect is the SIGFPE is caused by -ffpe-trap=underflow. Again without access to code, I doubt OP actually needs to worry about trapping on underflow as gfortran supports gradual undeflow unless one uses an IEEE module to change the mask. – Steve May 23 '18 at 19:17
  • Does underflow print *erroneous arithmetic operation*? Anyway, that was my point, why answer when no-one can know the real answer? You just risk downvotes or flags. – Vladimir F Героям слава May 23 '18 at 19:38
  • 1
    The error message for SIGFPE is coming from the operating system, not gfortran, so have no idea. My only interest is helping OP understand the his/her problem. I thought "Don't use options that you don't understand" was an inappropriate response. – Steve May 23 '18 at 19:54
  • It seems to be an underflow issue. For now I solved it by calling ieee_set_halting_mode to set off the traps then do the where loop before to call again ieee_set_halting_mode to set on. Is there a better solution ? – R. N May 24 '18 at 10:15
  • 1
    @RN, given Vladimir's objections to my giving you ideas on how to isolate the problem, I hesitate to offer any additional help. Just ask yourself if underflow is an actual issue for my algorithm? If no, then remove the flag from -ffpe-trap=. – Steve May 24 '18 at 18:07