I use Fortran for scientific computation on an HPC cluster. When we submit a job to an HPC job scheduler, we also specify a wall-clock time limit for it. However, if the job is still writing output data when the time is up, it is terminated, which leaves 'NUL' values in the data and causes trouble for post-processing.

So, can we build a mechanism into the program so that the job stops itself gracefully some time before the wall-clock limit expires?

Related Question: How to skip reading "NUL" value in MATLAB's textscan function?

zlin
  • http://gcc.gnu.org/onlinedocs/gcc-4.5.2/gfortran/DATE_005fAND_005fTIME.html – agentp Mar 28 '17 at 00:23
  • You can use system_clock(), but I don't really understand what happens, why there is NULL, and what you want to do. – Vladimir F Героям слава Mar 28 '17 at 06:25
  • A comment, as I'm guessing what you mean. If you want to automatically detect the time limit of your batch job and automatically shut down as you approach it, then no, there is no standard way. You will first have to read your batch system's documentation to work out how to find the limit for your job, then work out a suitable way to pass it to your Fortran program (note spelling, it's been lower case for over 25 years), and then how to detect that you are close to the limit and how to shut down "cleanly". – Ian Bush Mar 28 '17 at 07:07
  • Oh, and I'm surprised that you get the above when the job crashes, though of course anything is possible. The only time I've seen something similar is when multiple processes were accessing a direct access file on a Lustre file system on a Cray, where we did get strange NULs in the file, presumably due to the file caching getting confused (of course, multiple processes accessing a file via the Fortran I/O mechanism is illegal). – Ian Bush Mar 28 '17 at 08:12
  • Thanks Ian, now I get what the point was. – Vladimir F Героям слава Mar 28 '17 at 10:54

1 Answer

After realizing what you are asking, I found that I had implemented similar functionality in my program very recently (commit https://bitbucket.org/LadaF/elmm/commits/f10a1b3421a3dd14fdcbe165aa70bf5c5001413f). I still have to set the time limit manually, though.

The most important part:

time_stepping%clock_time_limit is the time limit in seconds. Count the number of system clock ticks corresponding to that:

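    ! Ticks per second and the largest value the counter can hold.
    call system_clock(count_rate = timer_rate)
    call system_clock(count_max = timer_max_count)

    ! Time limit in ticks, capped just below count_max so it stays representable.
    ! knd and dbl are kind parameters defined elsewhere in the program.
    timer_count_time_limit = int( min(time_stepping%clock_time_limit &
                                        * real(timer_rate, knd),  &
                                      real(timer_max_count, knd) * 0.999_dbl) &
                                , dbl)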

Start the timer:

    call system_clock(count = time_steps_timer_count_start)

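Check the timer every check_period time steps and leave the main loop, with error_exit set to .true., once the time is up:

    if (mod(time_step, time_stepping%check_period) == 0) then
      if (master) then
        ! Read the current tick count before comparing it with the start value.
        call system_clock(count = time_steps_timer_count_2)
        error_exit = time_steps_timer_count_2 - time_steps_timer_count_start > timer_count_time_limit
        if (error_exit) write(*,*) "Maximum clock time exceeded."
      end if

      ! Broadcast error_exit to the other processes so all ranks exit together.
      ! Assumption: master is rank 0 of MPI_COMM_WORLD and ierr is an integer.
      call MPI_Bcast(error_exit, 1, MPI_LOGICAL, 0, MPI_COMM_WORLD, ierr)

      if (error_exit) exit
    end if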
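
The steps above can be collected into a small module. This is only a minimal sketch, not the code from the linked commit: the module name walltime_guard and the functions limit_in_ticks and time_is_up are made up for illustration, and it assumes the int64 and real64 kinds from iso_fortran_env rather than the knd and dbl kinds used above. It also allows for a single wrap-around of the counter, which the snippet above does not handle.

```fortran
module walltime_guard
  use, intrinsic :: iso_fortran_env, only: int64, real64
  implicit none
  private
  public :: limit_in_ticks, time_is_up

contains

  ! Convert a limit in seconds to system_clock ticks, capped just below
  ! count_max so that the limit stays representable.
  pure function limit_in_ticks(limit_seconds, rate, max_count) result(ticks)
    real(real64),   intent(in) :: limit_seconds
    integer(int64), intent(in) :: rate, max_count
    integer(int64) :: ticks

    ticks = int(min(limit_seconds * real(rate, real64), &
                    real(max_count, real64) * 0.999_real64), int64)
  end function limit_in_ticks

  ! Return .true. once more than limit_ticks ticks have elapsed since start.
  ! A negative difference is treated as one wrap-around of the counter.
  pure function time_is_up(start, now, limit_ticks, max_count) result(up)
    integer(int64), intent(in) :: start, now, limit_ticks, max_count
    logical :: up
    integer(int64) :: elapsed

    elapsed = now - start
    if (elapsed < 0) elapsed = (elapsed + max_count) + 1
    up = elapsed > limit_ticks
  end function time_is_up

end module walltime_guard
```

To use it, query count_rate and count_max once with system_clock, compute the limit with limit_in_ticks, and store the starting count. Then, every few time steps, call system_clock(count = now) and leave the loop when time_is_up(start, now, limit, max_count) returns .true.; with MPI, evaluate this on one rank and broadcast the result.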

Now, you may want to get the time limit from your scheduler automatically. How to do that varies between job schedulers. There is often an environment variable such as $PBS_WALLTIME; see "Get walltime in a PBS job script", but check your scheduler's manual.

You can read this variable with GET_ENVIRONMENT_VARIABLE(). It returns the value as a character string, so you need to convert it to a number yourself.
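
As a rough sketch of that last step, the module below reads an environment variable and converts it to seconds. It assumes the scheduler exports the limit as a plain positive integer number of seconds; a scheduler that uses another format, such as HH:MM:SS, would need a different parser. The names walltime_env, parse_seconds and walltime_from_env are hypothetical, and the 64-character buffer is an arbitrary choice.

```fortran
module walltime_env
  implicit none
  private
  public :: parse_seconds, walltime_from_env

contains

  ! Convert text such as "3600" to a positive integer number of seconds.
  ! ok is .false. if the text is blank, not an integer, or not positive.
  subroutine parse_seconds(text, seconds, ok)
    character(len=*), intent(in)  :: text
    integer,          intent(out) :: seconds
    logical,          intent(out) :: ok
    integer :: ios

    seconds = 0
    ok = .false.
    if (len_trim(text) == 0) return

    read(text, *, iostat=ios) seconds
    ok = (ios == 0 .and. seconds > 0)
    if (.not. ok) seconds = 0
  end subroutine parse_seconds

  ! Read the environment variable called name and parse it as seconds.
  ! ok is .false. if the variable is missing, truncated, or not parsable.
  subroutine walltime_from_env(name, seconds, ok)
    character(len=*), intent(in)  :: name
    integer,          intent(out) :: seconds
    logical,          intent(out) :: ok
    character(len=64) :: value
    integer :: stat

    seconds = 0
    ok = .false.
    call get_environment_variable(name, value, status=stat)
    if (stat /= 0) return   ! 1: variable not set, -1: value truncated
    call parse_seconds(value, seconds, ok)
  end subroutine walltime_from_env

end module walltime_env
```

A call such as walltime_from_env("PBS_WALLTIME", limit, ok) then either yields the limit or leaves ok set to .false., in which case the program can fall back to the manually set value. It is probably wise to subtract a safety margin from the limit so that the final output can still be written.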
