Calling multiple subroutines repetitively or simply calling an extensive subroutine once

Question

In my CFD solver, several extensive computations must be applied at every node of the domain, depending on the indices i, j, k, and l. The domain is 3-D and has a resolution of IMAX + 1 by JMAX + 1 by KMAX + 1.

My problem is about repetitive implementation of these very extensive blocks.

Which of the two following methods is more efficient and creates less processing load?

Method 1

MODULE module_of_method_1
    IMPLICIT NONE
    PRIVATE
    PUBLIC :: sub_1, sub_2, sub_3

    INTEGER, PARAMETER, PUBLIC :: IMIN = 0   , &
                                  IMAX = 1024, &
                                  JMIN = 0   , &
                                  JMAX = 1024, &
                                  KMIN = 0   , &
                                  KMAX = 1024, &
                                  SITE = 32
    CONTAINS
    SUBROUTINE sub_1 ()
        ! very extensive block 1
    END SUBROUTINE
    SUBROUTINE sub_2 ()
        ! very extensive block 2
    END SUBROUTINE
    SUBROUTINE sub_3 ()
        ! very extensive block 3
    END SUBROUTINE
END MODULE

PROGRAM driver_of_method_1
    USE module_of_method_1

    IMPLICIT NONE

    INTEGER :: I, J, K, L

    DO k = KMIN, KMAX
        DO j = JMIN, JMAX
            DO i = IMIN, IMAX
                DO l = 0, SITE
                    SELECT CASE (case_expression(i, j, k, l))
                    CASE (case_selector_1)
                        CALL sub_1 ()
                    CASE (case_selector_2)
                        CALL sub_2 ()
                    CASE DEFAULT
                        CALL sub_3 ()
                    END SELECT
                END DO
            END DO
        END DO
    END DO
END PROGRAM

Method 2

MODULE module_of_method_2
    IMPLICIT NONE
    PRIVATE
    PUBLIC :: only_one_subroutine

    INTEGER, PARAMETER :: IMIN = 0   , &
                          IMAX = 1024, &
                          JMIN = 0   , &
                          JMAX = 1024, &
                          KMIN = 0   , &
                          KMAX = 1024, &
                          SITE = 32
    CONTAINS
    SUBROUTINE only_one_subroutine ()
        INTEGER :: I, J, K, L

        DO k = KMIN, KMAX
            DO j = JMIN, JMAX
                DO i = IMIN, IMAX
                    DO l = 0, SITE
                        SELECT CASE (case_expression(i, j, k, l))
                        CASE (case_selector_1)
                            ! very extensive block 1
                        CASE (case_selector_2)
                            ! very extensive block 2
                        CASE DEFAULT
                            ! very extensive block 3
                        END SELECT
                    END DO
                END DO
            END DO
        END DO
    END SUBROUTINE
END MODULE

PROGRAM program_of_method_2
    USE module_of_method_2

    IMPLICIT NONE

    CALL only_one_subroutine ()
END PROGRAM

I prefer method 1, since it is a kind of top-down design that is simpler to debug, develop, and maintain, but I am concerned about the processing load of this method.

What measurements have you made so far to inform your thinking about the processing loads of the two options you outline? — High Performance Mark, Sep 17 '16 at 18:10
Actually no measurement. In the past, I implemented both methods and method 1 seemed more time-consuming. — Shaqpad, Sep 17 '16 at 18:13
So you are asking us to do the work for you? You must do some measurements. Your code is too incomplete and there is no obvious reason for a big difference in performance. — Vladimir F Героям слава, Sep 17 '16 at 18:22
@VladimirF No of course not. Experienced programmers can answer me without doing the work for me. — Shaqpad, Sep 17 '16 at 18:26
Being experienced does not mean one has a magic crystal ball. — Vladimir F Героям слава, Sep 17 '16 at 18:32
@VladimirF So I appreciate your helpful comments and wait for others. — Shaqpad, Sep 17 '16 at 18:35
Extremely important will be your strategies for parallelization. Also, your strategies for vectorizing inner loops and minimizing data movement. — tim18, Sep 17 '16 at 19:36
Your question is not about the Fortran language; it is a more general question about how compilers can produce efficient code. I would suggest replacing the ``Fortran`` tag with ``compilation`` and reformulating the question in pseudo-code. You will then have access to people with crystal balls powerful enough to display the answer to your question. — Anthony Scemama, Sep 17 '16 at 20:50

score 1 · Answer 1 · answered Sep 17 '16 at 20:47

If your subroutines sub_1, sub_2, ... are in the same file as the driver routine, then the compiler (with no specific option) has all the information it needs to decide whether or not to inline them. If you inline the subroutines by hand, you leave the compiler no choice. In the best case, inlining is the right choice and you will see no difference, because the compiler would have made that choice too. In the worst case, you will see a slowdown, because the compiler would have chosen not to inline.

Of course, I am assuming that you don't have a badly compiled executable (because your compiler is a poor one or because it was given unsuitable options such as -O0).

Generally it is better to let the compiler choose the best strategy, and this may depend on your architecture and on your code. The choices may not be the same for a Xeon or a Power CPU. The best you can do is to give as much info as you can to the compiler through options and directives (the man page is a good start) and let it do its job.
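Since the comments ask for measurements, here is a minimal sketch of how the two approaches could be timed against each other. Everything in it is an assumption made for illustration: the module and program names, the two tiny kernels standing in for the "very extensive blocks", the `mod(i, 2)` selector, and the iteration count are all hypothetical and not taken from the actual solver. Real blocks are far larger, so the result may well differ for your code.

```fortran
! Hypothetical micro-benchmark: names and arithmetic are illustrative only.
module bench_kernels
    use, intrinsic :: iso_fortran_env, only: real64
    implicit none
    private
    public :: kernel_a, kernel_b

contains

    ! Stand-in for "very extensive block 1"
    pure subroutine kernel_a(x, acc)
        real(real64), intent(in)    :: x
        real(real64), intent(inout) :: acc
        acc = acc + x * x
    end subroutine kernel_a

    ! Stand-in for "very extensive block 2"
    pure subroutine kernel_b(x, acc)
        real(real64), intent(in)    :: x
        real(real64), intent(inout) :: acc
        acc = acc + 0.5_real64 * x
    end subroutine kernel_b

end module bench_kernels

program bench_inline
    use, intrinsic :: iso_fortran_env, only: int64, real64
    use bench_kernels, only: kernel_a, kernel_b
    implicit none

    integer, parameter :: n = 50000000   ! arbitrary iteration count
    integer            :: i
    integer(int64)     :: t0, t1, t2, rate
    real(real64)       :: x, acc_call, acc_inline

    acc_call   = 0.0_real64
    acc_inline = 0.0_real64

    call system_clock(t0, rate)

    ! Variant 1: the loop dispatches to module subroutines (like method 1)
    do i = 1, n
        x = real(i, real64)
        select case (mod(i, 2))
        case (0)
            call kernel_a(x, acc_call)
        case default
            call kernel_b(x, acc_call)
        end select
    end do

    call system_clock(t1)

    ! Variant 2: the same bodies written directly in the loop (like method 2)
    do i = 1, n
        x = real(i, real64)
        select case (mod(i, 2))
        case (0)
            acc_inline = acc_inline + x * x
        case default
            acc_inline = acc_inline + 0.5_real64 * x
        end select
    end do

    call system_clock(t2)

    print '(a, f10.3, a)', 'subroutine calls: ', &
        real(t1 - t0, real64) / real(rate, real64), ' s'
    print '(a, f10.3, a)', 'inlined bodies:   ', &
        real(t2 - t1, real64) / real(rate, real64), ' s'

    ! Printing the sums keeps the compiler from discarding the loops
    print *, 'checksums: ', acc_call, acc_inline
end program bench_inline
```

Building this once with optimization enabled (for example `gfortran -O2`) and once with `-O0` should show whether the call overhead is still visible after the compiler has had the chance to inline. Because the module and the driver sit in the same file, the compiler sees the subroutine bodies when it compiles the loop, which is the situation described above.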
