I have scanned many sources (~20 books and dozens of web pages). Hard luck I missed something really important. The question I posted is indeed incorrect and comes from my initial high expectation about array operations in fortran.
The answer I would expect is: there are no tools to write short, readable code in fortran with automatic parallelization (to be more precise: there are, but those are proprietary libraries).
The list of intrinsic functions available in fortran is quite short
(link), and consists only of functions easily mapped to SIMD ops.
There are lots of functions that one will be missing.
- while this could be resolved by separate library with separate implementation for each platform, fortran doesn't provide such. There are commercial options (see this thread)
Brief examples of missing functions:
no built-in array sort
or unique
. The proposed way is to use this library, which provides single-threaded code (forget threads and CUDA)
cumulative sum / running sum. One trivially can implement it, but the resulting code will never work fine on threads/CUDA/Xeon Phi/whatever comes next.
bincount, numpy.ufunc.at, numpy.ufunc.reduceat (which is very useful in many applications)
In most cases fortran provides 2x speed up even with simple implementations, but the code written will always be one-threaded, while matlab/numpy functions can be reimplemented for GPU or other parallel platform without any effort from user side (which occasionally happened to MATLAB, also see gnumpy, theano and parakeet)
To conclude, this is bad news for me. Fortran developers really care about having fast programs today, not in the future. I also can't lock my code on proprietary software. And I'm still looking for appropriate tool. (Julia is current candidate)
See also: