Q : Why is this not vectorizing?
The culprit is the "branching…cannot vectorize" message - it refers to this statement:
if( ( i < gDIM ) && ( j < fDIM ) ){ ... }
Efficient SIMD-based vectorisation requires that the code-execution flow is not "divergent" (branched): every data element has to go through the very same instruction. Data elements get SIMD-"glued" into vectors of data, placed into wide-enough, SIMD-friendly CPU registers, and computed at once by a single SIMD instruction. That instruction is the very same for every element in the pack, i.e. not an if(){...}else{...} that diverges into different sequences of different instructions for different data elements.
It is principally impossible to perform different operations on different parts of the data packed into one SIMD register: one and only one SIMD instruction is executed at once for all vector components stored in that register.
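As a rough illustration of removing the divergence, here is a minimal sketch in C. Only the guard and the names gDIM and fDIM come from the question; the loop body (an outer product), the padded extents gPAD and fPAD, and both function names are assumptions made up for the example, since the original body is elided. The first function tests the guard for every element, the second folds the same guard into the loop bounds so that the inner loop has no branch left.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical loop body: out[i * fDIM + j] = g[i] * f[j].
 * The original code elides its body, so this is an assumption. */

/* Guarded form: iterates over a padded index space (gPAD x fPAD, both
 * assumed names) and evaluates the guard for every single element. */
void outer_guarded(float *out, const float *g, const float *f,
                   size_t gDIM, size_t fDIM, size_t gPAD, size_t fPAD)
{
    assert(out != NULL && g != NULL && f != NULL);
    for (size_t i = 0; i < gPAD; ++i)
        for (size_t j = 0; j < fPAD; ++j)
            if ((i < gDIM) && (j < fDIM))      /* branch inside the inner loop */
                out[i * fDIM + j] = g[i] * f[j];
}

/* Branch-free form: the guard becomes the loop bounds, so every inner
 * iteration executes the very same multiply-and-store. */
void outer_branch_free(float *restrict out, const float *restrict g,
                       const float *restrict f, size_t gDIM, size_t fDIM)
{
    assert(out != NULL && g != NULL && f != NULL);
    for (size_t i = 0; i < gDIM; ++i) {
        const float gi = g[i];                 /* loop-invariant, hoisted */
        for (size_t j = 0; j < fDIM; ++j)
            out[i * fDIM + j] = gi * f[j];     /* no branch in the body */
    }
}
```

Both functions write the same values for in-range indices. Whether a given compiler actually vectorises either loop still depends on the compiler, its flags and the target, so the vectorisation report should be checked rather than assumed.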
Hardware details of integer and floating-point SIMD instructions vary, as do the resulting micro-op latencies, and processor-specific details of what the compiler generates matter a lot. Yet the principle of avoiding divergent paths is common to automated SIMD vectorisation in the compiler phase. For more details on SIMD instructions and their further performance-limiting properties, you may read and learn from Agner