For general performance we recommend "-fast", "-O3", or "-fast -O3". "-Mconcur" enables auto-parallelization, which may or may not help; in general it's better to use explicit parallelization via OpenACC or OpenMP directives, or Fortran "DO CONCURRENT".
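As a rough illustration, typical compile lines might look like the following (the source and program names here are placeholders, and the flag combinations are just examples, not the only sensible ones):

```shell
# Aggressive general-purpose optimization:
nvfortran -fast -O3 -o app main.f90

# Auto-parallelization (may or may not help):
nvfortran -fast -Mconcur -o app main.f90

# Explicit parallelism: DO CONCURRENT loops offloaded to a GPU:
nvfortran -fast -stdpar=gpu -o app main.f90
```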
Other potentially useful optimization flags:
-Mnouniform - Allow non-uniform computation of SIMD and scalar code. Faster, but may reduce some accuracy.
-Mstack_arrays - Allocate automatic arrays on the stack rather than the heap. Faster but uses more stack. You may need to increase the program's stack in your shell environment.
-Bstatic-nvidia - Link the compiler runtime libraries statically rather than dynamically.
-Mfprelaxed - Allow use of faster but reduced precision intrinsics and floating-point computations.
-mp[=gpu] - Enable OpenMP directives and optionally enable target offload to GPUs.
-acc[=multicore] - Enable OpenACC directives; GPU offload is the default, and "multicore" targets multicore CPUs instead.
-stdpar[=gpu] - Enable automatic parallelization of DO CONCURRENT, targeting either the host CPU or the GPU.
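A sketch of how a few of these flags might be combined. Note that with "-Mstack_arrays" you may need to raise the shell's stack limit before running; the program name here is a placeholder:

```shell
# Hypothetical build using several of the flags above:
nvfortran -fast -Mstack_arrays -Mfprelaxed -Bstatic-nvidia -o app main.f90

# -Mstack_arrays puts automatic arrays on the stack, so raise (or remove)
# the stack limit in the shell before running:
ulimit -s unlimited    # bash/zsh; csh uses "limit stacksize unlimited"
./app
```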
The debugging flags are fine, though "-C" and "-Mbounds" both enable bounds checking so only one is needed.
Another useful flag to use during development is "-Minfo". The compiler will give feedback messages on which optimizations it is applying or is unable to apply. It can be a lot of messages, so you can use sub-options to limit the output to particular types, such as "-Minfo=vect" to see which loops are or are not getting vectorized. See "nvfortran -help -Minfo" for the full list of sub-options.
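For example (the file name is a placeholder):

```shell
# Report only vectorization feedback while compiling:
nvfortran -fast -Minfo=vect -c main.f90

# Show the full list of -Minfo sub-options:
nvfortran -help -Minfo
```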