The time it takes to execute that function can vary widely depending on your compiler and settings. Since your function does nothing, an optimizer would reduce it to a simple bx lr, which takes very little time. If you are able to measure a meaningful delay, then you are not optimizing (and the overall execution time of this and other parts of your code will vary even more).
Assuming you solve that problem in a deterministic and repeatable manner, you can get a rough idea of how long the function takes by running it and timing it against a reference clock. The timers in the Cortex-M4 are an excellent choice.
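As a rough sketch of the measurement idea, the snippet below reads a free-running 32-bit up-counter before and after the call and subtracts the two readings. The names ticks_elapsed, time_call and spin are made up for illustration, and the counter pointer is an assumption: on many Cortex-M4 parts the DWT cycle counter could play that role once it is enabled, but check your part's reference manual before relying on it.

```c
#include <stdint.h>

/* Elapsed ticks between two reads of a free-running 32-bit up-counter.
 * Unsigned subtraction gives the right answer across a single wraparound. */
static inline uint32_t ticks_elapsed(uint32_t start, uint32_t end)
{
    return (uint32_t)(end - start);
}

/* Hypothetical stand-in for the function being measured. The volatile
 * counter keeps the optimizer from removing the loop entirely. */
static void spin(void)
{
    for (volatile uint32_t i = 0; i < 1000u; i++) {
        /* intentionally empty */
    }
}

/* Time one call of fn using the counter that `counter` points at.
 * `counter` is an assumption: pass the address of whatever free-running
 * timer register your part provides. */
static uint32_t time_call(volatile const uint32_t *counter, void (*fn)(void))
{
    uint32_t start = *counter;
    fn();
    uint32_t end = *counter;
    return ticks_elapsed(start, end);
}
```

Run the measurement several times and under the same cache, clock and flash settings you will ship with; the result also includes the small overhead of the two counter reads and the call itself.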
Any time you change the way you use this code, turn on a cache, change the processor clock, change the timing settings on the flash, etc., you will need to re-tune your delay function.
It is far easier to just use one of the timers directly to perform a delay, and the accuracy improves by quite a bit. It also saves you from having to maintain the counter loop code and/or the calls to it.
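A timer-based delay might look something like the sketch below. It assumes a free-running 32-bit up-counter that ticks at a known rate; the names us_to_ticks, delay_expired and delay_ticks are hypothetical, and the counter address and clock frequency are yours to supply. The delay must be shorter than one full wrap of the counter.

```c
#include <stdbool.h>
#include <stdint.h>

/* Convert microseconds to timer ticks for a counter running at clock_hz.
 * The result saturates at UINT32_MAX rather than overflowing. */
static uint32_t us_to_ticks(uint32_t us, uint32_t clock_hz)
{
    uint64_t ticks = ((uint64_t)us * clock_hz) / 1000000u;
    return (ticks > UINT32_MAX) ? UINT32_MAX : (uint32_t)ticks;
}

/* True once at least `ticks` have passed since `start`.
 * Unsigned subtraction keeps this correct across one counter wraparound. */
static bool delay_expired(uint32_t start, uint32_t now, uint32_t ticks)
{
    return (uint32_t)(now - start) >= ticks;
}

/* Busy-wait until `ticks` counter ticks have elapsed.
 * `counter` is an assumption: the address of a free-running timer register. */
static void delay_ticks(volatile const uint32_t *counter, uint32_t ticks)
{
    uint32_t start = *counter;
    while (!delay_expired(start, *counter, ticks)) {
        /* spin until the timer says enough time has passed */
    }
}
```

Because the wait is tied to the timer rather than to instruction timing, it does not need re-tuning when the optimization level, cache or flash wait states change; only a change to the timer's clock rate has to be reflected in the clock_hz value you pass in.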