You probably have only two choices:
Use an instruction-cycle accurate simulator; the problem here is that effectively simulating peripherals and external stimulus can be complex or impossible.
Use a peripheral hardware timer. In most cases you will not be able to run such a timer at the typical core clock rate of an ARM9, and there will be an over head in servicing the timer either side of the period being timed, but it can be used to give execution time over larger or longer running sections of code, which may be of more practical use than cycle count.
While cycle count may be somewhat scalable to different clock rates, it remains constrained by memory and I/O wait states, so is perhaps not as useful as it may seem as a performance metric, except at the micro-level of analysis, and larger performance gains are typically to be had by taking a wider view.