
I have a fairly complex computational code that I'm trying to speed up and multi-thread. In order to optimize the code, I'm trying to work out which functions are taking the longest or being called the most.

I haven't really profiled code before, so I could be missing something. However, I know that many existing profiling modules don't play nicely with numba's njit() decorator, because the decorated functions are recompiled with LLVM.

So my question would be this: What's the best way to profile code in which most functions have the njit() decorator, with a few non-jitted control functions?

I've come across data_profiler before; however, it doesn't seem to be in the conda repository anymore, and I wouldn't know how to build it from source in conda, or whether it would still be compatible with modern versions of its dependencies.

Yoshi
  • Have you tried setting the `cache=True` option in `njit`? Combine this with a profiler like the one implemented in the Spyder IDE, which internally uses `cProfile`; `cProfile` is also a good module for profiling by hand. – JE_Muc Mar 18 '19 at 11:40
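For the record, a minimal sketch of the suggestion above, assuming plain numba and the standard library; heavy_kernel() is just a hypothetical stand-in for a function under test. Note that cProfile will report the njit-ed call as a single opaque entry, while cache=True at least keeps the LLVM re-compilation costs out of repeated runs:

import cProfile, pstats
from numba import njit

@njit( cache=True )              # cache=True persists the compiled binary on disk,
def heavy_kernel( n ):           #                so re-runs skip LLVM re-compilation
    s = 0
    for i in range( n ):
        s += i * i
    return s

heavy_kernel( 10 )               # warm-up call: compile once ( or load from cache )

cProfile.run( "heavy_kernel( 10**8 )", "prof.out" )
pstats.Stats( "prof.out" ).sort_stats( "cumulative" ).print_stats( 10 )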

1 Answer


If this may help as a last resort, let's give it a try:

Having spent a few tens of man-years on QuantFX module development, both using numba and other vectorisation / JIT-acceleration tools, let me share a few pieces of experience that proved handy for our similarly motivated profiling.

Contrary to the mentioned data_profiler, which works with milliseconds, we enjoyed the microsecond resolution provided as a side-effect of using the ZeroMQ module for our distributed signalling / messaging infrastructure.

ZeroMQ has all its services implemented in a core engine, called a Context, yet there is one small utility that is free to re-use independently of this instrumentation: a Stopwatch, a microsecond-resolution timer class.

So, nothing could stop us from:

from zmq import Stopwatch as MyClock    # the pyzmq package imports as 'zmq'; note Stopwatch is deprecated in recent pyzmq versions

aClock_A = MyClock(); aClock_B = MyClock(); aClock_C = MyClock(); print( "ACK: A,B,C made" )

# may use 'em when "framing" a code-execution block
# ( aNumOfCollatzConjectureSteps() below being a placeholder for one's own function under test ):
aClock_A.start(); _ = sum( [ aNumOfCollatzConjectureSteps( N ) for N in range( 10**10 ) ] ); TASK_A_us = aClock_A.stop()
print( "INF: Collatz-task took {0:} [us] ".format( TASK_A_us ) )

# may add 'em into call-signatures and pass 'em and/or re-use 'em inside whatever our code
aReturnedVALUE = aNumbaPreCompiledCODE(  1234,
                                        "myCode with a need to profile on several levels",
                                        aClock_A, #     several, 
                                        aClock_B, # pre-instantiated,
                                        aClock_C  #     Stopwatch instances, so as
                                        )         #  to avoid chained latencies

This way one can, if indeed pushed into using at least this as the tool of last resort, "hard-wire" into one's own source code any structure of Stopwatch-based profiling. The only restriction is the need to conform to the finite-state automaton of the Stopwatch instance: once a .start() method has been called, only a .stop() method may come next, and similarly, calling the .stop() method on a not-yet-.start()-ed instance will quite naturally throw an exception.

The common try-except-finally scaffolding will help to ascertain that all Stopwatch instances do become .stop()-ed again, even if exceptions have happened, as sketched below.
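A minimal sketch of such scaffolding, where some_work_under_test() is a hypothetical stand-in for any real workload being measured:

from zmq import Stopwatch

def some_work_under_test():                     # hypothetical stand-in workload
    return sum( i * i for i in range( 10**6 ) )

aClock = Stopwatch()
aClock.start()                                  # FSA: .start() first, ...
try:
    result = some_work_under_test()
except Exception as anExc:
    print( "EXC: task failed: {0:}".format( anExc ) )
    raise
finally:
    TASK_us = aClock.stop()                     # ... .stop() always reached here,
    print( "INF: task took {0:} [us]".format( TASK_us ) )  # even after exceptions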

The structure of the "hard-wired" profiling depends on your code-execution "Hot-Spots under test". It can even cover "cross-boundary" profiling of call-related overheads, spent between a native python call of the @jit-decorated, numba-LLVM-ed code and the start of the 1st line "inside" the numba-compiled code ( i.e. how long it takes between a call-invocation and the parameter analyses, driven by a list of call-signatures, or principally avoided by enforcing a single, explicit call-signature ). A sketch of such a cross-boundary measurement follows.
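Just as an illustration ( not part of the original toolchain ), a sketch of measuring that per-call overhead, assuming plain numba and using time.perf_counter() from the standard library instead of the Stopwatch, so it runs on any pyzmq version. The trivial kernels below do near-zero work, so the measured wall-time is dominated by the python-to-LLVM call-boundary overhead, comparing a lazily specialised kernel against one with a single, explicit call-signature:

import time
from numba import njit, float64

@njit                                      # lazy: signature inferred on the 1st call
def kernel_lazy( x ):
    return x

@njit( float64( float64 ) )                # eager: single, explicit call-signature
def kernel_explicit( x ):
    return x

kernel_lazy( 1.0 ); kernel_explicit( 1.0 )         # warm-up, so compile time is excluded

for f, tag in ( ( kernel_lazy,     "lazy " ),
                ( kernel_explicit, "fixed" ) ):
    t0 = time.perf_counter()
    for _ in range( 10**5 ):
        f( 1.0 )
    t1 = time.perf_counter()
    print( "INF: {0:s} call-overhead ~{1:8.3f} [us] / call".format( tag, ( t1 - t0 ) * 1E6 / 10**5 ) )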

Good luck. Hope this helps.

user3666197