I have a fairly large project converted to Numba, and with @nb.njit(cache=True, parallel=True, nogil=True) run #1 is slow (around 15 seconds, versus 0.2-1 seconds once compiled). I realize Numba is compiling machine code optimized for the specific PC I'm running on, but since the code is distributed to a large audience, I don't want the first run after we deploy our model to take forever. What is not covered in the documentation is a "generic x64" cache=True option. I don't care if the code is a little slower on a PC that doesn't have my specific processor; I only care that the initial and subsequent runtimes are quick, and that they don't differ by a huge margin if I distribute a cache file for the @njit functions at deployment.
Does anyone know if such a "generic" x64 implementation is possible with Numba? Or are we stuck with a slow run #1 and fast runs thereafter?
Please comment if you want more details. Basically it's a function of around 50 lines that gets JIT-compiled by Numba and afterwards runs quite fast in parallel with the GIL released. I'm willing to give up some extreme optimization if the code can work in a generic form across multiple processors, because where I work the PCs vary quite a bit in how advanced they are.
I looked briefly at Numba's AOT (ahead-of-time) compilation, but these functions have so many variables being altered that I think it would take me weeks to properly decorate them with explicit type signatures so they could be compiled without a Numba dependency. I really don't have the time for AOT; at that point it would make more sense to rewrite the whole algorithm in Cython, and that's closer to C/C++ and more time-consuming than I want to devote to this project. Unfortunately there is no Numba -> Cython compiler project out there that I know of. Maybe there will be one in the future (which would be outstanding), but I don't know of such a project currently.