I've seen some demos of @cupy.fuse which is nothing short of a miracle for GPU programming using Numpy syntax. The major problem with cupy is that each operation like adding is a full kernel launch, then kernel free. SO a series of adds and multiplies, for example, pay a lot of kernel pain. ( This is why one might be better off using numba @jit)
@cupy.fuse() appears to fix this by merging all the operations inside the function to a single kernel creating a dramatic lowering of the launch and free costs.
But I cannot find any documentation of this other than the demos and the source code for cupy.fusion class.
Questions I have include:
- Will cupy.fuse aggressively inline any python functions called inside the function the decorator is applied to, thereby rolling them into the same kernel?
this enhancement log hints at this but doesn't say if composed functions are in same kernel or simply just allowed when called functions are also decorated. https://github.com/cupy/cupy/pull/1350
If so, do I need to decorate those functions with @fuse. I'm thinking that might impair the inlining not aid it since it might be rendering those functions into a non-fusable (maybe non-python) form.
If not, could I get automatic inlining by first decorating the function with @numba.jit then subsequently decorating with @fuse. Or would again the @jit render the resulting python in a non-fusable form?
What breaks @fuse? What are the pitfalls? is @fuse experimental and not likely to be maintained?
references:
https://gist.github.com/unnonouno/877f314870d1e3a2f3f45d84de78d56c
https://www.slideshare.net/pfi/automatically-fusing-functions-on-cupy
https://github.com/cupy/cupy/blob/master/cupy/core/fusion.py
https://docs-cupy.chainer.org/en/stable/overview.html
https://github.com/cupy/cupy/blob/master/cupy/manipulation/tiling.py