In the CUDA 11 features announcement, it's said that there are now:
New link time optimization capabilities
What link-time optimizations does nvcc actually employ (e.g. relative to the LTO capabilities in host-side code with g++ or clang++)?
Also, is there something one needs to do to get LTO enabled, or does it always occur (unlike with host-side code, where you need to compile with the -flto switch)?