I don't think you can easily achieve what you want: when doing a full matrix multiplication, the main speedup of the current matrix multiplication algorithms comes from the fact that they do the multiplication block-wise and leverage the CPU's SIMD instructions as well as are optimized towards the CPU's various cache sizes. An optimized algorithm in a library is easily 10-100 times faster than doing it the naive way by 3 nested loops. However, those algorithms don't calculate individual elements one at a time, but calculate multiple elements all at once. (As I said: they operate block-wise.)
If in your case you only want to calculate a certain subset of elements, unless the elements are organized in a systematic way (e.g. only the left N columns, or top M rows, or something like that), then the only way to restrict the calculation to just those elements is to explicitly write all 3 nested loops. And that will be less efficient than using the optimized GEMM algorithms modern libraries provide.
The best you can probably do is probably roughly what you're doing in your OpenMP example, because there you're already using the product implementation of your library for the inner-most loop. (You could probably avoid some memory allocations for the index vector if you do some clever tricks, but other than that I don't see much optimization potential there.)
If there is some additional systematic towards the elements (e.g. nearly all cluster in a specific block of the matrix), then you could probably optimize that a bit more for your specific use case, using that information. (Assuming you know this behavior in advance and don't have to detect that during runtime.)
But to be honest, if the numbers you provide are realistic, and you are calculating 1000 elements out of 10'000, then I don't think you'll ever be faster than using the full GEMM implementation of a decent library to calculate all elements, even if most are not needed, just because of how much speedup the full algorithms provide compared to calculating the elements individually. You either need to only require a calculation of a lot less elements than that, or you need to have a very systematic way of determining which elements need to be calculated. (That doesn't depend on evaluating a condition for each element individually.)
Perhaps if you provide more information about the actual problem you want to solve here, there may be a clever way of achieving that - but when looking at just the matrix multiplication part, I doubt you'll be able to speed that up much, if at all.