I haven't used VTune specifically, but a high value there probably means execution-port throughput is the bottleneck, which I guess is why it shows that metric in red. Yes, that's generally what you want for a matmul: keeping the execution units fed.
For other algorithms in general, algorithmic optimizations are sometimes possible, perhaps trading higher latency or a small lookup table for fewer uops. Showing this metric in red could prompt you to look for such optimizations, so it makes some sense for VTune to work that way.
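As a rough illustration of that kind of trade (not something taken from your code), here's a sketch of reversing the bits in a byte two ways: a bit-at-a-time loop, and a version using a 16-entry nibble table. The function and table names are made up for this example. The table version executes far fewer uops per byte, at the cost of load latency and a little cache footprint; whether that's a net win depends on the surrounding code.

```c
#include <stdint.h>

/* Bit-at-a-time: a shift, mask, and OR for each of the 8 bits. */
static uint8_t reverse_bits_loop(uint8_t x)
{
    uint8_t r = 0;
    for (int i = 0; i < 8; i++) {
        r = (uint8_t)((r << 1) | (x & 1));
        x >>= 1;
    }
    return r;
}

/* rev_nibble[v] is v with its 4 bits in reverse order. */
static const uint8_t rev_nibble[16] = {
    0x0, 0x8, 0x4, 0xC, 0x2, 0xA, 0x6, 0xE,
    0x1, 0x9, 0x5, 0xD, 0x3, 0xB, 0x7, 0xF
};

/* Two table loads plus a shift and OR: the reversed low nibble becomes
 * the high nibble of the result, and vice versa. */
static uint8_t reverse_bits_lut(uint8_t x)
{
    return (uint8_t)((rev_nibble[x & 0xF] << 4) | rev_nibble[x >> 4]);
}
```

Both functions return the same result for every input byte, so the choice is purely about uop count vs. latency and cache pressure.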
In your case, about the best you could hope for is reducing loop overhead, and that only helps if those 3+ busy ports aren't already all doing the real work (2x load / 2x FMA / store).
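To show what "reducing loop overhead" means, here's a minimal sketch using a scalar dot product as a stand-in for a matmul inner loop. It is not your actual kernel, and the function names are hypothetical. Unrolling by 4 spreads the counter increment, compare, and branch over four multiply-adds instead of one, and the separate accumulators keep the adds independent of each other. A compiler may already do this for you, so check the generated asm before hand-unrolling.

```c
#include <stddef.h>

/* Baseline: one increment/compare/branch per multiply-add. */
static float dot_simple(const float *a, const float *b, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}

/* Unrolled by 4: the same loop overhead now covers four multiply-adds.
 * Independent accumulators avoid one long serial dependency chain,
 * but they change the order of FP additions, so results can differ
 * from dot_simple in the last bits for general inputs. */
static float dot_unrolled4(const float *a, const float *b, size_t n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i + 0] * b[i + 0];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    float sum = (s0 + s1) + (s2 + s3);
    /* Handle the leftover 0..3 elements. */
    for (; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}
```

If the loop is already saturating the load and FMA ports with useful uops, this kind of change won't make it faster; it only helps when some of the port pressure comes from the loop bookkeeping itself.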