I am working with a function that transposes a matrix. My question is: why does loop tiling reduce the running time of that function in particular?
I understand why tiling helps when, for example, multiplying matrices, because the elements are reused. But why is tiling faster for a transpose, where each element is still accessed only once?
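
For concreteness, the two variants being compared might look roughly like the sketch below. It is only an illustration: the function names, the `double` element type, the row-major layout, and the tile size `TILE = 32` are all assumptions, not the actual code in question.

```c
#include <stddef.h>

/* Hypothetical tile edge length (an assumption; would need tuning per machine). */
#define TILE 32

/* Untiled transpose: src is rows x cols, dst is cols x rows, both row-major.
 * Reads of src are contiguous; writes to dst jump by `rows` elements. */
void transpose_naive(const double *src, double *dst, size_t rows, size_t cols)
{
    for (size_t i = 0; i < rows; i++)
        for (size_t j = 0; j < cols; j++)
            dst[j * rows + i] = src[i * cols + j];
}

/* Tiled transpose: same result, but the matrix is processed one
 * TILE x TILE block at a time. Edge blocks are clipped to the matrix size. */
void transpose_tiled(const double *src, double *dst, size_t rows, size_t cols)
{
    for (size_t ii = 0; ii < rows; ii += TILE) {
        size_t i_end = (ii + TILE < rows) ? ii + TILE : rows;
        for (size_t jj = 0; jj < cols; jj += TILE) {
            size_t j_end = (jj + TILE < cols) ? jj + TILE : cols;
            for (size_t i = ii; i < i_end; i++)
                for (size_t j = jj; j < j_end; j++)
                    dst[j * rows + i] = src[i * cols + j];
        }
    }
}
```

In both versions every element of `src` is read exactly once and every element of `dst` is written exactly once; only the order in which the elements are visited differs.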