I'm trying to understand what it takes to support fast vectorized linear algebra computations for matrices and vectors of arbitrary size. From what I understand about x86 processor architectures, they contain special SIMD registers of fixed, limited width (e.g., 128-bit XMM for SSE, 256-bit YMM for AVX). These registers let you load several floating point numbers at once and apply an operation across all lanes of the register simultaneously. How do you get around the limited register size efficiently when the data can be of any length?
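To make the question concrete, here's my rough mental model of how a vectorized kernel might handle an arbitrary length `n` with fixed-width registers: a main loop over full register-width chunks plus a scalar tail for the remainder. This is just a minimal sketch using AVX intrinsics (the function name and structure are my own, not taken from OpenBLAS):

```c
#include <immintrin.h>
#include <stddef.h>

/* Sketch of y += alpha * x for arbitrary n, using 256-bit AVX
   registers (8 floats per register). Compile with -mavx. */
void saxpy_avx(size_t n, float alpha, const float *x, float *y) {
    __m256 va = _mm256_set1_ps(alpha);      /* broadcast alpha to all 8 lanes */
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {            /* main loop: full 8-wide chunks */
        __m256 vx = _mm256_loadu_ps(x + i);
        __m256 vy = _mm256_loadu_ps(y + i);
        vy = _mm256_add_ps(vy, _mm256_mul_ps(va, vx));
        _mm256_storeu_ps(y + i, vy);
    }
    for (; i < n; ++i)                      /* scalar tail for the leftover elements */
        y[i] += alpha * x[i];
}
```

Is this chunked-loop-plus-tail pattern essentially what a real BLAS does, or is there more to it (alignment handling, blocking, etc.)?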
I was looking at the OpenBLAS source code to figure this out, but even with the dev docs I couldn't trace the general flow for a simple operation such as gemv.
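For reference, this is the plain scalar version of the operation I mean, which I'm trying to map onto the vectorized code paths (a simple row-major, non-transposed sketch; real BLAS gemv is column-major and takes transpose/stride arguments):

```c
#include <stddef.h>

/* Scalar reference for y = alpha*A*x + beta*y, with A an m-by-n
   row-major matrix with leading dimension lda. */
void sgemv_ref(size_t m, size_t n, float alpha,
               const float *A, size_t lda,
               const float *x, float beta, float *y) {
    for (size_t i = 0; i < m; ++i) {
        float acc = 0.0f;
        for (size_t j = 0; j < n; ++j)
            acc += A[i * lda + j] * x[j];   /* dot product of row i with x */
        y[i] = alpha * acc + beta * y[i];
    }
}
```

Where in the OpenBLAS tree does the dispatch from the `sgemv` interface down to the architecture-specific kernel actually happen?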