
I'm trying to understand what it takes to support fast vectorized linear algebra computations on matrices and vectors of arbitrary size. From what I understand of the x86 architecture, processors contain special SIMD registers of limited width (128-bit SSE, 256-bit AVX) that hold several floating-point numbers at once; a scalar can be broadcast across all lanes of a register, and a single instruction then operates on every lane in parallel. How do you get around the limited register size efficiently?
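
To check my understanding, here is roughly what I picture a single SIMD step looking like, written with AVX intrinsics from `<immintrin.h>` (this assumes a CPU with AVX; the function name is just for illustration):

```c
#include <immintrin.h>

/* One SIMD step: broadcast a scalar into all eight lanes of a 256-bit
   register, then operate on eight floats at once.
   Assumes x and y each point to at least 8 floats. */
void axpy8(float alpha, const float *x, float *y)
{
    __m256 valpha = _mm256_set1_ps(alpha);  /* broadcast alpha to all 8 lanes */
    __m256 vx     = _mm256_loadu_ps(x);     /* load 8 floats from x */
    __m256 vy     = _mm256_loadu_ps(y);     /* load 8 floats from y */
    vy = _mm256_add_ps(vy, _mm256_mul_ps(valpha, vx)); /* y += alpha*x, elementwise */
    _mm256_storeu_ps(y, vy);                /* store 8 results back to y */
}
```

What I don't see is how this scales up to a vector of, say, 1000 elements.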

I was looking at the OpenBLAS source code to figure this out, but despite reading the dev docs, I couldn't figure out the general flow for a simple operation such as gemv.


1 Answer


OpenBLAS relies on kernels to perform these operations efficiently. In this context, a "kernel" is hand-written assembly (or intrinsics-heavy C) implementing one linear algebra operation for one architecture; the gemv kernels live under the kernel/x86_64 and kernel/arm64 directories of the OpenBLAS source tree. The limited register width is overcome by looping: the kernel walks the data in register-width chunks (usually unrolled several chunks per iteration), handles the leftover elements that don't fill a register with a scalar or masked tail loop, and blocks the matrix so the working set stays in cache.
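
To make that concrete, here is a minimal sketch of a column-major sgemv (y += alpha * A * x) using AVX intrinsics. This is not OpenBLAS's actual code, just an illustration of the chunk-plus-tail pattern the real kernels build on; the function name and signature are hypothetical:

```c
#include <immintrin.h>

/* Sketch of column-major sgemv: y += alpha * A * x, where A is m-by-n
   with leading dimension lda. Each column of A contributes an axpy:
   y += (alpha * x[j]) * A[:, j]. Real kernels add unrolling, blocking,
   and alignment handling on top of this. Assumes AVX is available. */
void sgemv_n(long m, long n, float alpha,
             const float *A, long lda, const float *x, float *y)
{
    for (long j = 0; j < n; j++) {
        const float *col = A + j * lda;
        /* Broadcast the per-column scalar into all 8 lanes. */
        __m256 vscale = _mm256_set1_ps(alpha * x[j]);
        long i = 0;
        /* Main loop: process 8 rows per iteration, as many full
           register-width chunks as fit in m. */
        for (; i + 8 <= m; i += 8) {
            __m256 va = _mm256_loadu_ps(col + i);
            __m256 vy = _mm256_loadu_ps(y + i);
            _mm256_storeu_ps(y + i,
                _mm256_add_ps(vy, _mm256_mul_ps(vscale, va)));
        }
        /* Scalar tail: the leftover m % 8 rows. */
        for (; i < m; i++)
            y[i] += alpha * x[j] * col[i];
    }
}
```

The same pattern generalizes: the register width only fixes the chunk size of the inner loop, and arbitrary problem sizes are covered by the loop bounds plus the tail.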
