
I'm a math undergrad taking linear algebra. There was a recent court case in the news that denied some video evidence because the footage was zoomed in (the data was interpolated, it "created new pixels"). This got me thinking, how would I linearly interpolate a matrix?

I looked into it and could only find algorithms that used nested for loops, nothing that involved much linear algebra. This surprised me because I thought operations like matrix multiplication were more efficient.

Eventually I figured out a much simpler (and more satisfying!) way to interpolate a matrix (nearest neighbor and linear, so far) with linear algebra.

(Basically, if you have an m×n matrix A, then you construct two very simple matrices: a (2m-1)×m matrix L and an n×(2n-1) matrix R, and multiply them L * A * R to get a (2m-1)×(2n-1) matrix with interpolated rows and columns. Designing L and R is fun and easy.)
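Here's a rough sketch of the construction I mean, in NumPy; the helper name `lerp_operator` and the example matrix are just mine for illustration:

```python
import numpy as np

def lerp_operator(k):
    # (2k-1) x k matrix: even output rows copy the original samples,
    # odd output rows average two neighbouring samples (linear interpolation).
    P = np.zeros((2 * k - 1, k))
    for i in range(k):
        P[2 * i, i] = 1.0
    for i in range(k - 1):
        P[2 * i + 1, i] = 0.5
        P[2 * i + 1, i + 1] = 0.5
    return P

A = np.arange(12, dtype=float).reshape(3, 4)  # any m x n matrix
L = lerp_operator(3)        # (2m-1) x m, interpolates the rows
R = lerp_operator(4).T      # n x (2n-1), interpolates the columns
B = L @ A @ R               # (2m-1) x (2n-1)
print(B.shape)              # (5, 7)
```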

So why don't programmers use matrix multiplication to interpolate matrices? In theory, shouldn't graphics cards make this a blazing fast calculation when compared to nested for loops?

Or maybe programmers do this already, but because it's so obvious, there's just not much information published on it?

Thanks :D

Edit: It seems my loose phrasing of using nested loops rather than matrix multiplication caused a lot of confusion. I didn't mean to imply that GPUs don't loop. I just meant that that part would be abstracted away behind some library or perhaps the GPU itself. Good software composes its functions like that. Also, programming this directly through nested loops forfeits the optimizations you'd get from matrix-math libraries or the GPU.

Whether or not an algorithm for matrix products uses a nested for loop is actually irrelevant. Maybe the algorithm uses a recursive function. Maybe it uses some efficient but counter-intuitive hack, like Quake III's fast inverse square root function. It doesn't really matter, and it's silly that this small ambiguity derailed the majority of the discussion.

It turns out that my understanding was mostly (completely?) right. GPUs are better for matrix math, but only for very large matrices.

And my question was thoroughly answered: the FFT-based approach is O(n·log n), much faster than the O(n^3) of matrix multiplication. I highly recommend reading the answer by @datenwolf and its comments.

  • You usually don't linearly interpolate in signal processing (and especially not in image processing). Also, why would you think a nested loop doesn't do linear algebra? Last time I've checked, the way you do matrix-matrix products, both on paper and on a computer, is very much nested loops: for each element of the output matrix (that's a row and a column loop), loop over the row from the first factor and the column of the second factor and multiply and accumulate the elements. That's linear algebra, which you *implement* through nested loops (see the sketch after these comments). – Marcus Müller Nov 20 '21 at 15:55
  • Well, I'm pretty sure software like Photoshop has an option for linear (and nearest neighbor, cubic, etc) interpolation when scaling an image. – Adam Neeley Nov 20 '21 at 16:00
  • I also meant to add that I was under the impression that graphics cards are basically matrix calculators. So if you code a video game, you generally don't loop over the rows and columns of a matrix, you just say "this matrix times this matrix" and the GPU will take care of it. Please correct me if I'm wrong. – Adam Neeley Nov 20 '21 at 16:04
  • you're not wrong, but if you multiply a matrix with another matrix on a GPU, then that GPU executes nested loops. A GPU can execute a lot of loops in parallel, but they're still loops. – Marcus Müller Nov 20 '21 at 16:26
  • Well, GPUs are optimized to perform these specific types of matrix calculations, so I imagine that they probably use special algorithms or techniques to execute these operations in a way that is faster and more efficient than just writing a generic nested loop for a CPU, parallel or otherwise. – Adam Neeley Nov 20 '21 at 17:03
  • The special algorithm is called a loop. – Marcus Müller Nov 20 '21 at 17:07
  • My point is that those loops are optimized to perform linear algebra. See https://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprograms Also, someone in the math section answered my question. Calculating the linear interpolation of large matrices via matrix multiplication ends up using very sparse matrices. So matrix multiplication is not more efficient in this case. – Adam Neeley Nov 20 '21 at 17:13
  • have you actually read BLAS or netlib LAPACK source code? I have read some: It's FORTRAN full of loops. Can we stop pretending there's a way to compute a matrix matrix product without loops? Even (and especially) for sparse matrices, you loop. – Marcus Müller Nov 20 '21 at 17:16
  • I never said it didn't loop. My point is just that the GPU optimizes these operations. But the fact that they are sparse matrices probably renders these optimizations irrelevant, especially because GPUs require large overhead. Maybe a GPU would be useful for more complicated interpolations, but not linear. Analogy: almost all sorting algorithms loop. Still, some are faster than others. – Adam Neeley Nov 20 '21 at 17:28
  • a GPU can't optimize a matrix-matrix product beyond what linear algebra says it is. And a matrix-matrix product is a bunch of sums, and each of these sums is a loop. I don't know where you're getting the idea from that a GPU can do that any more *efficiently*; it can just do that embarrassingly *parallel*. Still loops. If you want to deny that, I propose you produce any GPU documentation that supports your idea. – Marcus Müller Nov 20 '21 at 17:37
  • Maybe GPUs just store memory efficiently for repetitive tasks like matrix multiplication. Maybe it's just algorithms efficiently harnessing parallelism. I don't know. The point is that GPUs are often used to perform certain computations including linear algebra operations more efficiently than CPUs. This is a well known fact and I won't argue about it because I don't know the details. My method for interpolation used matrices, so I thought a GPU might be able to process it faster than a CPU would for a non-matrix method. But I was (probably) wrong because it's a sparse matrix. – Adam Neeley Nov 20 '21 at 18:25
  • again, GPUs don't do *anything* more efficiently, just more parallel. You're bringing up "could be" and I'm like "no, I know." – Marcus Müller Nov 20 '21 at 18:29
  • Let's assume that GPUs are just parallel CPUs, like you say. Would it be more efficient for one person to do 10 simple calculations in a row, or for 10 people to each do one calculation simultaneously? Then a GPU should compute matrix multiplications more efficiently than a CPU. So please, let's move on. – Adam Neeley Nov 20 '21 at 18:57
  • it would *not* be more efficient; it would be faster. The amount of work done, the product of time and number of workers, is constant (in the best case, even!). Efficiency is a measure of *effort*, not time. – Marcus Müller Nov 20 '21 at 18:58
  • I could write the most efficient algorithm in the world, but if I told it to sleep for an hour in the middle of it, it would no longer be efficient. What a waste of time XD See ya – Adam Neeley Nov 20 '21 at 19:13
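(For reference, the nested loops Marcus describes look like the plain-Python sketch below; real BLAS implementations tile, vectorize and parallelize the same loops, but the underlying linear algebra is identical.)

```python
def matmul(A, B):
    # C[i][j] = sum over k of A[i][k] * B[k][j]: a row loop, a column loop,
    # and an innermost multiply-and-accumulate loop.
    m, n, p = len(A), len(A[0]), len(B[0])
    C = [[0.0] * p for _ in range(m)]
    for i in range(m):
        for j in range(p):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]
    return C
```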

2 Answers

3

I figured out a much simpler…

Well, you just described convolution in terms of matrix-matrix multiplication. Yes this works. But it's expensive and numerically unstable because it accumulates a lot of floating point rounding errors.

So why don't programmers use matrix multiplication to interpolate matrices?

Because it's inefficient. Matrix multiplication has complexity O(N³).

A far more efficient, and at the same time also "perfect" (in the sense of signal processing) method is to perform a forward Fast Fourier Transform, complexity O(N·log(N)), zero pad the spectrum to the desired output resolution, and perform the inverse FFT on that, again complexity O(N·log(N)), so the total complexity is O(2·N·log(N)). Constant factors are omitted in Big-O notation, so the complexity is O(N·log(N)).

The underlying math of this is the Convolution theorem.

This is far, far, FAR better than the O(N³) of matrix multiplication. So that's your reason why we don't use matrix multiplication right there.

If you do a simple zero-pad you're doing interpolation with a sin(x)/x kernel. But by multiplying the spectrum with the spectrum of a different interpolation kernel you can implement any kind of interpolation you desire this way.
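A minimal 1-D sketch of this zero-padding approach in NumPy (the function name and the 2× factor are illustrative, and the Nyquist-bin subtlety for even-length inputs is glossed over):

```python
import numpy as np

def fft_upsample(x, factor=2):
    # Forward FFT, zero-pad the spectrum, inverse FFT: band-limited
    # (sinc-kernel) interpolation in O(N log N).
    n = len(x)
    m = n * factor
    X = np.fft.rfft(x)                        # spectrum of the real-valued input
    Xp = np.zeros(m // 2 + 1, dtype=complex)  # larger spectrum; new bins stay zero
    Xp[:X.size] = X
    return np.fft.irfft(Xp, m) * factor       # rescale for the longer output

x = np.sin(np.linspace(0, 2 * np.pi, 16, endpoint=False))
y = fft_upsample(x)                           # 32 samples interpolating the 16
```

For a matrix you apply the same trick along the rows and then along the columns.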

Shouldn't graphics cards make this a blazing fast calculation when compared to nested for loops?

GPUs have linear interpolators built in, as part of their texture sampler units hardware, which will outperform any software implementation.

datenwolf
  • I know I shouldn't be riding this point so hard, but it's worth noting that their linear interpolator units are of course "just" highly parallel units that loop over the points between the input points; there's no free lunch: GPUs have to do the same math. (In the case of very small interpolation factors, that interpolation could be done "at once", i.e. without a loop. But if you loop over a single pixel, is it any less of a loop?) – Marcus Müller Nov 20 '21 at 18:31
  • Oh, wow. That is incredibly interesting. I only know the very basics of Linear Algebra. From my microscopic understanding of Fourier transforms, it is basically a basis based on sine that can be used to describe many matrices well. If I had to take a guess, the kernel describes this basis? The spectrum must refer to its eigenvalues. I will have to think about this some more to fully appreciate your answer, but it seems good to me. Thanks :D – Adam Neeley Nov 20 '21 at 18:50
  • Ah, I do remember something about convolutions from my differential equations class when we learned about Laplace transforms. I didn't expect to see it pop up here. – Adam Neeley Nov 20 '21 at 19:27
  • @AdamNeeley: I just read about your back-and-forth with Marcus in the comments on your question. So here's something that wasn't mentioned, but is vitally important to understand in relation to GPUs: *GPUs are not __optimized__ for matrix-matrix multiplications!* The GPUs made between ~2004 and 2010 were in fact built around the premise that *everything* happened in 4D vectors of floating point numbers. But even then, except for a particular instruction (FMAD), all supported operations were just executed element-by-element, and wouldn't help with accelerating vector/matrix multiplications. – datenwolf Nov 20 '21 at 22:06
  • @AdamNeeley: With the GPUs made after ~2010 the constraint that everything had to be packed into 4-vectors was more or less lifted, and GPUs are oriented around manipulating scalars like CPUs are. But what they do support – just like modern CPUs – is the vectorization of operations. So you could load up to 4 scalars into a contiguous range of registers (forming a vector register) and perform the same operation on those 4 scalars in a single instruction. – datenwolf Nov 20 '21 at 22:11
  • @AdamNeeley: Now what GPUs do is push the parallelization a step further by running the same program (in lockstep), in parallel, on a multitude of sets of otherwise identically arranged registers. However, each parallel instance receives its invocation index in a special register, which is subsequently used to index the data loads and stores (from bulk memory) – otherwise the parallel execution is idempotent. – datenwolf Nov 20 '21 at 22:17
  • @AdamNeeley: Since even modern GPUs actually don't have dedicated matrix multiplication instructions, not even speaking of circuitry, GPUs, too, are doing matrix multiplications as nested loops. Albeit for vectors and matrices of dimension 4(×4) or smaller it's only 2 nested loops, for there is this one exceptional instruction `FMAD` (fused multiply add), which is essentially a beefed-up inner product (dot product), so it shaves off the need for the innermost nested loop implementing the inner product. – datenwolf Nov 20 '21 at 22:20
  • @AdamNeeley: However, as far as FMAD instructions go: your typical CPU does support those as well, ever since we had Intel's MMX and AMD 3DNow!, which have been around for over 20 years. And although autovectorization of compilers isn't perfect, all compilers I tested it with are smart enough to autovectorize the inner loops of e.g. my `linmath.h` library pretty well. – datenwolf Nov 20 '21 at 22:23
  • Thanks for that amazing response. That was a very nice overview of the history of GPUs–probably worthy of its own blog post or something :D. I think there was some miscommunication about nested loops (probably my fault). I didn't mean to imply that the code in the GPU doesn't loop. What I meant was that matrix products are something that a library or game engine would optimize by sending it to the GPU. Although the library would still use nested loops, it would be abstracted away behind some matrix product function, so the programmer would never see it. – Adam Neeley Nov 20 '21 at 23:19
  • In other words, you wouldn't see nested loops in my highly efficient O(N^3) algorithm XD. You would just see two nice matrix products that would be handled by a GPU in a very efficient way. I can definitely see how others could get confused about this ambiguity. I will try to be more careful with how I speak about such technical topics. In my defense, I did introduce myself as a math undergrad, not a computer expert :D – Adam Neeley Nov 20 '21 at 23:25
  • @AdamNeeley: The overhead of all the stuff that must happen, just to make a GPU execute a particular piece of code, not to speak of the act of transferring the data to it in the first place is so **HUGE** that for the bulk of matrix math happening in game engines it would slow down the program to a crawl if it actually was done that way. Here are a few numbers: The interface between a dedicated GPU and the CPU (current state of the art is 16× PCIe Gen4) is a "mere" 24GB/s. Setting up the transfer in the first place takes a couple of hundreds of nanoseconds. – datenwolf Nov 20 '21 at 23:27
  • @AdamNeeley: The memory access latencies purely CPU side (currently) are ~1ns on the L1 cache, L2 hits take about 5ns, L3 hits clock in at about 10ns. But the moment you hit the RAM it's already ~50ns. But that's still a lot quicker, than the full roundtrip going to the GPU. – GPUs are great if you have **a lot** of data that is very similar in nature and needs to be processed in a uniform way and where program execution on a "micro" level doesn't depend on the values at the single instruction level. – datenwolf Nov 20 '21 at 23:32
  • @AdamNeeley: As a practical example take a physics engine; assume each object's positions and momenta are stored as 4×3 matrices, and that at each step of the simulation a collision event may happen, which in turn might affect the execution of the simulation, for example break down a single object into a whole bunch of smaller ones; and in a larger simulation these decisions happen independently of all the other objects in the simulation (yes, cascading effects are propagated eventually). Doing a lot of back and forth between CPU and GPU would waste **HUGE** amounts of time. – datenwolf Nov 20 '21 at 23:38
  • @AdamNeeley: GPUs are perfectly suited for doing (rigid body) physics simulations, but we'd try to keep all the data on the GPU and try to control the execution of the simulation from within the GPU itself. – datenwolf Nov 20 '21 at 23:40
  • But GPUs being better at some things than a CPU is just an indisputable ontological truth. Otherwise, why would GPUs exist and be designed differently from CPUs? (Maybe they are being phased out?) From my understanding, one of their strengths is repetitive simple calculations such as those found in linear algebra. This does not mean that GPUs directly handle matrices; it could just mean linear algebra operations are a good thing to use them for. – Adam Neeley Nov 20 '21 at 23:41
  • Oops, I didn't see your comments until I posted that. Let me catch up real quick. – Adam Neeley Nov 20 '21 at 23:44
  • @AdamNeeley: Maybe an analogy helps: GPUs are like freight trains or huge cargo ships: it takes a lot of effort to load them with cargo (and at their destination to unload them) and get them going, but once they're in motion they're very good at moving a lot of stuff along the decided path to a chosen location. CPUs are like your single passenger car. Very good at moving a person (or 6) or a small amount of cargo between places. But they're terrible at moving a lot of stuff inside a given corridor. – datenwolf Nov 20 '21 at 23:46
  • Yes, that makes sense to me. I did previously mention that there was probably too much overhead in using the GPU to compute products for such sparse matrices. Sometimes matrices are really big, though. A GPU should be useful for those situations, or parallel processing at least. One last thing, could you please briefly explain the ideas behind how a FFT, kernel, and spectrum is used to interpolate data? – Adam Neeley Nov 21 '21 at 00:00
  • Yes that analogy makes sense. It reminds me of Terry Davis' metaphor of Linux = 18 wheeler and TempleOS = motorcycle. I think Windows was a sedan or something. – Adam Neeley Nov 21 '21 at 00:04
  • @AdamNeeley: I take it that you're familiar with how to calculate the convolution between two vectors *a* and *b*: in short, for every element of vector *a*, you take the inner product of the range around that element with the vector *b*. But this is just multiplying vector *a* with a matrix that has vector *b* drawn along the diagonal. The vector *b* is called the convolution kernel. The Convolution theorem tells us that the convolution of two vectors a and b is the same as taking the element-wise multiplication of the Fourier transforms of each vector, transformed back. – datenwolf Nov 21 '21 at 00:56
  • @AdamNeeley: Interpolation is just performing a convolution with an appropriately chosen convolution kernel. For example, for linear interpolation the kernel would be the triangular function with the target sample distance being the width of the triangle; nearest neighbor is done using the boxcar function. And the zero padding in Fourier space would correspond to a sin(x)/x kernel. (A short sketch of this appears after these comments.) – datenwolf Nov 21 '21 at 01:01
  • @AdamNeeley: Oh, and yes, GPUs are really well suited to deal with HUGE matrices. And by huge we're talking about the total number of nonzero elements. They're also really good at dealing with sparse matrices, as long as the number of nonzero elements is large enough (several million) to justify the (mostly constant) overhead of talking to the GPU in the first place. Out of curiosity I did a quick and dirty test against the RX6900 I have in my computer and the bottom line overhead just to get its gears moving is about 200µs (which is a small eternity). – datenwolf Nov 21 '21 at 01:07
  • Ooh, that helps a lot. I still don't fully understand, but now I recall what the kernel was in Laplace transformations (I wish my DE class had explained connections to linear algebra better). So the matrix would have to be something like 1000x1000 before even thinking about using a GPU, and even then, probably only for computations that are more complicated than linear interpolation. That seems reasonable to me. Thanks! – Adam Neeley Nov 21 '21 at 02:22
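A tiny NumPy illustration of the kernel idea from the comments above – linear interpolation as convolution of the zero-stuffed samples with a triangular kernel (the example vector is arbitrary):

```python
import numpy as np

a = np.array([1.0, 4.0, 2.0])

up = np.zeros(2 * a.size - 1)
up[::2] = a                                 # original samples with zeros in between
kernel = np.array([0.5, 1.0, 0.5])          # triangle kernel -> linear interpolation
lin = np.convolve(up, kernel, mode="same")  # [1.0, 2.5, 4.0, 3.0, 2.0]
# Same result as applying the (2m-1) x m matrix L from the question to a.
```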
0

Intel software architects use this type of interpolation, but there is a big BUT ... they use it only with the AVX512 instruction set. If your CPU has AVX512 and you know how to write low-level code with AVX intrinsics, this is the way to go. [Reference: Intel FLEXRAN 5G comm SDK, SRS channel estimation]