
I'm playing around with writing a JPEG decoder in pure Haskell (a chance to learn repa for me!), and right now I'm working on the IDCT (which amounts to multiplying a bunch of 8×8 matrices by a fixed 8×8 matrix). The function idctBlocks at the end of the post takes a vector of Word16s as well as blocksCount (the count of 8×8 blocks) and compsCount (the number of components, pardon my terminology, in each block; for an image encoded in YCbCr it would be 3) and produces some value based on the results of the IDCTs. The total size of the vector is then blocksCount * compsCount * 8 * 8 elements.

Now, I also have a (quite big) test image with about 320k such blocks, each having 3 components, giving about a million 8×8 matrices total. How long would it take to perform the IDCT on all of them without anything too smart? My back-of-the-napkin calculation says about "30-250 ms of single-core time":

  1. For each matrix, to compute a single resulting element in the matrix, we need to do 8 multiplications and 7 additions. For a really pessimistic bound, let's say we don't do any SIMD stuff. Reciprocal throughput of all the relevant instructions is 1 on my CPU, so it takes 15 cycles (well, there's also instruction latency, but that shouldn't be a factor, especially if we're talking about doing this lots of times).
  2. To compute each resulting matrix, we need to compute 64 resulting elements, so that's 64×15, or about 1 thousand cycles per resulting matrix.
  3. And we need to compute about 1 million matrices, giving about 1 billion cycles.
  4. My CPU runs at 4 GHz (and the task is very cache-friendly, so it shouldn't be limited by memory), so it should take about a quarter of a second, hence the worst estimate of 250 ms.
  5. Now, if the compiler is smart enough, it surely will use SIMD instructions for vector multiplication and for the horizontal sum, reducing 15 cycles per element to about two (well, there's also the loading and packing part, but the IDCT matrix can wholly reside in eight XMM registers out of the 16 my CPU has, so there's no need to load that, and there's also plenty of room for instruction-level parallelism, so we could probably disregard that for the purposes of this estimate). So, that's a speedup of about 8 times, giving the optimistic bound of 250 / 8 ≈ 30 ms.
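
Spelled out, the estimate arithmetic is just the following (the cycle counts and clock speed are, of course, my assumptions about my own machine, not measurements):

```haskell
-- Back-of-the-napkin IDCT cost estimate; every constant here is an
-- assumption from the list above, not a measurement.
worstCaseMs :: Double
worstCaseMs = fromIntegral totalCycles / hz * 1000
  where
    cyclesPerElem = 8 + 7           -- 8 multiplications + 7 additions
    cyclesPerMat  = 64 * cyclesPerElem
    matCount      = 320000 * 3      -- ~1M 8x8 matrices
    totalCycles   = cyclesPerMat * matCount :: Int
    hz            = 4.0e9           -- 4 GHz

main :: IO ()
main = print worstCaseMs  -- ≈ 230, i.e. the ~250 ms pessimistic bound
```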

Given all of the above, I was somewhat dumbfounded to see the function below take 2 seconds. What's more interesting, it isn't sensitive at all to whether the elements are traversed in row order or column order: replacing

        arrSlice =  R.unsafeSlice arr     (sh  :. x :. All)
        idctSlice = R.unsafeSlice idctMat (Any :. y :. All)

with

        arrSlice =  R.unsafeSlice arr     (sh  :. All :. x)
        idctSlice = R.unsafeSlice idctMat (Any :. All :. y)

just to see how the run time changes shows that it does not change at all (and I think this alone shows that something is very wrong).

Oh, and if I remove the final R.sumAllS (which I had in my pre-repa days to ensure everything gets evaluated) and replace it with, say, idct blockMats `R.deepSeqArray` 0, the run time does not change in any significant way (well, it drops by about 5%, but that's insignificant).

So, finally, here's the module:

{-# LANGUAGE OverloadedLists, TypeOperators, FlexibleContexts, TypeFamilies #-}
{-# LANGUAGE Strict #-}
{-# OPTIONS_GHC -optlo-O3 #-}

module Data.Image.Jpeg.IDCT(idctBlocks, zigzag) where

import qualified Data.Array.Repa as R
import qualified Data.Array.Repa.Eval as R
import qualified Data.Array.Repa.Unsafe as R
import qualified Data.Vector.Unboxed as V
import Control.Monad.Identity
import Data.Array.Repa (Array, All(..), Any(..), U, Z(..), (:.)(..))
import GHC.Word

zigzag :: V.Vector Int
zigzag =
  [
  0,  1,  5,  6,  14, 15, 27, 28,
  2,  4,  7,  13, 16, 26, 29, 42,
  3,  8,  12, 17, 25, 30, 41, 43,
  9,  11, 18, 24, 31, 40, 44, 53,
  10, 19, 23, 32, 39, 45, 52, 54,
  20, 22, 33, 38, 46, 51, 55, 60,
  21, 34, 37, 47, 50, 56, 59, 61,
  35, 36, 48, 49, 57, 58, 62, 63
  ]

computeP :: (R.Load r sh a, V.Unbox a) => Array r sh a -> Array U sh a
computeP = runIdentity . R.computeP
{-# INLINE computeP #-}

unZigzagify :: (Z :. Int :. Int :. Int) -> (Z :. Int :. Int :. Int)
unZigzagify (_ :. b :. c :. p) = Z :. b :. c :. (zigzag `V.unsafeIndex` p)
{-# INLINE unZigzagify #-}

idctMat :: Array U R.DIM2 Float
idctMat = R.computeUnboxedS $ R.transpose $ R.fromFunction (Z :. 8 :. 8) (\(_ :. r :. c) -> point r (c + 1))
  where
    point 0 _ = sqrt $ 1 / 8
    point r c = sqrt (2 / 8) * cos (pi * (2 * c' - 1) * r' / (2 * 8))
      where
        r' = fromIntegral r
        c' = fromIntegral c

idct :: R.Source r Float
     => Array r R.DIM4 Float
     -> Array U R.DIM4 Float
idct arr = computeP $ R.fromFunction dims f
  where
    dims = R.extent arr
    f (sh :. x :. y) = R.sumAllS (R.zipWith (*) arrSlice idctSlice)
      where
        arrSlice =  R.unsafeSlice arr     (sh  :. x :. All)
        idctSlice = R.unsafeSlice idctMat (Any :. y :. All)
{-# INLINE idct #-}

idctBlocks :: Int -> Int -> V.Vector Word16 -> Float
idctBlocks blocksCount compsCount blocks = R.sumAllS $ idct blockMats
  where
    flatExtent = Z :. blocksCount :. compsCount :. 64 :: R.DIM3
    matExtent = Z :. blocksCount :. compsCount :. 8 :. 8 :: R.DIM4

    reparr = R.fromUnboxed flatExtent blocks
    blockMats = R.map fromIntegral $ R.reshape matExtent $ R.computeUnboxedS $ R.backpermute flatExtent unZigzagify reparr
{-# NOINLINE idctBlocks #-}

I have -O2 -fllvm enabled for my whole project, so those options apply to this file as well. And, of course, I might have mixed up the IDCT matrix with its transpose, but that shouldn't affect the performance.

So, what could I do differently, and how could this be improved?
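
For reference, the sort of "nothing too smart" baseline I have in mind when making the estimate would look roughly like this (an untested sketch; idctMatV and idctBlockNaive are names I just made up, and idctMatV is a flat, row-major copy of the idctMat above):

```haskell
import qualified Data.Vector.Unboxed as V

-- Flat, row-major copy of the (transposed) IDCT matrix from idctMat above;
-- the element at row r, column c is point c (r + 1), mirroring R.transpose.
idctMatV :: V.Vector Float
idctMatV = V.generate 64 $ \i ->
  let (r, c) = i `divMod` 8 in point c (r + 1)
  where
    point 0 _ = sqrt (1 / 8)
    point r k = sqrt (2 / 8) * cos (pi * (2 * k' - 1) * r' / 16)
      where
        r' = fromIntegral r
        k' = fromIntegral k

-- IDCT of one 8x8 block stored as 64 consecutive elements; computes the
-- same dot products as f in idct below, but with a plain inner loop.
idctBlockNaive :: V.Vector Float -> V.Vector Float
idctBlockNaive blk = V.generate 64 $ \i ->
  let (x, y) = i `divMod` 8
  in sum [ blk `V.unsafeIndex` (x * 8 + k)
             * idctMatV `V.unsafeIndex` (y * 8 + k)
         | k <- [0 .. 7] ]
```

Counting operations, that's exactly the 8 multiplications and 7 additions per element from the estimate, so if this (or the repa version) compiled down to a tight loop, I'd expect it to land in the 30-250 ms window.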

By the way, using the typical repa-recommended matrix multiplication of the form

matMul :: (R.Source r1 Float, R.Source r2 Float, R.Shape sh)
       => Array r1 (sh :. Int :. Int) Float
       -> Array r2 (sh :. Int :. Int) Float
       -> Array U  (sh :. Int :. Int) Float
matMul a b = runIdentity $ R.sumP (R.zipWith (*) a' b')
  where
    a' = R.extend (Any :. All :. (8 :: Int)   :. All) a
    b' = R.extend (Any :. (8 :: Int)   :. All :. All) b
{-# INLINE matMul #-}

along with a properly R.extended idctMat doesn't help: in fact, performance degrades even further, to something closer to 30 seconds.
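
(By "properly R.extended" I mean broadcasting idctMat to the full DIM4 extent so matMul's shapes line up; assuming the module's imports, it was along these lines, with extendedIdct being just an illustrative name:)

```haskell
-- Broadcast the fixed 8x8 IDCT matrix across all blocks and components,
-- giving it the same DIM4 extent as the array of blocks fed to matMul.
extendedIdct :: Int -> Int -> Array R.D R.DIM4 Float
extendedIdct blocksCount compsCount =
  R.extend (Any :. blocksCount :. compsCount :. All :. All) idctMat
```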

0xd34df00d
  • have you tried the `massiv` library? I think it's more up-to-date – lsmor Apr 13 '21 at 09:39
  • @lsmor yep, I've been advised to use it after I posted my question, and although the performance is slightly better (think 1.5 seconds instead of 2 seconds), it's still far from my estimates. – 0xd34df00d Apr 13 '21 at 15:55

0 Answers