
I am implementing a fractal image compression algorithm for binary images in Haskell. For this purpose I have to find, for a given range block (a sub-image), the closest image in a so-called domain pool, a list of lists of images. I compare images by calculating the sum of squared differences of their pixel values.
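
As a simplified illustration of the matching step on plain pixel lists (just a sketch with made-up names; my actual code below works on HIP images):

import Data.List (minimumBy)
import Data.Ord (comparing)

-- sum of squared pixel differences between two equal-length blocks
ssd :: [Int] -> [Int] -> Int
ssd a b = sum $ zipWith (\x y -> (x - y) ^ 2) a b

-- index of the pool entry closest to the given range block
closestIndex :: [Int] -> [[Int]] -> Int
closestIndex block pool =
    fst $ minimumBy (comparing snd) (zip [0 ..] (map (ssd block) pool))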

I use the Haskell Image Processing (HIP) library for reading and writing images.

compress :: Image VS X Bit -> Int -> [(Int, Int)]
compress img blockSize = zip dIndices tIndices
    where rImg = img
          dImg = downsample2 rImg
          rBlocks = (toBlocks rImg blockSize) :: [Image VS X Bit]
          dBlocks = (toBlocks dImg blockSize) :: [Image VS X Bit]
          dPool = (createDPool dBlocks) :: [[Image VS X Bit]]
          distanceLists = map (\x -> (map.map) (distance x) dPool) rBlocks
          dIndices = map (fst . getMinIndices) distanceLists
          tIndices = map (snd . getMinIndices) distanceLists


distance :: Image VS X Bit -> Image VS X Bit -> Int
distance x y = sumSquareDifference (toBinList x) (toBinList y)
    where toBinList = map (toNum . extractBitOfPixel) . concat . toLists

toLists :: MArray arr cs e => Image arr cs e -> [[Pixel cs e]]
toLists img = [[index img (i, j) | j <- [0 .. cols img - 1]] | i <- [0 .. rows img - 1]]

extractBitOfPixel :: Pixel X Bit -> Bit
extractBitOfPixel (PixelX b) = b

sumSquareDifference :: [Int] -> [Int] -> Int
sumSquareDifference a b = sum $ zipWith (\x y -> (x-y)^2) a b
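
(The helpers toBlocks, createDPool, toNum and getMinIndices are omitted for brevity. getMinIndices returns the (outer, inner) position of the smallest distance in a nested distance list; a stripped-down version of it looks roughly like this:)

getMinIndices :: [[Int]] -> (Int, Int)
getMinIndices xss = snd $ minimum
    [ (d, (i, j)) | (i, xs) <- zip [0 ..] xss, (j, d) <- zip [0 ..] xs ]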

The performance of this code is really bad. Compressing a 256x256 image with a block size of 2 takes around 5 minutes despite compiling with -O2. Profiling shows me that most of the runtime is spent in the function distance, especially in sumSquareDifference, but also in toLists and toBinList:

       binaryCompressionSimple +RTS -p -RTS

    total time  =     1430.89 secs   (1430893 ticks @ 1000 us, 1 processor)
    total alloc = 609,573,757,744 bytes  (excludes profiling overheads)

COST CENTRE               MODULE    SRC                                        %time %alloc

sumSquareDifference       Main      binaryCompressionSimple.hs:87:1-63          30.9   28.3
toLists                   Main      binaryCompressionSimple.hs:66:1-90          20.3   47.0
distance.toBinList        Main      binaryCompressionSimple.hs:74:11-79         10.9   15.1
main                      Main      binaryCompressionSimple.hs:(14,1)-(24,21)    7.3    0.0
compress                  Main      binaryCompressionSimple.hs:(28,1)-(36,60)    6.9    0.0
distance                  Main      binaryCompressionSimple.hs:(71,1)-(74,79)    5.7    0.9
compress.distanceLists.\  Main      binaryCompressionSimple.hs:34:38-65          5.2    4.4
compress.distanceLists    Main      binaryCompressionSimple.hs:34:11-74          2.8    0.0
main.\                    Main      binaryCompressionSimple.hs:20:72-128         2.7    0.0
getMinIndices.getMinIndex Main      binaryCompressionSimple.hs:116:11-53         2.7    1.8
sumSquareDifference.\     Main      binaryCompressionSimple.hs:87:52-58          2.7    2.5

Is there a way to improve performance?

A block size of 2 means comparing each of 16384 range blocks with 131072 images from the domain pool, so sumSquareDifference will be called (16384 * 131072 =) 2147483648 times, each time computing the sum of squared differences of two lists of length 4. I realize this is a large number, but shouldn't the code be faster anyway (lazy evaluation of lists)? Is this a Haskell problem or an algorithm problem?
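
Spelled out (taking the domain-pool size as given, since createDPool is not shown above):

-- sanity check of the call count quoted above
distanceCalls :: Int
distanceCalls = rangeBlocks * domainPoolSize
  where rangeBlocks    = (256 `div` 2) ^ 2   -- 16384 blocks of 2x2 pixels
        domainPoolSize = 131072              -- images in the domain pool
-- distanceCalls == 2147483648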

Edit:

I was able to at least improve the performance by a third by using:

distance :: Image VS X Bit -> Image VS X Bit -> Int
distance x y
    | x == y = 0
    | otherwise = sumSquareDifference (toBinList x) (toBinList y)
    where toBinList = map (toNum . extractBitOfPixel) . concat . inlinedToLists

Edit 2:

I was able to increase the performance enormously by computing the distance lists with the function genDistanceList, which stops the calculation as soon as two identical images (distance 0) are found:

genDistanceList :: [[Image VS X Bit]] -> Image VS X Bit -> [[Int]]
genDistanceList dPool rBlock = nestedTakeWhileInclusive (/= 0) $ (map.map) (distance rBlock) dPool
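
A possible definition of nestedTakeWhileInclusive (the version I actually use is omitted here) keeps every distance up to and including the first one for which the predicate fails, and then stops the outer list as well:

-- keep elements while the predicate holds, plus the first one that fails
takeWhileInclusive :: (a -> Bool) -> [a] -> [a]
takeWhileInclusive _ []     = []
takeWhileInclusive p (x:xs)
    | p x       = x : takeWhileInclusive p xs
    | otherwise = [x]

-- stop the outer list as soon as an inner list contained a failing element
nestedTakeWhileInclusive :: (a -> Bool) -> [[a]] -> [[a]]
nestedTakeWhileInclusive _ [] = []
nestedTakeWhileInclusive p (xs:xss)
    | all p xs  = xs : nestedTakeWhileInclusive p xss
    | otherwise = [takeWhileInclusive p xs]
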
  • “Is this a Haskell problem or an algorithm problem?” Probably both. I think it's a bit of a broad question, could you focus it some more? – leftaroundabout Oct 07 '20 at 22:54
  • @leftaroundabout My main question is, what are the possibilities to improve the performance? I will probably not be able to change anything about the algorithm. – CifarettoWiggum Oct 07 '20 at 22:57
    re: "shouldn't the code be faster anyway (lazy evaluating of lists)?" Lazy evaluation can't do much when your result is a sum -- you can't know the value of the sum until you've visited all the values being summed, so laziness doesn't get to omit any calculations. – Daniel Wagner Oct 07 '20 at 23:35
  • ...but I suspect you can make a better algorithm here. Since your pixels are bits, your "sumSquareDifference" is actually just counting how many bits differ. It should be easy to make that more efficient than converting to an `Int`, subtracting, and squaring as a first pass; and skipping the rest of an image once that count goes over the current minimum is another obvious first step towards optimization. – Daniel Wagner Oct 07 '20 at 23:39
  • @DanielWagner In dPool as well as in rBlocks there will probably be many duplicate images. Shouldn't Haskell be able to remember that it has already calculated the distance between two images and use the return value of the last calculation? – CifarettoWiggum Oct 08 '20 at 00:05
  • @DanielWagner Since I will switch from binary images to grayscale images after I have optimized the performance, I will need `Int` again at the latest. – CifarettoWiggum Oct 08 '20 at 00:08
  • No, Haskell will not magically remember all the previously calculated function calls. In general, that would use tons of memory and be very inefficient. If you have loads of duplicates, maybe you should avoid having loads of duplicates? But I imagine duplicates are less common in grayscale. – dfeuer Oct 08 '20 at 00:40
  • @CifarettoWiggum Another optimization idea would be something like this: chunk pixels into groups of size n (perhaps by taking n=m^2 and using mxm-pixel squares, but any chunking should do). For each group, precompute the minimum and maximum pixel value (in grayscale). This will let you compute upper and lower bounds of the squared distance between any given pair of images n/2 times faster than the exact squared distance. This can be used to ignore pairs whose lower bound is bigger than the best upper bound; if this is a lot of the pairs, this could be a win. – Daniel Wagner Oct 08 '20 at 03:31
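
Following up on the comment about counting differing bits: for binary images the sum of squared differences is just the number of differing bits, so if each block's pixels were packed into a machine word, one comparison would reduce to a single xor plus a popCount. A minimal sketch of that idea (this packed representation is not what the code above uses, and packBits/bitDistance64 are made-up names):

import Data.Bits (popCount, setBit, xor)
import Data.Word (Word64)

-- pack up to 64 binary pixels (a 2x2 block is only 4 bits) into one word
packBits :: [Int] -> Word64
packBits bits = foldr set 0 (zip [0 ..] bits)
  where set (i, 1) acc = setBit acc i
        set _      acc = acc

-- for 0/1 pixels, the sum of squared differences equals the number of differing bits
bitDistance64 :: Word64 -> Word64 -> Int
bitDistance64 a b = popCount (a `xor` b)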

1 Answer


The absolute first thing to try is skipping the conversion to lists:

{-# INLINE numIndex #-}
numIndex :: Image VS X Bit -> (Int, Int) -> Int
numIndex img pos = toNum . extractBitOfPixel $ index img pos

distance :: Image VS X Bit -> Image VS X Bit -> Int
distance a b = sum
    [ (numIndex a pos - numIndex b pos)^2
    | i <- [0 .. rows a - 1]
    , j <- [0 .. cols a - 1]
    , let pos = (i, j)
    ]

Since you haven't provided us with a minimal reproducible example, it's impossible to tell what effect, if any, that would have. If you want better advice, provide better data.

EDIT

Looking through the haddocks for hip, I suspect the following will be even better still:

distance :: Image VS X Bit -> Image VS X Bit -> Int
distance a b = id
    . getX
    . fold (+) 0
    $ zipWith bitDistance a b

bitDistance :: Pixel X Bit -> Pixel X Bit -> Pixel X Int
bitDistance (PixelX a) (PixelX b) = PixelX (fromIntegral (a-b))
-- use (a-b)^2 when you switch to grayscale, but for Bit the squaring isn't needed

Here, the fold and zipWith are the ones provided by hip, not base.
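
One way to make that explicit in the source is a qualified import, so hip's versions don't clash with the Prelude ones (a sketch, assuming fold and zipWith come from the top-level Graphics.Image module and the remaining names are in scope as above):

import qualified Graphics.Image as I

distance :: Image VS X Bit -> Image VS X Bit -> Int
distance a b = getX . I.fold (+) 0 $ I.zipWith bitDistance a b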

Daniel Wagner