How to enable parallelism in the HIP library

Question

I wrote a Mandelbrot set generator in Haskell using the HIP library. Now I am trying to parallelize it. Unfortunately, running the following code on one core and then on six cores seems to make no difference to performance.

Main.hs

module Main where
import Data.Complex
import Graphics.Image (RPU (RPU), writeImage, makeImageR, Pixel(PixelRGB), Image)
import Graphics.Image.ColorSpace (RGB)
import Debug.Trace (trace)

main :: IO ()
main = do
    let image = makeImageR RPU (height, width) mandelbrotGenerator
    writeImage  "target.jpg" image
    where mandelbrotGenerator :: (Int, Int) -> Pixel RGB Double
          mandelbrotGenerator (x, y) = let reStart = -2
                                           reEnd   =  2
                                           imStart = -2
                                           imEnd   =  2
                                           x' = fromIntegral x
                                           y' = fromIntegral y
                                           width' = fromIntegral width
                                           height' = fromIntegral height
                                           c = (reStart + (x'/width')*(reEnd-reStart)) :+ (imStart + (y'/height')*(imEnd-imStart))
                                        in plotd (mandelbrot c 80)
                                     where 
                                        plotd r | r < 2 = PixelRGB 255 0 0
                                                | otherwise = PixelRGB 0 0 255
          height = 10000
          width  = 10000
          mandelbrot c iter = realPart $ abs $ iterate (\z -> z^2 + c) (0 :+ 0) !! iter

I thought that, with HIP, passing the RPU parameter and building with the GHC flags -O2 -threaded would be enough.

Here are the runtime statistics with the +RTS -s -N6 flags:

1,378,695,196,456 bytes allocated in the heap
     486,041,624 bytes copied during GC
   4,800,312,192 bytes maximum residency (4 sample(s))
         412,800 bytes maximum slop
            6896 MiB total memory in use (1 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     87335 colls, 87335 par   34.261s  23.647s     0.0003s    0.0052s
  Gen  1         4 colls,     3 par    0.016s   0.015s     0.0037s    0.0127s

  Parallel GC work balance: 46.93% (serial 0%, perfect 100%)

  TASKS: 14 (1 bound, 13 peak workers (13 total), using -N6)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.003s  (  0.002s elapsed)
  MUT     time  942.718s  (210.764s elapsed)
  GC      time   34.277s  ( 23.662s elapsed)
  EXIT    time    0.001s  (  0.002s elapsed)
  Total   time  976.999s  (234.430s elapsed)

  Alloc rate    1,462,468,617 bytes per MUT second

  Productivity  96.5% of total user, 89.9% of total elapsed

and here with only +RTS -s:

1,378,694,953,992 bytes allocated in the heap
     415,624,768 bytes copied during GC
   4,800,096,768 bytes maximum residency (4 sample(s))
         366,080 bytes maximum slop
            6873 MiB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     328664 colls,     0 par    1.885s   2.125s     0.0000s    0.0005s
  Gen  1         4 colls,     0 par    0.013s   0.013s     0.0033s    0.0123s

  TASKS: 4 (1 bound, 3 peak workers (3 total), using -N1)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.001s  (  0.000s elapsed)
  MUT     time  180.651s  (180.173s elapsed)
  GC      time    1.898s  (  2.138s elapsed)
  EXIT    time    0.000s  (  0.009s elapsed)
  Total   time  182.550s  (182.321s elapsed)

  Alloc rate    7,631,819,383 bytes per MUT second

  Productivity  99.0% of total user, 98.8% of total elapsed

EDIT: I noticed that this issue comes from Repa itself, not from HIP. I changed the code to the following:

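-- Same imports as Main.hs above, plus Repa for Z, (:.), DIM2, U, Array, computeP and fromFunction.
import Data.Array.Repa

main :: IO ()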
main = do
   image <- (computeP $ fromFunction (Z:.width:.height) mandelbrotGenerator :: IO (Array U DIM2 (Pixel RGB Double)))
   putStrLn "DONE"
--   writeImage  "target.jpg" $ fromRepaArrayS image
       where mandelbrotGenerator :: DIM2 -> Pixel RGB Double
             mandelbrotGenerator (Z:.x:.y) = let reStart = -2
                                                 reEnd   =  2
                                                 imStart = -2
                                                 imEnd   =  2
                                                 x' = fromIntegral x
                                                 y' = fromIntegral y
                                                 width' = fromIntegral width
                                                 height' = fromIntegral height
                                                 c = (reStart + (x'/width')*(reEnd-reStart)) :+ (imStart + (y'/height')*(imEnd-imStart))
                                              in plotd (mandelbrot c 80)
                                           where 
                                           plotd r | r < 2 = PixelRGB 255 0 0
                                                   | otherwise = PixelRGB 0 0 255
             height = 10000
             width  = 10000
             mandelbrot c iter = realPart $ abs $ iterate (\z -> z^2 + c) (0 :+ 0) !! iter

and the execution times are very similar on both the one-core and six-core setups.

What part of your code do you expect will make this run in parallel? Adding cores won't automatically make your code execute in a parallel way; you have to explicitly program for parallelism. — Louis Wasserman, May 09 '23 at 22:40
Doesn't the RPU parameter to makeImageR make the repa backend use computeS when calculating the value of pixels? — superstate, May 11 '23 at 18:27
I changed the main fragment to writeImage "target.jpg" $ fromRepaArrayP $ fromFunction (Z:.width:.height) mandelbrotGenerator and there is still no performance improvement. — superstate, May 11 '23 at 18:55
My guess would be that the chunks of computation are small enough that the overhead of parallelism isn't worth it. Perhaps you could try computing entire rows (or columns, your choice) sequentially, but the other dimension in parallel, to increase the work load per "thread". But I have not tested this hypothesis in any way. — Daniel Wagner, May 12 '23 at 21:51
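
The coarser split suggested in the last comment (whole rows computed sequentially, rows spread over threads) could be tried along the following lines. This is only a sketch under several assumptions: it uses forkIO and MVar from base instead of HIP or Repa, it counts in-set pixels per row rather than building an image, and it swaps the asker's iterate ... !! iter for a strict escape-time loop. The names inSet, rowCount, chunksOf and parallelRows are made up for illustration.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import Control.Exception (evaluate)
import Data.List (foldl')

-- Escape-time test on plain Doubles (an assumption, not the asker's function):
-- True when |z| stays below 2 for maxIter iterations of z -> z^2 + c.
inSet :: Int -> Double -> Double -> Bool
inSet maxIter cr ci = go 0 0 0
  where
    go :: Int -> Double -> Double -> Bool
    go i zr zi
      | zr * zr + zi * zi >= 4 = False
      | i >= maxIter = True
      | otherwise = go (i + 1) (zr * zr - zi * zi + cr) (2 * zr * zi + ci)

-- Number of in-set pixels in row y of a size x size grid over [-2, 2) x [-2, 2).
-- The whole row is computed sequentially with a strict fold.
rowCount :: Int -> Int -> Int -> Int
rowCount size maxIter y = foldl' step 0 [0 .. size - 1]
  where
    coord :: Int -> Double
    coord k = -2 + 4 * fromIntegral k / fromIntegral size
    ci = coord y
    step acc x = if inSet maxIter (coord x) ci then acc + 1 else acc

-- Split a list into consecutive pieces of at most n elements (n >= 1).
chunksOf :: Int -> [a] -> [[a]]
chunksOf _ [] = []
chunksOf n xs = let (h, t) = splitAt n xs in h : chunksOf n t

-- Give each worker thread one contiguous block of rows and collect the
-- per-row counts in row order.
parallelRows :: Int -> Int -> Int -> IO [Int]
parallelRows workers size maxIter = do
  let w = max 1 workers
      chunkSize = max 1 ((size + w - 1) `div` w)
  vars <- mapM spawn (chunksOf chunkSize [0 .. size - 1])
  concat <$> mapM takeMVar vars
  where
    spawn chunk = do
      var <- newEmptyMVar
      _ <- forkIO $ do
        -- evaluate forces each row's count inside the worker thread,
        -- so the work is not deferred to whoever reads the MVar.
        counts <- mapM (evaluate . rowCount size maxIter) chunk
        putMVar var counts
      pure var
```

Built with -threaded and run with +RTS -N4, something like parallelRows 4 10000 80 would put a quarter of the rows on each thread. Contiguous blocks are the simplest split but not the best balanced one, since rows through the middle of the set cost more than rows near the edges.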

score 1 · Accepted Answer · answered May 12 '23 at 22:41

I can't replicate your problem. On my 8-core, 16-thread laptop (with an Intel i9-9980HK), the runtime of your REPA version compiled under Stack LTS-20.20 (GHC 9.2.7 with flags -O2 -threaded) is:

+RTS flags	elapsed time (sec)
-s -N1	121
-s -N2	71
-s -N3	52
-s -N4	44
-s -N6	46
-s -N8	63
-s -N16	75

so while there are some diminishing returns, there are definite gains when using 2-4 cores.
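
To put numbers on the diminishing returns, the table can be turned into speedup (time at -N1 divided by time at -Nn) and efficiency (speedup divided by n). A small sketch, with the timings copied from the table above and all names chosen just for illustration:

```haskell
-- (value passed to -N, elapsed seconds), copied from the table above.
timings :: [(Int, Double)]
timings = [(1, 121), (2, 71), (3, 52), (4, 44), (6, 46), (8, 63), (16, 75)]

-- Elapsed time of the -N1 run, used as the baseline.
baseline :: Double
baseline = maybe (error "no -N1 entry") id (lookup 1 timings)

-- Speedup relative to -N1: T(1) / T(n).
speedup :: Double -> Double
speedup t = baseline / t

-- Parallel efficiency: speedup divided by the number of capabilities.
efficiency :: Int -> Double -> Double
efficiency n t = speedup t / fromIntegral n

-- One (n, speedup, efficiency) row per table entry.
report :: [(Int, Double, Double)]
report = [(n, speedup t, efficiency n t) | (n, t) <- timings]
```

In GHCi, mapM_ print report lists the rows. The best case, -N4, is 121/44 = 2.75 times faster than -N1, which is about 69% efficiency, and by -N16 the speedup has fallen back to roughly 1.6.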

Maybe post the statistics output for the REPA version, try a few different -N values, and consider building with a newer version of GHC and the various packages to see if that makes a difference.
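
If rerunning the binary for every -N value gets tedious, one option is to sweep the capability count from inside the program. The sketch below is an assumption on my part rather than anything HIP or Repa provides: it uses only base (setNumCapabilities, getMonotonicTime), the helper names are hypothetical, and in-process numbers may not match separate runs exactly. It also reports whether the binary was really linked with -threaded, which is worth ruling out first.

```haskell
import Control.Concurrent
  ( getNumCapabilities
  , rtsSupportsBoundThreads
  , setNumCapabilities
  )
import GHC.Clock (getMonotonicTime)
import GHC.Conc (getNumProcessors)

-- Summarise what the runtime thinks it has to work with.
-- rtsSupportsBoundThreads is True only when linked with -threaded.
describeRts :: IO String
describeRts = do
  caps <- getNumCapabilities
  procs <- getNumProcessors
  pure $
    "threaded RTS: " ++ show rtsSupportsBoundThreads
      ++ ", capabilities: " ++ show caps
      ++ ", processors: " ++ show procs

-- Run an action with n capabilities (n >= 1) and return the elapsed
-- wall-clock seconds. The action must force its own result, for example
-- by ending in computeP or evaluate, or nothing meaningful is timed.
timeWithCaps :: Int -> IO a -> IO Double
timeWithCaps n action = do
  old <- getNumCapabilities
  setNumCapabilities n
  start <- getMonotonicTime
  _ <- action
  end <- getMonotonicTime
  setNumCapabilities old
  pure (end - start)

-- Time the same action once for each capability count in the list.
sweep :: [Int] -> IO a -> IO [(Int, Double)]
sweep ns action = mapM (\n -> (,) n <$> timeWithCaps n action) ns
```

For the Repa version this might be called as sweep [1, 2, 3, 4, 6] renderImage, where renderImage stands for whatever IO action runs computeP. The binary still has to be built with -threaded for counts above 1 to have any effect.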

I tried running with different -N values. I noticed that only -N2 gave a performance increase (173s versus 189s). I am surprised that Repa has such parallelism overhead. I used GHC 9.2.4 and have an i7-8750H. — superstate, May 13 '23 at 20:45
