What is the fastest way to draw a rectangle in a REPA array?

Question

Usually, a good method of drawing a rectangle in a bitmap is looping through the 2 bounding dimensions and setting individual pixels. For example, on pseudocode:

drawRect(array, color, x, X, y, Y):
    for x from x til X:
        for y from y til Y:
            array[x,y] = color

What is the equivalent on Haskell's REPA?

How about [`fromFunction`](https://hackage.haskell.org/package/repa-3.3.1.2/docs/Data-Array-Repa.html#v:fromFunction)? For the 2D case, it would be something like `drawRect x y colour = fromFunction (Z :. y :. x) (const colour)`. Unless you want to actually update an array (probably a bad idea), then use `\colour -> map (const colour)`. — user2407038, Jan 17 '15 at 22:03
The problem is that I don't want to check every pixel of a 1024x1024 image to draw a small circle on pos (600, 600). — MaiaVictor, Jan 18 '15 at 00:12
I believe you will have to use fairly low level operations if you are really worried about iterating over each pixel. The simple way is to create the small image, append arrays filled with zeroes to get the proper positioning, then add them together. The not-so-simple way is use `loadP/S` to write the "large" image, then `loadRangeP/S` to write the small "image". Your pseudocode seems to assume mutation but the point of a high level library like REPA is to *not* expose the underlying mutation to the user. Maybe it would help if you gave the actual problem you are trying to solve. — user2407038, Jan 18 '15 at 01:06

score 2 · Accepted Answer · answered Jan 18 '15 at 02:27

Of the ordinary REPA mechanisms for making new arrays, making an new delayed array is fastest when copying an array once to foreign memory. Actual performance using REPA will depend on what you do with your array.

Let's define the type of a computation that depends only on the position in an array and the current value at that position.

{-# LANGUAGE ScopedTypeVariables #-}

import Data.Array.Repa hiding ((++))
import Data.Array.Repa.Repr.ForeignPtr

import Data.Word
import Control.Monad
import Data.Time.Clock
import System.Mem

type Ghost sh a b = sh -> a -> b

We can define filling in any shape.

fill :: Shape sh => sh -> sh -> a -> Ghost sh a a
fill from to color = go
    where
        {-# INLINE go #-}
        go sh a =
            if inShapeRange from to sh
            then color
            else a

We will use this with three different ways to define a new array - a delayed array, a structured traversal, and an unstructured traversal.

The simplest delaying is fromFunction.

ghostD :: (Shape sh, Source r a) => Ghost sh a b -> Array r sh a -> Array D sh b
ghostD g a = fromFunction (extent a) go
    where
        {-# INLINE go #-}
        go sh = g sh (a ! sh)

A structured traversal can take advantage of knowing the structure of the underlying array representation. Unfortunately, the only way we can get information about the position in a structured traversal is by zipping using szipWith with an array that somehow already contains the position information.

ghostS :: (Shape sh, Structured r1 a b, Source r1 a) => Ghost sh a b -> Array r1 sh a -> Array (TR r1) sh b
ghostS g a = szipWith ($) ghost a
    where
        ghost = fromFunction (extent a) g

An unstructured traversal is very similar to a delayed array built by fromFunction; it also returns an Array D.

ghostT :: (Shape sh, Source r a) => Ghost sh a b -> Array r sh a -> Array D sh b
ghostT g a = traverse a id go
    where
        {-# INLINE go #-}
        go lookup sh = g sh (lookup sh)

With some extremely naive benchmarking, we can run these and see how fast they are. We perform garbage collection before measuring the time to try to get reliable timing results. We'll have two benchmarks. For each mechanism,. we'll run a single step writing the result to memory 10 times. Then we'll compose 101 identical steps writing the result to memory once.

bench :: Int -> String -> IO a -> IO ()
bench n name action = do
    performGC
    start <- getCurrentTime
    replicateM_ n action    
    performGC
    end <- getCurrentTime
    putStrLn $ name ++ " " ++ (show (diffUTCTime end start / fromIntegral n))

iterN :: Int -> (a -> a) -> (a -> a)
iterN 0 f = id
iterN n f = f . iterN (n-1) f

main = do
    (img :: Array F DIM2 Word32) <- computeP (fromFunction (Z :. 1024 :. 1024 ) (const minBound))
    let (Z :. x :. y ) = extent img
        drawBox = fill (Z :. 20 :. 20 ) (Z :. x - 20 :. y - 20 ) maxBound

    bench 10 "Delayed      10x1" ((computeP $ ghostD drawBox img) :: IO (Array F DIM2 Word32))
    bench 10 "Unstructured 10x1" ((computeP $ ghostT drawBox img) :: IO (Array F DIM2 Word32))
    bench 10 "Structured   10x1" ((computeP $ ghostS drawBox img) :: IO (Array F DIM2 Word32))

    bench 1 "Delayed      1x101" ((computeP $ (iterN 100 (ghostD drawBox)) . ghostD drawBox $ img) :: IO (Array F DIM2 Word32))
    bench 1 "Unstructured 1x101" ((computeP $ (iterN 100 (ghostT drawBox)) . ghostT drawBox $ img) :: IO (Array F DIM2 Word32))
    bench 1 "Structured   1x101" ((computeP $ (iterN 100 (ghostS drawBox)) . ghostS drawBox $ img) :: IO (Array F DIM2 Word32))

The resulting times are averaged over the number of times the array was forced by being written to foreign memory. These results are typical of multiple runs on my machine.

Delayed      10x1 0.0234s
Unstructured 10x1 0.02652s
Structured   10x1 0.02652s
Delayed      1x101 0.078s
Unstructured 1x101 0.0936s
Structured   1x101 0.2652s

The results don't seem to depend on the order the benchmarks are run in.

Structured   10x1 0.03276s
Unstructured 10x1 0.02652s
Delayed      10x1 0.01716s
Structured   1x101 0.2184s
Unstructured 1x101 0.1092s
Delayed      1x101 0.0624s

These results suggest you can do a handful of full-array computations and still have the performance be dominated by memory access to write the results.

Libraries for rendering scenes by drawing on them typically have a very different structure from REPA, which is mostly built to do data processing tasks across all the data in parallel. Drawing and rendering libraries usually use a graph or tree of scene elements called a scene graph that allows them to quickly cull elements that won't be drawn in an image or a portion of an image. Mutating the result isn't necessary to achieve good performance if you can quickly cull everything that doesn't affect a specific pixel.

What is the fastest way to draw a rectangle in a REPA array?

1 Answers1