
I was working on a Project Euler problem and ended up with a Haskell file that included a function that looked like this:

matches :: (a -> a -> Bool) -> a -> [(a, Int)] -> Int
matches f cs = foldr (\(cs', n) a -> fromBool (f cs cs') * n + a) 0

fromBool is imported from Foreign.Marshal.Utils just as a quick way to convert True to 1 and False to 0.
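For reference, its type and behaviour (the inline examples are mine):

fromBool :: Num a => Bool -> a   -- fromBool True == 1, fromBool False == 0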

I was trying to get a little more speed out of my solution, so I tried switching from foldr to foldl' (flipping the lambda's arguments in the process), as I assumed foldr didn't make much sense when the result is just a number.
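
The foldl' version (essentially the variant left commented out in the full program further down, with foldl' coming from Data.List) looked like this:

matches :: (a -> a -> Bool) -> a -> [(a, Int)] -> Int
matches f cs = foldl' (\a (cs', n) -> fromBool (f cs cs') * n + a) 0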

Switching from foldr to foldl' caused me to allocate more than twice as much memory according to GHC's profiler.

For fun I also decided to replace the lambda with a pointfree version of the function:

matches :: (a -> a -> Bool) -> a -> [(a, Int)] -> Int
matches f cs = foldr ((+) . uncurry ((*) . fromBool . f cs)) 0

This caused my memory allocation to increase 20x from the foldr version.

Now this isn't a huge deal: even in the 20x case the total memory allocation was only about 135 MB, and the runtime of the program was relatively unaffected; if anything, the higher-allocation versions ran slightly faster.

But I am really curious how these results are possible, so that in the future I will be able to pick the "right" function when I don't have as much leeway.

EDIT:

GHC version 7.10.2, compiled with -O2 -prof -fprof-auto. Executed with +RTS -p.

EDIT 2:

Alright, it looks like this is too difficult to reproduce without the rest of the code, so here is the entire program:

SPOILERS BELOW:

{-# LANGUAGE NoMonomorphismRestriction #-}

import Control.Monad
import Data.List
import Foreign.Marshal.Utils

data Color = Red | Green | Blue deriving (Eq, Enum, Bounded, Show)

colors :: [Color]
colors = [Red ..]

matches :: (a -> a -> Bool) -> a -> [(a, Int)] -> Int
matches f x = foldr ((+) . uncurry ((*) . fromBool . f x)) 0
-- matches f x = foldr (\(y, n) a -> fromBool (f x y) * n + a) 0
-- matches f x = foldl' (\a (y, n) -> fromBool (f x y) * n + a) 0

invert :: [([Color], Int)] -> [([Color], Int)]
invert rs = (\cs -> (cs, matches valid cs rs)) <$> choices
  where
    len = maximum $ length . fst <$> rs
    choices = replicateM len colors
    valid (x : xs) (y : ys) = x /= y && valid xs ys
    valid _ _ = True

expand :: [([Color], Int)] -> [([Color], Int)]
expand rs = (\cs -> (cs, matches valid cs rs)) <$> choices
  where
    len = maximum $ length . fst <$> rs
    choices = replicateM (len + 1) colors
    valid (x1 : x2 : xs) (y : ys) = x1 /= y && x2 /= y && valid (x2 : xs) ys
    valid _ _ = True

getRow :: Int -> [([Color], Int)]
getRow 1 = flip (,) 1 . pure <$> colors
getRow n = expand . invert $ getRow (n - 1)

result :: Int -> Int
result n = sum $ snd <$> getRow n

main :: IO ()
main = print $ result 8
semicolon
  • just to make it clear: did you compile with optimizations, which GHC did you use, and can you please add the `foldl'` version (or at least say if you just switched the arguments or changed anything else)? Also, which Euler problem is this, in case anyone wants to run the example? – Random Dev Jul 18 '16 at 04:26
  • btw: are the `cs` themselves lists, and what is your `f`? – Random Dev Jul 18 '16 at 04:32
  • The `cs` themselves were lists yeah. Lists of a custom simple enum data type. One `f` (the one that took up most of the allocation) was `valid (x : xs) (y : ys) = x /= y && valid xs ys` and `valid _ _ = True`. I didn't want to copy too much into the question to avoid spoiling the Project Euler problem as it is a fairly difficult (70% difficulty) and very fun problem. – semicolon Jul 18 '16 at 04:35
  • Why `fromBool` and not just `fromEnum`? – dfeuer Jul 18 '16 at 05:05
  • While I have an explanation, it's really hard to reproduce your behaviour completely. The culprits are `uncurry` and some optimizations, that's for sure, but in order to provide a complete answer, some example functions/values would be _really_ helpful. – Zeta Jul 18 '16 at 09:09
  • @chi isn't this part of the question (seems like a similar switch took 2-times the memory) - TBH I'm more interested in this part – Random Dev Jul 18 '16 at 10:10
  • @Carsten Oh, right. I missed a paragraph above. Interesting... – chi Jul 18 '16 at 11:08
  • Grah. This is ridiculous. If you use `a ~ Int`, GHC creates a point-free variant with the same allocation behaviour, ___unless you call the function twice___. If you use `a ~ Bool`, point-free functions _always_ lead to the allocation behavior. So, long story short, it's a PITA to reproduce your behaviour exactly. Please add at least the used type for `a`. – Zeta Jul 18 '16 at 11:10
  • @dfeuer Well now I feel dumb, in my defense `fromBool` appears before `fromEnum` in hoogle when I searched for `Bool -> Int`. – semicolon Jul 18 '16 at 13:48
  • @Zeta I added the code. Was hoping to avoid spoilers and what not. But yeah my code is identical to the code I now pasted into the question. Just varying which 2 of the 3 variants I comment out. – semicolon Jul 18 '16 at 13:49
  • re: spoilers - if you don't mention what problem it solves nobody would be the wiser. – ErikR Jul 18 '16 at 14:32
  • @ErikR yeah that is probably true, but I just didn't want someone that likes Project Euler to thoroughly analyze my code and then stumble on the problem and be disappointed that they already know how to do it. – semicolon Jul 18 '16 at 14:42
  • @semicolon: That has been a wild ride. See answer below. Spoilers: it's, as I thought, `uncurry` and laziness. – Zeta Jul 18 '16 at 15:14

1 Answer


Note: This post is written in literate Haskell. Copy it into a file, save it as *.lhs, and compile/load it in GHC(i). Also, I started writing this answer before you edited your code, but the lesson stays the same.

TL;DR

The Prelude function uncurry is too lazy, whereas your pattern match is just strict enough.
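
For reference, the Prelude defines uncurry via projections rather than a pattern match (this is the Report's definition; GHC's base uses the same one):

uncurry :: (a -> b -> c) -> (a, b) -> c
uncurry f p = f (fst p) (snd p)

so the pair argument itself is never forced.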

A word of caution and a disclaimer

We're entering a magical, weird place. Beware. Also, my CORE abilities are rudimentary. Now that I've lost all my credibility, let's get started.

The tested code

In order to know where we get the additional memory requirements, it's useful to have more than two functions.

> import Control.Monad (forM_)

This is your original, non-pointfree variant:

> matches :: (a -> a -> Bool) -> a -> [(a, Int)] -> Int
> matches    f cs = foldr (\(cs', n) a -> fromEnum (f cs cs') * n + a) 0

This is a variant that's only slightly point-free: the lambda's accumulator parameter a is eta-reduced.

> matchesPF' :: (a -> a -> Bool) -> a -> [(a, Int)] -> Int
> matchesPF' f cs = foldr (\(cs', n) -> (+) (fromEnum (f cs cs') * n)) 0

This is a variant that inlines uncurry by hand.

> matchesPFI :: (a -> a -> Bool) -> a -> [(a, Int)] -> Int
> matchesPFI f cs = foldr ((+) . (\(cs', n) -> fromEnum (f cs cs') * n)) 0

This is your pointfree version.

> matchesPF :: (a -> a -> Bool) -> a -> [(a, Int)] -> Int
> matchesPF  f cs = foldr ((+) . uncurry  ((*) . fromEnum . f cs)) 0

This is a variant that uses a custom uncurry, see below.

> matchesPFU :: (a -> a -> Bool) -> a -> [(a, Int)] -> Int
> matchesPFU f cs = foldr ((+) . uncurryI ((*) . fromEnum . f cs)) 0

This is a variant that uses a custom lazy uncurry, see below.

> matchesPFL :: (a -> a -> Bool) -> a -> [(a, Int)] -> Int
> matchesPFL f cs = foldr ((+) . uncurryL ((*) . fromEnum . f cs)) 0

To test the functions easily, we use a list:

> funcs = [matches, matchesPF', matchesPF, matchesPFL, matchesPFU, matchesPFI]

Our self-written uncurry:

> uncurryI :: (a -> b -> c) -> (a, b) -> c
> uncurryI f (a,b) = f a b

A lazier uncurry:

> uncurryL :: (a -> b -> c) -> (a, b) -> c
> uncurryL f p = f (fst p) (snd p)

The lazy variant uncurryL has the same semantics as the variant in Prelude, e.g.

uncurry (\_ _ -> 0) undefined == 0 == uncurryL (\_ _ -> 0) undefined

whereas uncurryI is strict in the pair's spine.
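
A quick illustration of the difference (evaluated by hand, not part of the literate program):

uncurryL (\_ _ -> 0) undefined   -- 0, because the pair itself is never demanded
uncurryI (\_ _ -> 0) undefined   -- *** Exception: Prelude.undefined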

> main = do
>   let f a b = a < b
>   forM_ [1..10] $ \i ->
>     forM_ funcs $ \m ->
>       print $ m f i (zip (cycle [1..10]) [1..i*100000])

The list [1..i*100000] depends on i deliberately, so that we don't introduce a CAF and skew our allocation profile.

The desugared code

Before we delve into the profile, let's have a look at the desugared code of each function:

==================== Desugar (after optimization) ====================
Result size of Desugar (after optimization)
  = {terms: 221, types: 419, coercions: 0}

uncurryL
uncurryL = \ @ a @ b @ c f p -> f (fst p) (snd p)

uncurryI
uncurryI = \ @ a @ b @ c f ds -> case ds of _ { (a, b) -> f a b }

-- uncurried inlined by hand
matchesPFI =
  \ @ a f cs ->
    foldr
      $fFoldable[]
      (. (+ $fNumInt)
         (\ ds ->
            case ds of _ { (cs', n) ->
            * $fNumInt (fromEnum $fEnumBool (f cs cs')) n
            }))
      (I# 0)

-- lazy uncurry
matchesPFL =
  \ @ a f cs ->
    foldr
      $fFoldable[]
      (. (+ $fNumInt)
         (uncurryL (. (* $fNumInt) (. (fromEnum $fEnumBool) (f cs)))))
      (I# 0)

-- stricter uncurry
matchesPFU =
  \ @ a f cs ->
    foldr
      $fFoldable[]
      (. (+ $fNumInt)
         (uncurryI (. (* $fNumInt) (. (fromEnum $fEnumBool) (f cs)))))
      (I# 0)

-- normal uncurry
matchesPF =
  \ @ a f cs ->
    foldr
      $fFoldable[]
      (. (+ $fNumInt)
         (uncurry (. (* $fNumInt) (. (fromEnum $fEnumBool) (f cs)))))
      (I# 0)

-- eta-reduced a
matchesPF' =
  \ @ a f cs ->
    foldr
      $fFoldable[]
      (\ ds ->
         case ds of _ { (cs', n) ->
         + $fNumInt (* $fNumInt (fromEnum $fEnumBool (f cs cs')) n)
         })
      (I# 0)

-- non-point-free
matches =
  \ @ a f cs ->
    foldr
      $fFoldable[]
      (\ ds a ->
         case ds of _ { (cs', n) ->
         + $fNumInt (* $fNumInt (fromEnum $fEnumBool (f cs cs')) n) a
         })
      (I# 0)

So far, everything seems well. There's nothing surprising going on. Typeclass functions are replaced with their dictionary variants, e.g. foldr becomes foldr $fFoldable[], since we call it on a list.

The profile

   Mon Jul 18 15:47 2016 Time and Allocation Profiling Report  (Final)

       Prof +RTS -s -p -RTS

    total time  =        1.45 secs   (1446 ticks @ 1000 us, 1 processor)
    total alloc = 1,144,197,200 bytes  (excludes profiling overheads)

COST CENTRE  MODULE    %time %alloc

matchesPF'   Main       13.6    0.0
matchesPF    Main       13.3   11.5
main.\.\     Main       11.8   76.9
main.f       Main       10.9    0.0
uncurryL     Main        9.5   11.5
matchesPFU   Main        8.9    0.0
matchesPFI   Main        7.3    0.0
matches      Main        6.9    0.0
matchesPFL   Main        6.3    0.0
uncurryI     Main        5.3    0.0
matchesPF'.\ Main        2.6    0.0
matchesPFI.\ Main        2.0    0.0
matches.\    Main        1.5    0.0


                                                             individual     inherited
COST CENTRE        MODULE                  no.     entries  %time %alloc   %time %alloc

MAIN               MAIN                     44           0    0.0    0.0   100.0  100.0
 main              Main                     89           0    0.0    0.0   100.0  100.0
  main.\           Main                     90          10    0.0    0.0   100.0  100.0
   main.\.\        Main                     92          60   11.8   76.9   100.0  100.0
    funcs          Main                     93           0    0.0    0.0    88.2   23.1
     matchesPFI    Main                    110          10    7.3    0.0    11.7    0.0
      matchesPFI.\ Main                    111     5500000    2.0    0.0     4.4    0.0
       main.f      Main                    112     5500000    2.4    0.0     2.4    0.0
     matchesPFU    Main                    107          10    8.9    0.0    15.3    0.0
      uncurryI     Main                    108     5500000    5.3    0.0     6.4    0.0
       main.f      Main                    109     5500000    1.1    0.0     1.1    0.0
     matchesPFL    Main                    104          10    6.3    0.0    17.7   11.5
      uncurryL     Main                    105     5500000    9.5   11.5    11.4   11.5
       main.f      Main                    106     5500000    1.9    0.0     1.9    0.0
     matchesPF     Main                    102          10   13.3   11.5    15.4   11.5
      main.f       Main                    103     5500000    2.1    0.0     2.1    0.0
     matchesPF'    Main                     99          10   13.6    0.0    17.2    0.0
      matchesPF'.\ Main                    100     5500000    2.6    0.0     3.6    0.0
       main.f      Main                    101     5500000    1.0    0.0     1.0    0.0
     matches       Main                     94          10    6.9    0.0    10.9    0.0
      matches.\    Main                     97     5500000    1.5    0.0     4.0    0.0
       main.f      Main                     98     5500000    2.5    0.0     2.5    0.0
 CAF               Main                     87           0    0.0    0.0     0.0    0.0
  funcs            Main                     91           1    0.0    0.0     0.0    0.0
  main             Main                     88           1    0.0    0.0     0.0    0.0
   main.\          Main                     95           0    0.0    0.0     0.0    0.0
    main.\.\       Main                     96           0    0.0    0.0     0.0    0.0
 CAF               GHC.IO.Handle.FD         84           0    0.0    0.0     0.0    0.0
 CAF               GHC.Conc.Signal          78           0    0.0    0.0     0.0    0.0
 CAF               GHC.IO.Encoding          76           0    0.0    0.0     0.0    0.0
 CAF               GHC.IO.Handle.Text       75           0    0.0    0.0     0.0    0.0
 CAF               GHC.IO.Encoding.Iconv    59           0    0.0    0.0     0.0    0.0

Ignore the main\.\. noise, it's just the list. However, there's one point that one should notice immediately: matchesPF and uncurryL use the same alloc%:

matchesPF    Main       13.3   11.5
uncurryL     Main        9.5   11.5

Getting to the CORE

Now it's time to inspect the resulting CORE (ghc -ddump-simpl). We'll notice that most of the functions have been transformed into worker wrappers, and they look more or less the same (-dsuppress-all -dsuppress-uniques):

$wa5
$wa5 =
  \ @ a1 w w1 w2 ->
    letrec {
      $wgo
      $wgo =
        \ w3 ->
          case w3 of _ {
            [] -> 0;
            : y ys ->
              case y of _ { (cs', n) ->
              case $wgo ys of ww { __DEFAULT ->
              case w w1 cs' of _ {
                False -> case n of _ { I# y1 -> ww };
                True -> case n of _ { I# y1 -> +# y1 ww }
              }
              }
              }
          }; } in
    $wgo w2

This is your usual worker-wrapper. $wgo takes a list, checks whether it's empty, is strict in the head (case y of _ { (cs', n) -> …}) and lazy in the recursive result ($wgo ys of ww).

All functions look the same. Well, all except matchesPF (your variant):

-- matchesPF
$wa3 =
  \ @ a1 w w1 w2 ->
    letrec {
      $wgo =
        \ w3 ->
          case w3 of _ {
            [] -> 0;
            : y ys ->
              case $wgo ys of ww { __DEFAULT ->
              case let {
                     x = case y of _ { (x1, ds) -> x1 } } in
                   case w w1 x of _ {
                     False ->
                       case y of _ { (ds, y1) -> case y1 of _ { I# y2 -> main13 } };
                              -- main13 is just I# 0
                     True -> case y of _ { (ds, y1) -> y1 }
                   }
              of _ { I# x ->
              +# x ww
              }
              }
          }; } in
    $wgo w2

and matchesPFL (the variant that uses the lazy uncurryL):

-- matchesPFL
$wa2
$wa2 =
  \ @ a1 w w1 w2 ->
    letrec {
      $wgo =
        \ w3 ->
          case w3 of _ {
            [] -> 0;
            : y ys ->
              case $wgo ys of ww { __DEFAULT ->
              case snd y of ww1 { I# ww2 ->
              case let {
                     x = fst y } in
                   case w w1 x of _ {
                     False -> main13;
                     True -> ww1
                   }
              of _ { I# x ->
              +# x ww
              }
              }
              }
          }; } in
    $wgo w2

They are virtually the same. And both of them contain let bindings, which create thunks and usually lead to worse space requirements.

The solution

I think the culprit at this point is clear. It is uncurry. GHC wants to enforce the correct semantics of

uncurry (const (const 0)) undefined

However, this adds laziness and additional thunks. Your non-pointfree variant doesn't introduce that behaviour, since you pattern match on the pair:

foldr (\(cs', n) a -> …)

Still don't trust me? Use a lazy pattern match

foldr (\ ~(cs', n) a -> …)

and you will notice that matches behaves the same as matchesPF. So use a slightly stricter variant of uncurry; uncurryI is enough to give the strictness analyzer a hint.
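
For concreteness, here is the lazy-pattern variant spelled out in full; the name matchesLazy is mine and it is not part of the profiled program above (it's also not bird-tracked, so it doesn't change the literate source):

matchesLazy :: (a -> a -> Bool) -> a -> [(a, Int)] -> Int
matchesLazy f cs = foldr (\ ~(cs', n) a -> fromEnum (f cs cs') * n + a) 0
-- the irrefutable pattern delays forcing the pair until cs' and n are actually
-- needed, so this allocates like matchesPF rather than like the original matches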

Note that pairs are notorious for this behaviour. Real World Haskell spends a whole chapter trying to optimize the behaviour of a single function where intermediate pairs lead to problems.

Zeta
  • Thank you so much! Awesome write up! I know that generally weird performance stuff is either due to laziness or sharing (too much or too little), but I totally forgot about `_|_` and `(_|_, _|_)` being totally different things. So I incorrectly assumed that using tuples couldn't cause any weird laziness related issues. – semicolon Jul 18 '16 at 15:33
  • I don't see what's particularly correct about `uncurry (const (const 0))` being `0`. I'm also annoyed by the excessive laziness of `bimap` for pairs. – dfeuer Jul 18 '16 at 18:19
  • @dfeuer [That's how it's defined in the report](https://www.haskell.org/onlinereport/haskell2010/haskellch9.html#x16-1710009). But yes, that kind of laziness is kind of unexpected. – Zeta Jul 18 '16 at 18:51
  • @dfeuer I mean Haskell by default generally minimizes strictness. And the least strict way to define `uncurry` is `uncurry f ~(x, y) = f x y`, which gives you `uncurry (const (const 0)) undefined == 0`. – semicolon Jul 19 '16 at 15:40
  • @semicolon, yes, I understand that line of reasoning. But lazy deconstruction of a tuple or record is generally best done by hand in the special cases that require it, although I'm given to understand this may not be the case for `Arrow` work. Unfortunately, some thought is usually required to figure out just where laziness is needed. – dfeuer Jul 19 '16 at 15:50
  • @dfeuer Yeah that is probably true. But just like with `genericLength`, Haskell seems to default to being maximally lazy, even when you don't always want that, in the interest of maximizing the number of correct programs at the cost of performance. – semicolon Jul 19 '16 at 16:02