I have the following minimal example:
import Test.Tasty.Bench
{-# INLINE loop #-}
loop :: Int -> Int -> Int
loop a 0 = a
loop a n = loop (a + x + y) (n - 1) where (x, y) = foo n
{-# INLINE foo #-}
foo :: Int -> (Int, Int)
foo n = if n > 0 then (n + 1, n) else foo (n + 1)
main :: IO ()
main = defaultMain [bench "test" $ whnf (loop 0) (1024 * 1024)]
Running it gives me:
test: OK (0.24s)
1.88 ms ± 88 μs, 32 MB allocated, 2.8 KB copied, 2.0 MB peak memory
I want to avoid the 32 MB heap allocation. Looking at the dumped Core, I find the following worker functions:
Rec {
-- RHS size: {terms: 18, types: 6, coercions: 0, joins: 0/0}
$wfoo
  = \ ww ->
      case ># ww 0# of {
        __DEFAULT -> $wfoo (+# ww 1#);
        1# -> (# I# (+# ww 1#), I# ww #)
      }
end Rec }
Rec {
-- RHS size: {terms: 26, types: 14, coercions: 0, joins: 0/0}
$wloop
  = \ ww ww1 ->
      case ww1 of ds {
        __DEFAULT ->
          case $wfoo ds of { (# ww3, ww4 #) ->
          case ww3 of { I# y ->
          case ww4 of { I# y1 -> $wloop (+# (+# ww y) y1) (-# ds 1#) }
          }
          };
        0# -> ww
      }
end Rec }
I am pretty sure that the heap allocation is caused by $wfoo returning lifted Int values. I have tried various strictness annotations to coax GHC into generating a worker function that returns unlifted values, but without success. For instance, the following causes no changes in the dumped Core other than renaming bound variables:
loop a n = seq x $ seq y $ loop (a + x + y) (n - 1) where (x, y) = foo n
If I drop the second component of the tuple and have foo only return a single Int, GHC immediately removes all the lifted Int values and the resulting program uses no heap allocation.
I have also been able to avoid the lifted Int values by using a datatype with strict fields like data Pair = Pair !Int !Int (although I have not been able to do this with a polymorphic strict pair). Curiously enough, this datatype does not even appear in the Core, which uses only unboxed tuples of unlifted values.
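A minimal sketch of the strict-field variant. Only the Pair declaration comes from the description above; the way loop and foo use it is an assumption that mirrors the original example, and the benchmark harness is again replaced by a plain main.

```haskell
-- Sketch (assumed shape): a monomorphic pair with strict fields replaces the tuple.
data Pair = Pair !Int !Int

{-# INLINE loop #-}
loop :: Int -> Int -> Int
loop a 0 = a
loop a n = loop (a + x + y) (n - 1) where Pair x y = foo n

{-# INLINE foo #-}
foo :: Int -> Pair
foo n = if n > 0 then Pair (n + 1) n else foo (n + 1)

main :: IO ()
main = print (loop 0 3)  -- (4 + 3) + (3 + 2) + (2 + 1) = 15
```

Apart from the Pair type, this is the tuple version unchanged, so any difference in the generated worker should come from the strict fields alone.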
In my production code, I need the function to work with tuples, so these solutions do not work for me. Since the datatype with strict fields ends up getting erased anyway, it seems to me that it only serves as a convoluted way of making strictness annotations. I assume that there is a more direct way to express those same strictness annotations and get GHC to generate a worker function that returns unlifted values.
How can I make GHC generate a worker function for foo that returns unlifted values and thus avoid the heap allocation associated with lifted values?
I am using GHC 8.10.7 and LLVM 13.0.1. The program is built with -O2 -fllvm -optlo-O3 and run with +RTS -T.