
This is the dual of the question "Performance considerations of Haskell FFI / C?": I would like to call a C function from Haskell with as little overhead as possible.

To set the scene, I have the following C function:

#include <stdint.h>

typedef struct
{
    uint64_t RESET;
} INPUT;

typedef struct
{
    uint64_t VGA_HSYNC;
    uint64_t VGA_VSYNC;
    uint64_t VGA_DE;
    uint8_t VGA_RED;
    uint8_t VGA_GREEN;
    uint8_t VGA_BLUE;
} OUTPUT;

void Bounce(const INPUT* input, OUTPUT* output);

Let's run it from C and time it, with gcc -O3:

#include <stdio.h>

int main (int argc, char **argv)
{
    INPUT input;
    input.RESET = 0;
    OUTPUT output;

    int cycles = 0;

    for (int j = 0; j < 60; ++j)
    {
        for (;; ++cycles)
        {
            Bounce(&input, &output);
            if (output.VGA_HSYNC == 0 && output.VGA_VSYNC == 0) break;
        }

        for (;; ++cycles)
        {
            Bounce(&input, &output);
            if (output.VGA_DE) break;
        }
    }

    printf("%d cycles\n", cycles);
}

Running it for 25152001 cycles takes ~400 ms:

$ time ./Bounce
25152001 cycles

real    0m0.404s
user    0m0.403s
sys     0m0.001s

Now let's write some Haskell code to set up FFI (note that Bool's Storable instance really does use a full int):

data INPUT = INPUT
    { reset :: Bool
    }

data OUTPUT = OUTPUT
    { vgaHSYNC, vgaVSYNC, vgaDE :: Bool
    , vgaRED, vgaGREEN, vgaBLUE :: Word64
    }
    deriving (Show)

foreign import ccall unsafe "Bounce" topEntity :: Ptr INPUT -> Ptr OUTPUT -> IO ()

instance Storable INPUT where ...
instance Storable OUTPUT where ...
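
The instances are elided above. Purely as an illustration, here is a minimal sketch of what the INPUT instance might look like, assuming RESET occupies a single 64-bit word at offset 0 and any nonzero value means True. The `roundTrip` helper is hypothetical and exists only to exercise the instance without the C side.

```haskell
import Data.Word (Word64)
import Foreign.Marshal.Utils (fromBool, toBool, with)
import Foreign.Storable (Storable (..))

-- Haskell-side view of the C struct INPUT { uint64_t RESET; }.
data INPUT = INPUT
    { reset :: Bool
    }
    deriving (Eq, Show)

-- Assumed layout: one 64-bit word at offset 0, nonzero meaning True.
instance Storable INPUT where
    sizeOf _ = 8
    alignment _ = 8
    peek p = do
        w <- peekByteOff p 0 :: IO Word64
        return INPUT{ reset = toBool w }
    poke p x = pokeByteOff p 0 (fromBool (reset x) :: Word64)

-- Hypothetical helper: write a value to temporary memory and read it back.
roundTrip :: INPUT -> IO INPUT
roundTrip x = with x peek
```

In GHCi, `roundTrip (INPUT True)` should hand back the value it was given, which is a cheap sanity check before wiring the instance up to the real C function.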

And let's do what I believe to be functionally equivalent to our C code from before:

main :: IO ()
main = alloca $ \inp -> alloca $ \outp -> do
    poke inp $ INPUT{ reset = False }

    let loop1 n = do
            topEntity inp outp
            out@OUTPUT{..} <- peek outp
            let n' = n + 1
            if not vgaHSYNC && not vgaVSYNC then loop2 n' else loop1 n'
        loop2 n = do
            topEntity inp outp
            out <- peek outp
            let n' = n + 1
            if vgaDE out then return n' else loop2 n'
        loop3 k n
          | k < 60 = do
              n' <- loop1 n
              loop3 (k + 1) n'
          | otherwise = return n

    n <- loop3 (0 :: Int) (0 :: Int)
    printf "%d cycles" n

I build it with GHC 8.6.5, using -O3, and I get... more than 3 seconds!

$ time ./.stack-work/dist/x86_64-linux/Cabal-2.4.0.1/build/sim-ffi/sim-ffi
25152001 cycles

real   0m3.468s
user   0m3.146s
sys    0m0.280s

And it's not a constant overhead at startup, either: if I run for 10 times the cycles, I get roughly 3.5 seconds from C and 34 seconds from Haskell.

What can I do to reduce the Haskell -> C FFI overhead?

HTNW
Cactus
  • I tried to reproduce this, but it requires too much effort. Can you somehow provide a complete example? https://stackoverflow.com/help/minimal-reproducible-example – chi Feb 21 '20 at 14:44
  • "What can I do to reduce the Haskell -> C FFI overhead?" I mean, the main thing would be doing less of it. You keep marshalling/unmarshalling things in a tight loop, so of course that’s going to be slow. – Cubic Feb 21 '20 at 17:30
  • Do you ever need the Haskell structure in Haskell? If you can just keep it as a pointer to memory and have a few test functions such as `vgaVSYNC :: Ptr OUTPUT -> IO Bool` then that will save a lot of copying, allocation, GC work on every call. – Thomas M. DuBuisson Feb 21 '20 at 18:20
  • @ThomasM.DuBuisson: please see my answer on why that is not a problem once `peek` is inlined. – Cactus Feb 22 '20 at 01:55
  • @Cubic: but the whole point of the eventual real program I am writing (not this stripped-down one) is to process the fields of `OUTPUT` for every single iteration of `topEntity`. – Cactus Feb 22 '20 at 01:56

1 Answer


I managed to reduce the overhead so that the 25 M calls now finish in 1.2 seconds. The changes were:

  1. Make loop1, loop2 and loop3 strict in the n argument (using BangPatterns)
  2. Add an INLINE pragma to peek in OUTPUT's Storable instance

Point #1 is silly, of course, but that's what I get for not profiling earlier. That change alone gets me to 1.5 seconds.
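
To make point #1 concrete without the C side, here is a minimal sketch of a counting loop with a strict accumulator. `tick` is a made-up stand-in for "call `topEntity`, then `peek`", and `countUntil` is a hypothetical name; only the bang on `n` mirrors the actual change.

```haskell
{-# LANGUAGE BangPatterns #-}

import Data.IORef (IORef, modifyIORef', readIORef)

-- Made-up stand-in for the FFI call plus peek: bumps a counter and
-- reports True on every 1000th call.
tick :: IORef Int -> IO Bool
tick ref = do
    modifyIORef' ref (+ 1)
    x <- readIORef ref
    return (x `mod` 1000 == 0)

-- Same shape as loop1/loop2 in the question: count calls until the
-- condition holds. The bang forces n on every iteration instead of
-- leaving that to GHC's strictness analysis.
countUntil :: IORef Int -> Int -> IO Int
countUntil ref = go
  where
    go !n = do
        done <- tick ref
        let n' = n + 1
        if done then return n' else go n'
```

In the real program the bangs go on the `n` parameter of `loop1`, `loop2` and `loop3` in the same way.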

Point #2, however, makes a ton of sense and is generally applicable. It also addresses the comment from @Thomas M. DuBuisson:

Do you ever need the Haskell structure in Haskell? If you can just keep it as a pointer to memory and have a few test functions such as vgaVSYNC :: Ptr OUTPUT -> IO Bool then that will save a lot of copying, allocation, GC work on every call.

In the eventual full program, I do need to look at all the fields of OUTPUT. However, with peek inlined, GHC is happy to do the case-of-case transformation, and I can see in Core that no OUTPUT value is allocated any more; the result of peek is consumed directly.
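
For reference, a sketch of how the pragma from point #2 can be attached inside an instance. The real instance is not shown in the question, so the offsets below are an assumption (the usual x86-64 layout for the C struct: three uint64_t at 0, 8 and 16, three uint8_t at 24, 25 and 26, padded to 32 bytes), and `bothSyncsLow`, `sample` and `checkSample` are hypothetical names for trying it out.

```haskell
import Data.Word (Word64, Word8)
import Foreign.Marshal.Utils (fromBool, toBool, with)
import Foreign.Ptr (Ptr)
import Foreign.Storable (Storable (..))

data OUTPUT = OUTPUT
    { vgaHSYNC, vgaVSYNC, vgaDE :: Bool
    , vgaRED, vgaGREEN, vgaBLUE :: Word64
    }
    deriving (Show)

instance Storable OUTPUT where
    sizeOf _ = 32     -- assumed: 27 bytes of fields, padded to 8-byte alignment
    alignment _ = 8

    -- With peek inlined at the call site, GHC can see the record being
    -- built and taken apart again, and may avoid allocating it.
    {-# INLINE peek #-}
    peek p = do
        hsync <- peekByteOff p 0 :: IO Word64
        vsync <- peekByteOff p 8 :: IO Word64
        de <- peekByteOff p 16 :: IO Word64
        r <- peekByteOff p 24 :: IO Word8
        g <- peekByteOff p 25 :: IO Word8
        b <- peekByteOff p 26 :: IO Word8
        return OUTPUT
            { vgaHSYNC = toBool hsync, vgaVSYNC = toBool vsync, vgaDE = toBool de
            , vgaRED = fromIntegral r, vgaGREEN = fromIntegral g, vgaBLUE = fromIntegral b
            }

    poke p o = do
        pokeByteOff p 0 (fromBool (vgaHSYNC o) :: Word64)
        pokeByteOff p 8 (fromBool (vgaVSYNC o) :: Word64)
        pokeByteOff p 16 (fromBool (vgaDE o) :: Word64)
        pokeByteOff p 24 (fromIntegral (vgaRED o) :: Word8)
        pokeByteOff p 25 (fromIntegral (vgaGREEN o) :: Word8)
        pokeByteOff p 26 (fromIntegral (vgaBLUE o) :: Word8)

-- The condition tested in loop1: the record from peek is consumed on the spot.
bothSyncsLow :: Ptr OUTPUT -> IO Bool
bothSyncsLow p = do
    OUTPUT{ vgaHSYNC = h, vgaVSYNC = v } <- peek p
    return (not h && not v)

-- Hypothetical sample value and driver, for use without the C function.
sample :: OUTPUT
sample = OUTPUT False False True 255 128 0

checkSample :: IO Bool
checkSample = with sample bothSyncsLow
```

Whether the allocation really disappears depends on the optimisation level and the call site, so it is worth checking the generated Core (for example with `-ddump-simpl`) rather than taking it for granted.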

Cactus