
I would like to parse an infinite stream of bytes into an infinite stream of Haskell data. Each byte is read from the network, so each one is wrapped in the IO monad.

More concretely, I have an infinite stream of type [IO ByteString]. On the other hand, I have a pure parsing function parse :: [ByteString] -> [Object] (where Object is a Haskell data type).

Is there a way to plug my infinite stream of monadic actions into my parsing function?

For instance, is it possible to write a function of type [IO ByteString] -> IO [ByteString] so that I can use my parse function inside the monad?

András Kovács
abitbol
  • As stated, this requires "lazy IO" -- you can google for that, but please note it comes with a few subtle issues, e.g. IO will be performed at essentially unpredictable points in the program. I'd rather try to rewrite `parse` to use pipes or conduit, or some other library for "IO streams". – chi Apr 05 '19 at 19:52
  • OK, thank you. What is the difference between `[IO ByteString]` and a Haskell `Stream`, for instance? – abitbol Apr 05 '19 at 20:36
  • @abitbol Which `Stream` are you referring to? `pipes` and `conduit` are two different libraries which provide their own implementation of the concept of an infinite stream. – chepner Apr 05 '19 at 21:18
  • Actually, I only have a vague intuition about pipes and conduit streams, and I'm afraid that I might mislead you. I think you should be able to convert an `[IO a]` into a conduit `StreamSource IO a` or something like that. – chi Apr 05 '19 at 21:19
  • Aren't you looking for `sequence` function? – radrow Apr 05 '19 at 22:20
  • @radrow They aren't, because `sequence` with `IO` and infinite lists doesn't terminate. – duplode Apr 05 '19 at 23:27

1 Answer

The Problem

Generally speaking, in order for IO actions to be properly ordered and behave predictably, each action needs to complete fully before the next action is run. In a do-block, this means that this works:

main = do
    sequence (map putStrLn ["This","action","will","complete"])
    putStrLn "before we get here"

but unfortunately this won't work if that final IO action is important, because it is never reached:

dontRunMe = do
    putStrLn "This is a problem when an action is"
    sequence (repeat (putStrLn "infinite"))
    putStrLn "<not printed>"

So, even though sequence can be specialized to the right type signature:

sequence :: [IO a] -> IO [a]

it doesn't work as expected on an infinite list of IO actions. You'll have no problem defining such a sequence:

badSeq :: IO [Char]
badSeq = sequence (repeat (return '+'))

but any attempt to execute the IO action (e.g., by trying to print the head of the resulting list) will hang:

main = (head <$> badSeq) >>= print

It doesn't matter if you only need a part of the result. You won't get anything out of the IO monad until the entire sequence is done (so "never" if the list is infinite).
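To make the distinction concrete, here is a minimal sketch (the name firstN is hypothetical, not part of any library). Truncating the list of actions before sequencing terminates, because sequence then only has finitely many actions to run; truncating the result afterwards does not help.

```haskell
-- Hypothetical helper: run only the first n actions of a (possibly infinite) list.
-- `take` is applied to the list of actions, so `sequence` sees a finite list.
firstN :: Int -> [IO a] -> IO [a]
firstN n actions = sequence (take n actions)

-- By contrast, this never returns for an infinite list, because the whole
-- `sequence` must finish before `take` gets to look at the result:
--   take n <$> sequence actions
```

This only helps when you know up front how many elements you need, which is not the case for an open-ended network stream.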

The "Lazy IO" Solution

If you want to get data from a partially completed IO action, you need to be explicit about it and make use of a scary-sounding Haskell escape hatch, unsafeInterleaveIO. This function takes an IO action and "defers" it so that it won't actually execute until the value is demanded.

The reason this is unsafe in general is that an IO action that makes sense now might mean something different if it is actually executed at a later point in time. As a simple example, an IO action that truncates/removes a file has a very different effect if it's executed before versus after updated file contents are written!

Anyway, what you'd want to do here is write a lazy version of sequence:

import System.IO.Unsafe (unsafeInterleaveIO)

lazySequence :: [IO a] -> IO [a]
lazySequence [] = return []  -- oops, not infinite after all
lazySequence (m:ms) = do
  x <- m
  xs <- unsafeInterleaveIO (lazySequence ms)
  return (x:xs)

The key point here is that, when a lazySequence infstream action is executed, it will actually execute only the first action; the remaining actions will be wrapped up in a deferred IO action that won't truly execute until the second and subsequent elements of the returned list are demanded.

This works for fake IO actions:

> take 5 <$> lazySequence (repeat (return '+'))
"+++++"
>

(where if you replaced lazySequence with sequence, it would hang). It also works for real IO actions:

> lns <- lazySequence (repeat getLine)
<waits for first line of input, then returns to prompt>
> print (head lns)
<prints whatever you entered>
> length (head (tail lns))  -- force next element
<waits for second line of input>
<then shows length of your second line before prompt>
>
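One way to check the "only what is demanded gets executed" behaviour is to count how many actions have run. The following is a sketch under assumptions: tick and ranAfterTaking are hypothetical names introduced for illustration, and lazySequence is repeated only so the block is self-contained.

```haskell
import Control.Exception (evaluate)
import Data.IORef (IORef, newIORef, modifyIORef', readIORef)
import System.IO.Unsafe (unsafeInterleaveIO)

-- Same definition as above.
lazySequence :: [IO a] -> IO [a]
lazySequence [] = return []
lazySequence (m:ms) = do
  x <- m
  xs <- unsafeInterleaveIO (lazySequence ms)
  return (x:xs)

-- Hypothetical action: bump a counter and return its new value.
tick :: IORef Int -> IO Int
tick ref = do
  modifyIORef' ref (+ 1)
  readIORef ref

-- Hypothetical probe: demand the first n elements of the lazy result,
-- then report how many actions have actually been executed.
ranAfterTaking :: Int -> IO Int
ranAfterTaking n = do
  ref <- newIORef 0
  xs <- lazySequence (repeat (tick ref))
  _ <- evaluate (length (take n xs))  -- forces the first n list cells
  readIORef ref
```

With this definition, ranAfterTaking 3 should report 3, while ranAfterTaking 0 should still report 1, since lazySequence runs the first action eagerly and defers only the rest.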

Anyway, with this definition of lazySequence and types:

parse :: [ByteString] -> [Object]
inputs :: [IO ByteString]

you should have no trouble writing:

outputs :: IO [Object]
outputs = parse <$> lazySequence inputs

and then using it lazily however you want:

main = do
    objs <- outputs
    mapM_ doSomethingWithObj objs

Using Conduit

Even though the above lazy IO mechanism is pretty simple and straightforward, lazy IO has fallen out of favor for production code due to issues with resource management, fragility with respect to space leaks (where a small change to your code blows up the memory footprint), and problems with exception handling.

One solution is the conduit library. Another is pipes. Both are carefully designed streaming libraries that can support infinite streams.

For conduit, if you had a parse function that created one object per byte string, like:

parse1 :: ByteString -> Object
parse1 = ...

then given:

inputs :: [IO ByteString]
inputs = ...

useObject :: Object -> IO ()
useObject = ...

the conduit would look something like:

import Conduit

main :: IO ()
main = runConduit $  mapM_ yieldM inputs
                  .| mapC parse1
                  .| mapM_C useObject

Given that your parse function has signature:

parse :: [ByteString] -> [Object]

I'm pretty sure you can't integrate this with conduit directly (or at least not in any way that wouldn't toss out all the benefits of using conduit). You'd need to rewrite it to be conduit friendly in how it consumed byte strings and produced objects.
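As a rough illustration of what "conduit friendly" could mean, here is a sketch that assumes a made-up wire format: each object is one newline-terminated record, and a record may be split across several byte strings. The Object type and parseC below are hypothetical stand-ins, not your actual parser; the point is only that the conduit buffers input with await until a whole record is available and then emits it with yield.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Conduit
import Data.ByteString (ByteString)
import qualified Data.ByteString as BS
import qualified Data.ByteString.Char8 as BC

-- Hypothetical object type: the raw bytes of one newline-terminated record.
newtype Object = Object ByteString deriving (Eq, Show)

-- Hypothetical streaming parser: accumulate chunks until a complete record
-- is buffered, emit it, and keep the unconsumed remainder for the next one.
parseC :: Monad m => ConduitT ByteString Object m ()
parseC = go BS.empty
  where
    go buf = case BC.elemIndex '\n' buf of
      Just i -> do
        yield (Object (BS.take i buf))  -- record without its newline
        go (BS.drop (i + 1) buf)        -- continue with the leftover bytes
      Nothing -> do
        next <- await
        case next of
          Nothing    -> return ()       -- end of input: drop any partial record
          Just chunk -> go (buf <> chunk)
```

It would slot into the pipeline above in place of mapC parse1, i.e. mapM_ yieldM inputs .| parseC .| mapM_C useObject. As a pure check, runConduitPure (yieldMany (["ab\ncd", "e\nf"] :: [ByteString]) .| parseC .| sinkList) gives [Object "ab", Object "cde"]; the trailing "f" is discarded because its record never completes.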

K. A. Buhr
  • Note that if we have a function like `parse1`, then the problem is much simpler. `fmap (fmap parse1) inputs :: [IO Object]` is already a huge step forward. Rather, I would have expected that a main issue here is that we do not know how many bytestrings the parser has to consume to produce an object, and that the last bytestring might also be only partially consumed. Of course, a proper conduit parser should handle such issues in a nice way, since it essentially awaits enough bytes before producing one object. – chi Apr 06 '19 at 00:26
  • Thank you for your very detailed response! It helps me understand what is idiomatic in Haskell for this kind of IO treatment. – abitbol Apr 06 '19 at 05:30