
I'm having a hard time writing a pipe with this signature:

toOneBigList :: (Monad m, Proxy p) => () -> Pipe p a [a] m r

It should simply take all the `a`s from upstream and send them downstream as a single list.

All my attempts look fundamentally broken.

Can anybody point me in the right direction?

Giacomo Tesio

2 Answers


There are two pipes-based solutions and I'll let you pick which one you prefer.

Note: It's not clear why you output the list on the downstream interface instead of just returning it directly as a result.

Conduit-style

The first one, which is very close to the conduit-based solution, uses the upcoming pipes-parse library, which is basically complete and just needs documentation. You can find the latest draft on Github.

Using pipes-parse, the solution is identical to the conduit solution that Petr gave:

import Control.Proxy
import Control.Proxy.Parse

combine
    :: (Monad m, Proxy p)
    => () -> Pipe (StateP [Maybe a] p) (Maybe a) [a] m ()
combine () = loop []
  where
    loop as = do
        ma <- draw                          -- Nothing marks end of input
        case ma of
            Nothing -> respond (reverse as) -- input exhausted: send the whole list downstream
            Just a  -> loop (a:as)          -- otherwise keep accumulating

draw is like conduit's await: it requests a value from either the leftovers buffer (that's the StateP part) or from upstream if the buffer is empty. Nothing indicates end of file.

You can wrap a pipe that does not have an end of file signal using the wrap function from pipes-parse, which has type:

wrap :: (Monad m, Proxy p) => p a' a b' b m r -> p a' a b' (Maybe b) m s
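
For example, here is a minimal usage sketch, assuming combine from above is in scope and that the leftovers buffer can be run with evalStateK from pipes' Control.Proxy.Trans.State (enumFromToS and printD are from the pipes Prelude):

import Control.Proxy.Trans.State (evalStateK)

-- Wrap the producer so it signals end of input with 'Nothing', start with an
-- empty leftovers buffer, and print the single list that 'combine' sends down.
main :: IO ()
main = runProxy $ evalStateK [] $
    liftP . wrap . enumFromToS 1 3 >-> combine >-> liftP . printD
-- Expected output: [1,2,3]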

Classic Pipes Style

The second alternative is a bit simpler. If you want to fold a given pipe you can do so directly using WriterP:

import Control.Proxy
import Control.Proxy.Trans.Writer

foldIt
  :: (Monad m, Proxy p) =>
     (() -> Pipe p a b m ()) -> () -> Pipe p a [b] m ()
foldIt p () = runIdentityP $ do
    r <- execWriterK (liftP . p >-> toListD >-> unitU) ()  -- run p, logging every value it outputs
    respond r                                              -- then send the accumulated list downstream

That's a higher-level description of what is going on, but it requires passing in the pipe as an explicit argument. It's up to you which one you prefer.
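
For instance, here is a usage sketch (assuming takeB_ and printD from the pipes Prelude); note that takeB_ 3 terminates on its own after passing three values, which is what ends the fold:

>>> runProxy $ enumFromToS 1 10 >-> foldIt (takeB_ 3) >-> printD
[1,2,3]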

By the way, this is why I was asking why you want to send a single value downstream. The above is much simpler if you return the folded list:

foldIt p = execWriterK (liftP . p >-> toListD)

The liftP might not even be necessary if p is completely polymorphic in its proxy type. I only include it as a precaution.
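
With this version the list simply comes back as the result of runProxy, e.g. (a sketch reusing enumFromToS from the pipes Prelude):

>>> runProxy $ foldIt (enumFromToS 1 10)
[1,2,3,4,5,6,7,8,9,10]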

Bonus Solution

The reason pipes-parse does not provide a toOneBigList is that grouping the results into a list is always a pipes anti-pattern. pipes has several nice features that make it possible to never have to group the input into a list, even if you are trying to yield multiple lists. For example, using respond composition you can have a proxy yield the subset of the stream it would have traversed and then inject a handler that uses that subset:

import Control.Monad (forever, replicateM_)
import Control.Proxy

example :: (Monad m, Proxy p) => () -> Pipe p a (() -> Pipe p a a m ()) m r
example () = runIdentityP $ forever $ do
    -- yield a sub-pipe that, when run, forwards the next three elements
    respond $ \() -> runIdentityP $ replicateM_ 3 $ request () >>= respond

printIt :: (Proxy p, Show a) => () -> Pipe p a a IO r
printIt () = runIdentityP $ do
    lift $ putStrLn "Here we go!"
    printD ()

-- Use respond composition to handle each yielded sub-pipe by running it
-- through printIt
useIt :: (Proxy p, Show a) => () -> Pipe p a a IO r
useIt = example />/ (\p -> (p >-> printIt) ())

Here's an example of how to use it:

>>> runProxy $ enumFromToS 1 10 >-> useIt
Here we go!
1
2
3
Here we go!
4
5
6
Here we go!
7
8
9
Here we go!
10

This means you never need to bring more than a single element into memory, even when you need to group elements.

Gabriella Gonzalez
  • foldIt is what I'm looking for. I understand that using a list is an anti-pattern, but afaik I need to build it in order to store it in a **binary** file. The whole problem is: take a large (>20000) list of large files (>20Mb each), compute a few statistics from each (10 Doubles per file), and store those results in a binary file. Still, even with pipes and foldIt, the program is running out of memory. I'm doing something wrong somewhere else... – Giacomo Tesio May 28 '13 at 18:12
  • @GiacomoTesio That's because you are loading the list of files into memory, which is what is causing the leak. I'm working on a `pipes-directory` library as we speak, which would stream the directory listing for you to avoid this common problem. – Gabriella Gonzalez May 28 '13 at 18:55
  • Thanks a lot. But I don't think those file names are the problem. I added a line with `lift $ putStrLn $ nameOf $ binaryDecodedSample` and all the files are actually listed in the output. Indeed, the process is consuming too much memory (2Gb) for it to be just the filenames. – Giacomo Tesio May 28 '13 at 19:12
  • 1
  • @GiacomoTesio Then the second culprit is that you are loading the list into memory before serializing it. You can serialize the list in constant memory (I know because I've done this). The trick to making it deserializable without wasting a lot of space is to incrementally serialize it in chunks of a maximum size (e.g. 1000 elements) and then prefix each chunk with its actual size, roughly as sketched after these comments. This means you never bring more than a fixed number of elements into memory. – Gabriella Gonzalez May 28 '13 at 20:37
  • 1
  • Nice trick. But I'm going to move to a line-based format (either CSV, Show-based, or JSON). That way I can write the lines one at a time. At least, it should be easier. – Giacomo Tesio May 28 '13 at 21:40
  • I needed exactly this only a few days ago and came up with your "classic pipes solution". My use case is that I am using ProduceT as "ListT done right" to query a graph database. I reify subresults into a list which among other things allows some "negation as failure". – phischu Jun 03 '13 at 13:40
  • @phischu Yeah, I've been playing around with cool uses of ListT, too, like using it to traverse directory trees as effectful lists. Your use case is definitely a correct one, too: you can use "ListT done right" to do depth-first effectful searches. – Gabriella Gonzalez Jun 03 '13 at 16:25
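
A minimal sketch of the chunked serialization suggested in the comments above (hypothetical helper; the name writeChunked, the chunk size, and the use of Data.Binary are illustrative assumptions, not part of the original discussion):

import qualified Data.ByteString.Lazy as BL
import Data.Binary (Binary, encode)
import Data.List (unfoldr)
import System.IO (IOMode(WriteMode), withFile)

-- Write the list in chunks of at most 1000 elements, each chunk prefixed by
-- its element count, so it can be written and later decoded incrementally
-- without ever holding the whole list in memory.
writeChunked :: Binary a => FilePath -> [a] -> IO ()
writeChunked path xs =
    withFile path WriteMode $ \h ->
        mapM_ (writeChunk h) (chunksOf 1000 xs)
  where
    writeChunk h chunk = do
        BL.hPut h (encode (length chunk))   -- size prefix for this chunk
        mapM_ (BL.hPut h . encode) chunk    -- then the elements themselves
    chunksOf n = unfoldr (\ys -> if null ys then Nothing else Just (splitAt n ys))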

I'll give only a partial answer, perhaps somebody else will have a better one.

As far as I know, standard pipes have no mechanism for detecting when the other part of the pipeline terminates. The first pipe that terminates produces the final result of the pipeline, and all the others are just dropped. So if you have a pipe that consumes input forever (in order to eventually produce a list), it will have no chance to act and produce its output when its upstream finishes. (This is intentional, so that the up- and downstream parts are dual to each other.) Perhaps this is solved in some library building on top of pipes.
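
To illustrate (a hypothetical sketch, not something you would actually use): the following type-checks, but it has no way to learn that upstream is done, so it never gets to respond its accumulated list:

import Control.Proxy

naive :: (Monad m, Proxy p) => () -> Pipe p a [a] m r
naive () = runIdentityP $ loop []
  where
    -- keeps requesting forever; when upstream terminates, the whole session
    -- ends and the accumulated list is simply discarded
    loop as = do
        a <- request ()
        loop (a:as)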

The situation is different with conduit. It has a consume function that combines all inputs into a list and returns it (rather than outputting it). Writing a function like the one you need, which outputs the list at the end, is not difficult:

import Data.Conduit

combine :: (Monad m) => Conduit a m [a]
combine = loop []
  where
    loop xs = await >>= maybe (yield $ reverse xs) (loop . (: xs))  -- on end of input, yield the list; otherwise keep accumulating
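
For example, a usage sketch (assuming Data.Conduit.List is imported qualified as CL):

>>> CL.sourceList [1..5] $= combine $$ CL.mapM_ print
[1,2,3,4,5]
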
Petr
  • +1, thanks for the answer. Still, I'm waiting for a pipes-based one... unless somebody knows that such a pipe type is actually wrong. – Giacomo Tesio May 28 '13 at 12:55