
What I'm trying to do is use takeWhile to split a bytestring by some character.

import qualified Data.ByteString.Internal as BS (c2w, w2c)
import Pipes
import Pipes.ByteString as PB
import Pipes.GZip
import Pipes.Prelude as PP
import System.IO

newline = BS.c2w '\n'

splitter = PB.takeWhile (\myWord -> myWord /= newline)

myPipe fileHandle = PP.toListM $ decompress fileProducer >-> splitter
  where
    fileProducer = PB.fromHandle fileHandle       

run = do
  dat <- withFile "somefile.blob" ReadMode myPipe
  pure dat

This gets me the first line, but what I really want is to effectively yield each chunk up to a newline character at a time. How do I do that?

daj
  • Here's a similar question http://stackoverflow.com/questions/25982213/using-haskell-pipes-bytestring-to-iterate-a-file-by-line/27727521#27727521 – Michael Jun 05 '16 at 15:29

2 Answers


@Michael's answer is good. I just want to illustrate some usage patterns that are going on here.

( .lhs available at http://lpaste.net/165352 )

First a few imports:

 {-# LANGUAGE OverloadedStrings, NoMonomorphismRestriction #-}

 import Pipes
 import qualified Pipes.Prelude as PP
 import qualified Pipes.Group as PG
 import qualified Pipes.ByteString as PB
 import qualified Pipes.GZip as GZip
 import qualified Data.ByteString as BS
 import Lens.Family (view, over)
 import Control.Monad
 import System.IO

If you look over the functions in Pipes.ByteString and Pipes.GZip you'll see that they all fall into the following type schemas:

  1. Producer ... -> FreeT (Producer ...) ...
  2. FreeT (Producer ...) ... -> Producer ...
  3. Lens' (Producer ...) (FreeT (Producer ...) ...)
  4. Producer ... -> Producer ...

Examples of functions in each category:

  1. PB.words
  2. PG.concats
  3. PB.lines, PB.chunksOf, PB.splits, ...
  4. GZip.compress, GZip.decompress

Here's how to use PB.words to split an input stream into words:

 prod = yield "this is\na test\nof the pipes\nprocessing\nsystem"

 t1 = runEffect $ (PG.concats . PB.words) prod >-> PP.print

To use a function of type 3 (e.g. PB.lines), just use view on the Lens' to get a function of type 1 and then compose with PG.concats:

 t2a = runEffect $ (PG.concats . view PB.lines) prod >-> PP.print

 t2b h = (PG.concats . view PB.lines) (PB.fromHandle h) >-> PP.print

 run2 = withFile "input" ReadMode (runEffect . t2b)
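
Both patterns can also be checked in pure code, with no IO, when the source is a constant producer. A small self-contained sketch (the file/GZip examples below still need IO); note that the resulting lists contain the chunks as they stream out, which coincide with whole words/lines here only because the input is a single chunk:

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Pipes
import qualified Pipes.Prelude as PP
import qualified Pipes.Group as PG
import qualified Pipes.ByteString as PB
import Data.ByteString (ByteString)
import Lens.Family (view)

-- category 1 (PB.words) undone by category 2 (PG.concats):
wordList :: [ByteString]
wordList = PP.toList (PG.concats (PB.words (yield "a bb ccc")))

-- category 3 (PB.lines) turned into a category-1 function with `view`:
lineList :: [ByteString]
lineList = PP.toList (PG.concats (view PB.lines (yield "a\nbb\nccc")))
```

Here both `wordList` and `lineList` come out as `["a","bb","ccc"]`.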

For a Producer -> Producer function, just use normal function application:

 t3 h = GZip.decompress (PB.fromHandle h) >-> PP.print

 run3 = withFile "input.gz" ReadMode (runEffect . t3)

 t4 h = GZip.decompress (PB.fromHandle h) >-> PP.map BS.length >-> PP.print

 run4 = withFile "big.gz" ReadMode (runEffect . t4)

To first decompress and then split by lines, we nest function application:

 t5 h = (PG.concats . view PB.lines) ( GZip.decompress (PB.fromHandle h) )
          >-> PP.map BS.length >-> PP.print

 run5 = withFile "input.gz" ReadMode (runEffect . t5)
ErikR
  • Thanks for explaining this. Why is some functionality provided as a lens view vs. providing transformation functions directly? I understand lenses provide more generality, but the design is not obvious beyond that. What does it mean that the lenses are described as "improper" in the haddocks? – daj Jun 05 '16 at 15:47
  • Those are good questions for http://stackoverflow.com/users/1026598/gabriel-gonzalez ! Another place to ask questions specifically about pipes is the [pipes mailing list](https://groups.google.com/forum/#!forum/haskell-pipes) – ErikR Jun 05 '16 at 16:10

pipes-bytestring and pipes-group are arranged so that repeatedly breaking a Producer ByteString m r yields a FreeT (Producer ByteString m) m r. FreeT can here be read to mean A_Succession_Of, so the result can be thought of as 'a succession of bytestring-producer segments returning an r'. This way if one of the segments is, say, 10 gigabytes long, we still have streaming rather than a 10 gigabyte strict bytestring.

It looks to me like you want to break the bytestring producer on newlines, but I couldn't tell whether you want to keep the newlines. If you are throwing them out, this is the same as splitting the bytestring producer with view PB.lines, followed by concatenating each subordinate producer into a single strict bytestring - the individual line. I wrote this below as accumLines. It is straightforward, but makes a tiny use of Lens.view to turn the fancy PB.lines lens into a regular function. (Many operations are written as lenses in pipes-bytestring because they can then be re-used for other purposes, especially the kind of producer parsing pipes favors.)

import Pipes
import qualified Pipes.Prelude as P
import Pipes.ByteString as PB
import qualified Pipes.Group as PG
import Pipes.GZip

import qualified Data.ByteString.Internal as BS (c2w, w2c)

import System.IO
import Lens.Simple (view) -- or Control.Lens or whatever
import Data.Monoid

main = run >>= mapM_ print

myPipe fileHandle = P.toListM $ accumLines (decompress fileProducer)
  where
    fileProducer = PB.fromHandle fileHandle

run = do
  dat <- withFile "a.gz" ReadMode myPipe
  pure dat

-- little library additions

accumLines :: Monad m => Producer ByteString m r -> Producer ByteString m r
accumLines = mconcats . view PB.lines 

accumSplits :: Monad m => Char -> Producer ByteString m r -> Producer ByteString m r
accumSplits c  = mconcats . view (PB.splits (BS.c2w c)) 

-- this is convenient, but the operations above could 
-- be more rationally implemented using e.g. BL.fromChunks and toListM 
mconcats :: (Monad m, Monoid b) => FreeT (Producer b m) m r -> Producer b m r
mconcats = PG.folds (<>) mempty id

Ideally you would not write a new bytestring at each line break. Whether you have to depends on what you were going to do with the lines.
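
For instance, if all you need from each line is some summary value, you can fold each segment directly with PG.folds, in the same spirit as mconcats above, and no per-line strict bytestring is ever built. A sketch computing each line's length (the name lineLengths is mine):

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Pipes
import qualified Pipes.Prelude as P
import qualified Pipes.Group as PG
import qualified Pipes.ByteString as PB
import qualified Data.ByteString as BS
import Lens.Simple (view)

-- Fold each line's chunks down to a running length; the line itself
-- is never assembled into a single strict ByteString.
lineLengths :: Monad m => Producer BS.ByteString m r -> Producer Int m r
lineLengths = PG.folds (\n chunk -> n + BS.length chunk) 0 id . view PB.lines
```

So, for example, `P.toList (lineLengths (yield "ab\ncdef"))` gives `[2,4]`.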

Michael
  • Is `PG.concats` the same as your `mconcats`? – ErikR Jun 04 '16 at 19:31
  • No, `PG.concats` just removes the `FreeT` breaks in the succession, so it doesn't have a monoid constraint. The analogy to `concat` follows the general analogy `FreeT (Producer a m) m r : Producer a m r :: [[a]] : [a]`. My little `mconcats` crushes each successive producer of monoidal values into a summary monoidal value - that is, it does a number of little mconcat-like operations. Maybe it's not the best name. – Michael Jun 04 '16 at 21:27
  • I'm running every n lines through a parser (probably attoparsec or megaparsec) to build records. It would seem like I would have to build individual bytestrings (at least one per record) for that. Or is there a way to avoid that? – daj Jun 05 '16 at 15:51
  • If you have written a single attoparsec parser that succeeds on a given n-line segment and returns a record, then you can repeatedly apply it directly to the bytestring producer, without breaking it into lines. See [`Pipes.Attoparsec.parsed`](https://hackage.haskell.org/package/pipes-attoparsec-0.5.1.3/docs/Pipes-Attoparsec.html#v:parsed). So, `parsed my_n_line_parser . decompress` will stream the records as they come from the compressed file; if it fails it will return the rest of the (decompressed) producer of bytestrings along with an error message. If it fits the task this may be simplest. – Michael Jun 05 '16 at 19:33