6

I was wondering if there's an easy way to get lines one at a time out of a file without eventually loading the whole file into memory. I'd like to do a fold over the lines with an attoparsec parser. I tried using Data.Text.Lazy.IO with hGetLine and it blows through my memory; I later read that it eventually loads the whole file.

I also tried using pipes-text with folds and view lines:

s <- Pipes.sum $ 
    folds (\i _ -> (i+1)) 0 id (view Text.lines (Text.fromHandle handle))
print s

to just count the number of lines, but it seems to be doing some wonky stuff ("hGetChunk: invalid argument (invalid byte sequence)"), and it takes 11 minutes whereas wc -l takes 1 minute. I heard that pipes-text might have some issues with gigantic lines? (Each line is about 1 GB.)

I'm really open to any suggestions; I can't find much by searching except for newbie readLine how-tos.

Thanks!

Charles Durham
  • You are using Pipes.Text.IO for input rather than Pipes.ByteString + decoding as the library advises. The error message is from the text library, which is making a judgment about the system encoding for each chunk; I assume it is saying it can't understand the chunk according to whatever it thinks the encoding is. – Michael Mar 08 '17 at 18:18

2 Answers

7

The following code uses Conduit, and will:

  • UTF8-decode standard input
  • Run the lineC combinator as long as there is more data available
  • For each line, simply yield the value 1 and discard the line content, without ever reading the entire line into memory at once
  • Sum up the 1s yielded and print it

You can replace the yield 1 code with something that processes the individual lines (a variant sketch follows the code below).

#!/usr/bin/env stack
-- stack --resolver lts-8.4 --install-ghc runghc --package conduit-combinators
import Conduit

main :: IO ()
main = (runConduit
     $ stdinC                                   -- stream stdin as ByteString chunks
    .| decodeUtf8C                              -- decode the chunks to Text
    .| peekForeverE (lineC (yield (1 :: Int)))  -- per line: yield 1, discard content
    .| sumC) >>= print                          -- sum the 1s and print the total
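
For illustration, here is a minimal variant sketch under the same lts-8.4 / conduit-combinators assumptions: instead of yielding 1, each line is reduced with lengthCE (a streaming element count over the line's chunks), so the pipeline reports the length of the longest line while still never holding a whole line in memory. The choice of maximumC is illustrative, not part of the original answer.

#!/usr/bin/env stack
-- stack --resolver lts-8.4 --install-ghc runghc --package conduit-combinators
import Conduit

main :: IO ()
main = (runConduit
     $ stdinC          -- stream stdin as ByteString chunks
    .| decodeUtf8C     -- decode the chunks to Text
       -- per line: count its characters chunk by chunk, then yield the count
    .| peekForeverE (lineC (do len <- lengthCE; yield (len :: Int)))
    .| maximumC)       -- longest line length, if any input was seen
    >>= print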
Michael Snoyman
  • Not sure if this is the best place to ask, but when I do a foldC in place of sumC over my parsed lines, on a 3-int constant-size monoid, it seems to blow out all my memory again. Am I leaving anything out? I also tried foldlC. – Charles Durham Mar 10 '17 at 18:56
  • Never mind, I did a deepseq instead of seq and everything worked fine – Charles Durham Mar 10 '17 at 19:08
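
A note on the strictness issue raised in these comments: below is a minimal sketch, where Stats is a hypothetical stand-in for the commenter's 3-int accumulator, not their actual type. With strict constructor fields, the strict left fold foldlC forces all three components on every step, so no thunks pile up and no deepseq is needed.

#!/usr/bin/env stack
-- stack --resolver lts-8.4 --install-ghc runghc --package conduit-combinators
import Conduit

-- hypothetical stand-in for the 3-int monoid mentioned above;
-- the strict (!) fields are what prevent the memory blow-up
data Stats = Stats !Int !Int !Int deriving Show

main :: IO ()
main = (runConduit
     $ stdinC
    .| decodeUtf8C
    .| peekForeverE (lineC (do len <- lengthCE; yield (len :: Int)))
    .| foldlC step (Stats 0 0 0)) >>= print
  where
    -- track line count, total characters, and longest line, all strictly
    step (Stats n total longest) len =
      Stats (n + 1) (total + len) (max longest len)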
3

This is probably easiest as a fold over the decoded text stream:

import Pipes
import qualified Pipes.Prelude as P
import qualified Pipes.ByteString as PB
import qualified Pipes.Text.Encoding as PT
import qualified Control.Foldl as L
import qualified Control.Foldl.Text as LT

main :: IO ()
main = do
  -- stream stdin as bytes, decode to Text, and fold a newline count over it
  n <- L.purely P.fold (LT.count '\n') $ void $ PT.decodeUtf8 PB.stdin
  print n

It takes about 14% longer than wc -l for the file I produced, which was just long lines of commas and digits. IO should properly be done with Pipes.ByteString, as the documentation says; the rest is conveniences of various sorts.

You can map an attoparsec parser over each line, distinguished by view lines, but keep in mind that an attoparsec parser can accumulate the whole text as it pleases, and this might not be a great idea over a 1-gigabyte chunk of text. If there is a repeated figure on each line (e.g. whitespace-separated numbers) you can use Pipes.Attoparsec.parsed to stream them, as in the sketch below.
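
For instance, a minimal sketch of the Pipes.Attoparsec.parsed approach, assuming the input is just decimal digits separated by commas and newlines; the number parser and the summing fold are illustrative assumptions, not part of the answer:

import Pipes
import qualified Pipes.Prelude as P
import qualified Pipes.ByteString as PB
import qualified Pipes.Text.Encoding as PT
import qualified Pipes.Attoparsec as PA
import qualified Data.Attoparsec.Text as A

-- one figure: a decimal number followed by any run of separators
number :: A.Parser Int
number = A.decimal <* A.skipWhile (\c -> c == ',' || c == '\n')

main :: IO ()
main = do
  -- stream the numbers straight out of the decoded input and sum them,
  -- without ever materializing a whole line
  total <- P.fold (+) 0 id $ void $ PA.parsed number $ void $ PT.decodeUtf8 PB.stdin
  print total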

Michael