3

I'm using Data.Text.Lazy to process some text files. I read in 2 files and distribute their text to 3 files according to some criteria. The loop which does the processing is go'. I've designed it in a way in which it should process the files incrementally and keep nothing huge in memory. However, as soon as the execution reaches the go' part the memory keeps on increasing till it reaches around 90MB at the end, starting from 2MB.

Can someone explain why this memory increase happens and how to avoid it?

import qualified Data.Text.Lazy as T
import qualified Data.Text.Lazy.IO as TI
import System.IO
import System.Environment
import Control.Monad

main = do
  [in_en, in_ar] <- getArgs
  [h_en, h_ar] <- mapM (`openFile` ReadMode) [in_en, in_ar]
  hSetEncoding h_en utf8
  en_txt <- TI.hGetContents h_en
  let len = length $ T.lines en_txt
  len `seq` hClose h_en
  h_en <- openFile in_en ReadMode
  hs@[hO_lm, hO_en, hO_ar] <- mapM (`openFile` WriteMode) ["lm.txt", "tun_"++in_en, "tun_"++in_ar]
  mapM_ (`hSetEncoding` utf8) [h_en, h_ar, hO_lm, hO_en, hO_ar]
  [en_txt, ar_txt] <- mapM TI.hGetContents [h_en, h_ar]
  let txts@[_, _, _] = map T.unlines $ go len en_txt ar_txt
  zipWithM_ TI.hPutStr hs txts
  mapM_ (liftM2 (>>) hFlush hClose) hs
  print "success"
  where
    go len en_txt ar_txt = go' (T.lines en_txt) (T.lines ar_txt)
      where (q,r) = len `quotRem` 3000
            go' [] [] = [[],[],[]]
            go' en ar = let (h:bef, aft)    = splitAt q en 
                            (hA:befA, aftA) = splitAt q ar 
                            ~[lm,en',ar']   = go' aft aftA
                        in [bef ++ lm, h:en', hA:ar']

EDIT

As per @kosmikus's suggestion I've tried replacing zipWithM_ TI.hPutStr hs txts with a loop which prints line by line as shown below. The memory consumption is now 2GB+!

fix (\loop lm en ar -> do
  case (en,ar,lm) of
    ([],_,lm) -> TI.hPutStr hO_lm $ T.unlines lm
    (h:t,~(h':t'),~(lh:lt)) -> do
      TI.hPutStrLn hO_en h
      TI.hPutStrLn hO_ar h'
      TI.hPutStrLn hO_lm lh
      loop lt t t')
  lm en ar

What's going on here?

haskelline
  • 1,116
  • 7
  • 15
  • Can you provide the file you use for testing? I'm not certain what inputs should succeed, each file I try gives a pattern match error at some point. – user2407038 Nov 25 '13 at 04:56
  • @user2407038 the 2 input files should have the exactly the same number of lines. I believe that's the pattern match error you're facing. – haskelline Nov 25 '13 at 05:12

1 Answers1

5

The function go' builds a [T.Text] with three elements. The list is built lazily: in each step of go each of the three lists becomes known to a certain extent. However, you consume this structure by printing each element to a file in order, using the line:

zipWithM_ TI.hPutStr hs txts

So the way you consume the data does not match the way you produce the data. While printing the first of the three list elements to a file, the other two are built and kept in memory. Hence the space leak.

Update

I think that for the current example, the easiest fix would be to write to the target files during the loop, i.e., in the go' loop. I'd modify go' as follows:

go' :: [T.Text] -> [T.Text] -> IO ()
go' [] [] = return ()
go' en ar = let (h:bef, aft)    = splitAt q en
                (hA:befA, aftA) = splitAt q ar
            in do
              TI.hPutStrLn hO_en h
              TI.hPutStrLn hO_ar hA
              mapM_ (TI.hPutStrLn hO_lm) bef
              go' aft aftA

And then replace the call to go and the subsequent zipWithM_ call with a plain call to:

go hs len en_txt ar_txt
kosmikus
  • 19,549
  • 3
  • 51
  • 66
  • Nice catch. I thought they'll not take up memory because they are not accessed. But I'm still worried since the first element of the list is supposed to be much larger than the other two. The other two should not exceed around 3000 (with some remainder) lines. Do two `Text`s of 3000 lines take up 90MB of memory? – haskelline Nov 25 '13 at 11:38
  • It's hard for me to make qualitative guesses without really having access to the files you're running it on. However, it's easy enough for you to replace the `zipWithM_` by `TI.hPutStr (head hs) (head txts)` and see how much improvement that yields. – kosmikus Nov 25 '13 at 12:14
  • I've tried replacing it with a loop which processes the result line by line. As mentioned in the edit the memory consumption is much more worse (2GB+). What do you think? – haskelline Nov 25 '13 at 15:51
  • That's not going to help, because as you say yourself, the first list is much larger. So that will be built up in memory now. Why don't you just print from `go`? – kosmikus Nov 25 '13 at 16:23
  • I was trying to keep the pure part separate. Is there really not any other solution? – haskelline Nov 25 '13 at 16:26
  • I don't see you win a lot by keeping the "pure part" separate here. I've edited my solution to make a concrete suggestion. – kosmikus Nov 25 '13 at 16:32
  • 2
    It works in constant space now. The reason I wanted to keep the pure part separate was if I wanted to complicate the processing later in the future. It would be really annoying carrying IO everywhere. But yes in this simple case it's not a win. – haskelline Nov 25 '13 at 17:07