6

I'm getting Heap exhausted message when running the following short Haskell program on a big enough dataset. For example, the program fails (with heap overflow) on 20 Mb input file with around 900k lines. The heap size was set (through -with-rtsopts) to 1 Gb. It runs ok if longestCommonSubstrB is defined as something simpler, e.g. commonPrefix. I need to process files in the order of 100 Mb.

I compiled the program with the following command line (GHC 7.8.3):

ghc -Wall -O2 -prof -fprof-auto "-with-rtsopts=-M512M -p -s -h -i0.1" SampleB.hs

I would appreciate any help in making this thing run in a reasonable amount of space (in the order of the input file size), but I would especially appreciate the thought process of finding where the bottleneck is and where and how to force the strictness.

My guess is that somehow forcing longestCommonSubstrB function to evaluate strictly would solve the problem, but I don't know how to do that.

{-# LANGUAGE BangPatterns #-}
module Main where 
import System.Environment (getArgs)
import qualified Data.ByteString.Lazy.Char8 as B
import Data.List (maximumBy, sort)
import Data.Function (on)
import Data.Char (isSpace)

-- | Returns a list of lexicon items, i.e. [[w1,w2,w3]]
readLexicon :: FilePath -> IO [[B.ByteString]]
readLexicon filename = do 
    text <- B.readFile filename
    return $ map (B.split '\t' . stripR) . B.lines $ text
    where
        stripR = B.reverse . B.dropWhile isSpace . B.reverse

transformOne :: [B.ByteString] -> B.ByteString
transformOne (w1:w2:w3:[]) = 
    B.intercalate (B.pack "|") [w1, longestCommonSubstrB w2 w1, w3]
transformOne a = error $ "transformOne: unexpected tuple " ++ show a

longestCommonSubstrB :: B.ByteString -> B.ByteString -> B.ByteString
longestCommonSubstrB xs ys = maximumBy (compare `on` B.length) . concat $ 
    [f xs' ys | xs' <- B.tails xs] ++
    [f xs ys' | ys' <- tail $ B.tails ys]
  where f xs' ys' = scanl g B.empty $ B.zip xs' ys'
        g z (x, y) = if x == y 
                then z `B.snoc` x
                else B.empty

main :: IO ()
main = do
    (input:output:_) <- getArgs
    lexicon <- readLexicon input
    let flattened = B.unlines . sort . map transformOne $ lexicon
    B.writeFile output flattened 

This is the profile ouput for the test dataset (100k lines, heap size set to 1 GB, i.e. generateSample.exe 100000, the resulting file size is 2.38 MB):

Heap profile over time:

Memory usage over time

Execution statistics:

   3,505,737,588 bytes allocated in the heap
     785,283,180 bytes copied during GC
      62,390,372 bytes maximum residency (44 sample(s))
         216,592 bytes maximum slop
              96 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0      6697 colls,     0 par    1.05s    1.03s     0.0002s    0.0013s
  Gen  1        44 colls,     0 par    4.14s    3.99s     0.0906s    0.1935s

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time    7.80s  (  9.17s elapsed)
  GC      time    3.75s  (  3.67s elapsed)
  RP      time    0.00s  (  0.00s elapsed)
  PROF    time    1.44s  (  1.35s elapsed)
  EXIT    time    0.02s  (  0.00s elapsed)
  Total   time   13.02s  ( 12.85s elapsed)

  %GC     time      28.8%  (28.6% elapsed)

  Alloc rate    449,633,678 bytes per MUT second

  Productivity  60.1% of total user, 60.9% of total elapsed

Time and Allocation Profiling Report:

       SampleB.exe +RTS -M1G -p -s -h -i0.1 -RTS sample.txt sample_out.txt

    total time  =        3.97 secs   (3967 ticks @ 1000 us, 1 processor)
    total alloc = 2,321,595,564 bytes  (excludes profiling overheads)

COST CENTRE            MODULE  %time %alloc

longestCommonSubstrB   Main     43.3   33.1
longestCommonSubstrB.f Main     21.5   43.6
main.flattened         Main     17.5    5.1
main                   Main      6.6    5.8
longestCommonSubstrB.g Main      5.0    5.8
readLexicon            Main      2.5    2.8
transformOne           Main      1.8    1.7
readLexicon.stripR     Main      1.8    1.9


                                                                            individual     inherited
COST CENTRE                  MODULE                       no.     entries  %time %alloc   %time %alloc

MAIN                         MAIN                          45           0    0.1    0.0   100.0  100.0
 main                        Main                          91           0    6.6    5.8    99.9  100.0
  main.flattened             Main                          93           1   17.5    5.1    89.1   89.4
   transformOne              Main                          95      100000    1.8    1.7    71.6   84.3
    longestCommonSubstrB     Main                         100      100000   43.3   33.1    69.8   82.5
     longestCommonSubstrB.f  Main                         101     1400000   21.5   43.6    26.5   49.5
      longestCommonSubstrB.g Main                         104     4200000    5.0    5.8     5.0    5.8
  readLexicon                Main                          92           1    2.5    2.8     4.2    4.8
   readLexicon.stripR        Main                          98           0    1.8    1.9     1.8    1.9
 CAF                         GHC.IO.Encoding.CodePage      80           0    0.0    0.0     0.0    0.0
 CAF                         GHC.IO.Encoding               74           0    0.0    0.0     0.0    0.0
 CAF                         GHC.IO.FD                     70           0    0.0    0.0     0.0    0.0
 CAF                         GHC.IO.Handle.FD              66           0    0.0    0.0     0.0    0.0
 CAF                         System.Environment            65           0    0.0    0.0     0.0    0.0
 CAF                         Data.ByteString.Lazy.Char8    54           0    0.0    0.0     0.0    0.0
 CAF                         Main                          52           0    0.0    0.0     0.0    0.0
  transformOne               Main                          99           0    0.0    0.0     0.0    0.0
  readLexicon                Main                          96           0    0.0    0.0     0.0    0.0
   readLexicon.stripR        Main                          97           1    0.0    0.0     0.0    0.0
  main                       Main                          90           1    0.0    0.0     0.0    0.0

UPDATE: The following program can be used to generate sample data. It expects one argument, the number of lines in the generated dataset. The generated data will be saved to the sample.txt file. When I generate 900k lines dataset with it (by running generateSample.exe 900000), the produced dataset makes the above program fail with heap overflow (the heap size was set to 1 GB). The resulting dataset is around 20 MB.

module Main where 
import System.Environment (getArgs)
import Data.List (intercalate, permutations)

generate :: Int -> [(String,String,String)]
generate n = take n $ zip3 (f "banana") (f "ruanaba") (f "kikiriki")
    where
        f = cycle . permutations

main :: IO ()
main = do
    (n:_) <- getArgs
    let flattened = unlines . map f $ generate (read n :: Int)
    writeFile "sample.txt" flattened 
    where
        f (w1,w2,w3) = intercalate "\t" [w1, w2, w3]
trincot
  • 317,000
  • 35
  • 244
  • 286
Glaukon
  • 321
  • 3
  • 12
  • 4
    Well `sort` cannot run in constant space: it needs to consume (and retain) its entire input before producing any output. – Reid Barton Sep 11 '15 at 12:56
  • 1
    While I don't think GHC has anything to do with the problem this time, you should always include the GHC version in the question text along with the profiler report. – dfeuer Sep 11 '15 at 14:55
  • @dfeuer GHC version 7.8.3 – Glaukon Sep 11 '15 at 17:36
  • @ReidBarton I know, I meant to say that to my imperative mind the space requirement for the function like `longestCommonSubstrB` should be space limited, i.e. not dependant on the size of the input file, but only on the size of its parameters. Of course, lazyness makes a difference, but also there should be a way to control it. – Glaukon Sep 11 '15 at 17:55
  • 1
    Why do you think that is not the case here? The profiling output you show is about the total amount of memory allocated by each function in your program. It says nothing about the lifetimes of those allocations or how much data is live on the heap at any one time (the "space requirement"). – Reid Barton Sep 11 '15 at 18:17
  • Similarly, to reverse the bytestring the whole file will need read. – Thomas M. DuBuisson Sep 11 '15 at 18:33
  • @ReidBarton The program fails (heap overflow) on 20 Mb input file with around 900k lines. The heap size was set to 2 Gb. It runs ok if longestCommonSubstrB is defined as something simpler, e.g. commonPrefix. I need to process files in the order of 100 Mb. – Glaukon Sep 11 '15 at 21:12

1 Answers1

1

It seems to me you've implemented a naive longest common substring, with terrible space complexity (at least O(n^2)). Strictness has nothing to do with it.

You'll want to implement a dynamic programming algo. You may find inspiration in the string-similarity package, or in the lcs function in the guts of the Diff package.

bummi
  • 27,123
  • 14
  • 62
  • 101
sclv
  • 38,665
  • 7
  • 99
  • 204