attoparsec high memory usage reading huge file

Question

I've tried many approachs for parsing a file content line by line, but at the present time is not working and when it runs, uses a lot of memory (more than 16GB).

This is a subset of the file I want to parse http://lpaste.net/144719

I want three kinds of errors:

1) error with backtrace (multiple lines, the first of them is like 3))
2) single error with one more line
3) single line error

Here is my current code:

import qualified Data.ByteString as B
import Data.ByteString.Char8 as B8 hiding (lines, filter, unlines, head, readFile, take, length,
                                           putStrLn, tail, map, concat, or, writeFile, intersperse,
                                           groupBy, hGetContents)
import qualified Data.Text as T
import qualified Data.Text.IO as TIO
import Data.Attoparsec.Text hiding (take)
import Control.Applicative
import Control.Monad (replicateM, mapM)
import Data.Either (either)
import Data.List (intersperse, groupBy)
import System.Environment
import qualified System.IO as SIO

data TimeStamp = MkTimeStamp T.Text
               deriving Show

data LogFileInfo = BackTraceLineInfo T.Text
                 | BackTraceInfo TimeStamp T.Text T.Text [LogFileInfo]
                 | Error TimeStamp T.Text
                 | LargeError TimeStamp T.Text T.Text
                 deriving Show

data LineType = SingleLineError TimeStamp T.Text
              | DirectoryInfo T.Text
              | ErrorInfo T.Text
              | LineBackTraceInfo T.Text
              | BackTraceString T.Text
              | BackTraceLine T.Text
              deriving Show

parseTimeStamp :: Parser TimeStamp
parseTimeStamp = do
  year <- many digit
  char '-'
  month <- many digit
  char '-'
  day <- many digit
  char ' '
  hour <- many digit
  char ':'
  minute <- many digit
  char ':'
  second <- many digit
  char ' '
  (return . MkTimeStamp) $ T.pack $ year ++ "-" ++ month ++ "-" ++ day ++ " " ++ hour ++ ":" ++ minute ++ ":" ++ second

parseError :: Parser LineType
parseError = do
  string $ T.pack "ERROR - "
  timeStamp <- parseTimeStamp
  errorInfo <- parseAnyLine
  return $ SingleLineError timeStamp errorInfo

parseDirectoryInfo :: Parser LineType
parseDirectoryInfo = do
  char '/'
  directoryInfo <- parseAnyLine
  (return . DirectoryInfo) $ T.append (T.pack "/") directoryInfo

parseErrorInfo :: Parser LineType
parseErrorInfo = do
  errorInfo <- parseAnyLine
  (return . ErrorInfo) errorInfo

parseBackTraceString :: Parser LineType
parseBackTraceString = do
  let backTraceStr = T.pack " Backtrace: "
  string backTraceStr
  return $ BackTraceString backTraceStr

parseBacktraceLine :: Parser LineType
parseBacktraceLine = do
  char '#'
  number <- many1 digit
  backTraceInfo <- parseAnyLine
  let numberPart = T.pack $ '#' : number
  return $ LineBackTraceInfo $ T.append numberPart backTraceInfo

parseAnyLine :: Parser T.Text
parseAnyLine = fmap T.pack $ many anyChar

-- Skips n lines for allowing other parsers to succeed
skipNLines n = replicateM n $ manyTill anyChar endOfLine

-- performParser :: Parser a -> T.Text -> BackTraceInfo
performParser = parseOnly

getEitherRight :: Either a b -> b
getEitherRight (Right b) = b

parseLogFile :: [T.Text] -> [LineType]
parseLogFile textxs = 
  let listaEithers = mapM (parseOnly $
                           try parseError
                       <|> try parseDirectoryInfo
                       <|> try parseBacktraceLine
                       <|> try parseBackTraceString
                       <|> parseErrorInfo) textxs
  in getEitherRight listaEithers

customUnlines :: [String] -> String
customUnlines []     = []
customUnlines (x:xs) = if x == "\n"
                         then '\n':customUnlines xs
                         else x ++ "\n" ++ customUnlines xs

main = do
  (fileName : _) <- getArgs
  h <- SIO.openFile fileName SIO.ReadMode
  SIO.hSetEncoding h SIO.latin1
  fileContents <- SIO.hGetContents h
  let titleLength           = length fileName
      titleWithoutExtension = take (titleLength - 4) fileName
      allNonEmptyLines      = map T.pack $ intersperse "\n" $ tail $ filter (/= "") $ lines fileContents -- [T.Text]
      listParseResults      = parseLogFile allNonEmptyLines -- [LineType]
      -- onlyModelErrors       = filter isModelError parseResult -- [LogFileInfo]
      -- onlyOneRepresentative = map head $ groupBy equalErrors onlyModelErrors
      listOfStrings         = map show listParseResults
  writeFile (titleWithoutExtension ++ ".logsummary") $ customUnlines listOfStrings

The first problem is that the parser isn't parsing anything. And the second problem is that is using 16GB of RAM. How can improve my approach?

You should include your imports so we know what `SIO` and `T` refer to and which parser combinator library you are using. — ErikR, Nov 07 '15 at 14:32

ErikR · Accepted Answer · 2015-11-07T15:49:08.967

2

There are at least two issues - writeFile and customUnlines.

writeFile needs to collect all of the output before writing it, so I would first see if this produces output:

h <- openFile "summary.txt" WriteMode
forM_ listOfStrings (hPutStrLn h)
hClose h

This should process the log file in a streaming fashion if listOfStrings is a lazy list.

Assuming this works, to implement your customUnlines logic I would do this:

h <- openFile "summary.txt" WriteMode
forM_ listOfStrings $ \x -> do
  if x == "\n"
    then hPutStr h "\n"
    else hPutStrLn h "\n"
hClose h

If listOfStrings is not a lazy list then I need your imports to further debug the problem.

Update

It turns out that listOfStrings is not a lazy list due to parseLogFile.

Note that listaEithers has type Either String [LineType]. This means that you have to parse all of the lines before it returns. Instead, you should parse each line individually:

forM_ allNonEmptyLines $ \x -> do
  case parseOnly parseLogLine x of
    Left e -> error "oops"
    Right a -> print a       -- a is a LineType

Here parseLogLine is:

parseLogLine =
  try parseError
  <|> try parseDirectoryInfo
  <|> try parseBacktraceLine
  <|> try parseBackTraceString
  <|> parseErrorInfo

edited Nov 07 '15 at 15:49

answered Nov 07 '15 at 14:56

ErikR

51,541
9
73
124

One last question. When I try to store the results in a list, for example, it uses a lot of memory (more than ever): stringList <- forM allNonEmptyLines $ \x -> do case parseOnly parseLogLine x of Left e -> return $ show e Right a -> return $ show a. How can I store the parse results without using too much memory? – freinn Nov 07 '15 at 18:27
Why do you want to store the results? I estimate you are generating at least 100 million entries - of course it is going to require a lot of memory. What do you actually want to _do_ with the parse results? – ErikR Nov 07 '15 at 18:45
I want to classify them, for example delete the repeated ones (which only differ in its timestamp) and other actions like this. – freinn Nov 07 '15 at 18:51
That is complicated enough that you should ask that in new question. – ErikR Nov 07 '15 at 18:53

attoparsec high memory usage reading huge file

1 Answers1