
I need to parse and process a text file that is a nested list of integers. The file is about 250 MB. This already leads to performance problems: my naive solution takes 20 GB or more of RAM.

The question is related to another question.

I have written about the memory problems, and the suggestion was to use Data.Vector to get rid of them.

So the goal is to process a nested list of integers and, say, filter the values so that only values larger than 30 get printed out.

Test file "myfile.txt":

11,22,33,44,55
66,77,88,99,10

Here is my code using Attoparsec, adapted from attoparsec-csv:

    {-# Language OverloadedStrings #-}


-- adapted from https://github.com/robinbb/attoparsec-csv


module Text.ParseCSV
   ( 
   parseCSV
   ) where

import Prelude hiding (concat, takeWhile)
import Control.Applicative ((<$>), (<|>), (<*>), (<*), (*>), many)
import Control.Monad (void, liftM)
import Data.Attoparsec.Text
import qualified Data.Text as T (Text, concat, cons, append, pack, lines)
import qualified Data.Text.IO as IO (readFile, putStr)

import qualified Data.ByteString.Char8 as BSCH (readInteger)


lineEnd :: Parser ()
lineEnd =
   void (char '\n') <|> void (string "\r\n") <|> void (char '\r')
   <?> "end of line"

parserInt :: Parser Integer
parserInt = (signed decimal)

record :: Parser [Integer]
record =
   parserInt `sepBy1` char ','
   <?> "record"

file :: Parser [[Integer]]
file =
   (:) <$> record
       <*> manyTill (lineEnd *> record)
                    (endOfInput <|> lineEnd *> endOfInput)
   <?> "file"


parseCSV :: T.Text -> Either String [[Integer]]
parseCSV = 
   parseOnly file


getValues :: Either String [[Integer]] -> [Integer] 
getValues (Right [x]) = x
getValues _ = []


getLines :: FilePath -> IO [T.Text]
getLines = liftM T.lines . IO.readFile

parseAndFilter :: T.Text -> [Integer]
parseAndFilter = ((\x -> filter (>30) x) . getValues . parseCSV)

main = do
    list <- getLines "myfile.txt"
    putStr $ show $ map parseAndFilter list

But instead of using a list [Integer] I would like to use Data.Vector.

I found a relevant part in the Data.Vector tutorial:

-- The simplest way to parse a file of Int or Integer types is with a strict
-- or lazy ByteString, and the readInt or readInteger functions:

{-# LANGUAGE BangPatterns #-}

import qualified Data.ByteString.Lazy.Char8 as L
import qualified Data.Vector                as U
import System.Environment

main = do
    [f] <- getArgs
    s   <- L.readFile f
    print . U.sum . parse $ s

-- Fill a new vector from a file containing a list of numbers.
parse = U.unfoldr step
  where
     step !s = case L.readInt s of
        Nothing       -> Nothing
        Just (!k, !t) -> Just (k, L.tail t)

However, this is regular, not a nested list of integers.
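What I imagine is mapping the same unfoldr step over each line, something like this (untested sketch; parseLine and parseNested are just names I made up — note that using L.drop 1 instead of L.tail avoids a crash after the last field of a line):

```haskell
{-# LANGUAGE BangPatterns #-}

import qualified Data.ByteString.Lazy.Char8 as L
import qualified Data.Vector as U

-- Parse one line of comma-separated ints into a vector.
parseLine :: L.ByteString -> U.Vector Int
parseLine = U.unfoldr step
  where
    step !s = case L.readInt s of
      Nothing       -> Nothing
      -- drop 1 (not tail), so the "" after the last field doesn't crash
      Just (!k, !t) -> Just (k, L.drop 1 t)

-- One vector per input line: a nested structure.
parseNested :: L.ByteString -> [U.Vector Int]
parseNested = map parseLine . L.lines
```

Filtering would then be `map (U.filter (> 30)) . parseNested` on the file contents.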

I tried to adapt my code but it did not work.

How can I change my code to use a nested Vector (or Vector of Vectors) instead of [Integer], while also running the filter (> 30) on the Vector?

mrsteve

2 Answers


There is an important question you don't mention in the posting: do you need everything in memory at once? If the processing is local, or if you can summarize all the data up to a point in the file with a few values, you can solve the performance problems by streaming the data through and throwing away all but the current line. This will usually run much faster and let you process files orders of magnitude larger. And it usually doesn't even matter (as much) what data structure you use to parse the values.

Here is an example:

import Text.Regex

process :: [Int] -> String
process = (++ "\n") . show . sum -- put whatever you want here

main = interact (concat . map (process . map read . splitRegex (mkRegex ",")) . lines)

The whole program runs lazily, so it processes line by line as the data comes in and frees up the memory for old data (you can check this by typing in data by hand and watching the output come out). There is a performance hit from using plain, unpacked structures like String, but it isn't as big a problem as pulling everything into memory.

Many problems that don't seem to fit this criterion at first can be modified to do so (you may have to sort the data first, but there are many efficient ways to do that). I once rewrote the full online stats system for a gaming company following this principle, and took the stats-crunching time from hours down to a couple of minutes (while adding even more metrics).

Because of its lazy nature, Haskell is a good language to stream data through.
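As a sketch of the same streaming shape using lazy ByteStrings instead of String and Text.Regex (my own untested adaptation; here a field that fails to parse is simply read as 0):

```haskell
import qualified Data.ByteString.Lazy.Char8 as L

-- Summarize one line; put whatever local processing you want here.
process :: [Int] -> L.ByteString
process = L.pack . (++ "\n") . show . sum

-- Split a line on commas and read each field (unparsable fields become 0).
readFields :: L.ByteString -> [Int]
readFields = map (maybe 0 fst . L.readInt) . L.split ','

main :: IO ()
main = L.interact (L.concat . map (process . readFields) . L.lines)
```

This keeps the same constant-memory, line-by-line behavior while avoiding most of the regex and String overhead.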

jamshidh
  • This is an excellent answer, I learned a lot. But the example runs 10 times slower (probably not meant to be fast). I just adapted my code. Thanks again! – mrsteve Dec 05 '13 at 03:14
  • Thanks! By the way, if you want both to keep memory at bay, and also get raw low level speed, check out the standard unix tools, "sort", "uniq", "awk" and "perl" (for one line regex replacement). It isn't as structured as Haskell, SQL, etc, but it blows everything else away for speed (it is even comparable with c). – jamshidh Dec 05 '13 at 03:20
  • I am just posting my answer, where I change the code to vector and the result is fast and memory efficient. I don't know about awk and perl, but there is some indication that at least grep is not so fast. – mrsteve Dec 05 '13 at 03:22

I found a post saying that there is no easy way to parse directly into a vector with attoparsec.

See this forum post and thread.

But the good news is that the overhead of Data.Vector.fromList isn't so bad.

Attoparsec seems to be quite fast for parsing.

I keep the whole data in memory, and this doesn't seem to be a speed overhead. It's more flexible, as I may later need the whole data in memory, although it is not strictly needed for my current problem.

Currently the code runs in ~30 seconds and uses about 1.5 GB of RAM for a 150 MB text file. Memory consumption is now small compared with the 20 GB before, so I only need to focus on improving the speed.
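For reference, going from the parser's [[Integer]] result to a true Vector of Vectors is just a nested fromList; a small sketch (toNested and filterRows are names I made up):

```haskell
import qualified Data.Vector as V

-- Turn the parser's [[Integer]] result into a Vector of Vectors.
toNested :: [[Integer]] -> V.Vector (V.Vector Integer)
toNested = V.fromList . map V.fromList

-- Apply the threshold filter to every row in one pass.
filterRows :: Integer -> V.Vector (V.Vector Integer) -> V.Vector (V.Vector Integer)
filterRows n = V.map (V.filter (> n))
```

In the code below I instead keep the outer structure per line and only build inner vectors, which amounts to the same thing.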

Here are the changes from the code in my question; the commented-out code uses lists, and the functions with Vector in their types are new (this is not production code, nor meant to be good code yet):

{-
getValues :: Either String [[Integer]] -> [Integer] 
getValues (Right [x]) = x
getValues _ = []
-}

getValues :: Either String [[Integer]] -> Vector Integer
getValues (Right [x]) = V.fromList x
getValues _ = V.fromList [999999,9999999,99999,999999] --- represents an ERROR


getLines :: FilePath -> IO [T.Text]
getLines = liftM T.lines . IO.readFile

{-
parseAndFilter :: T.Text -> [Integer]
parseAndFilter = ((\x -> filter (>30) x) . getValues . parseCSV)
-}

filterLarger :: Vector Integer -> Vector Integer
filterLarger = \x -> V.filter (>37) x

parseVector :: T.Text -> Vector Integer
parseVector = (getValues . parseCSV)


-- mystr = T.pack "3, 6, 7" --, 13, 14, 15, 17, 21, 22, 23, 24, 25, 28, 29, 30, 32, 33, 35, 36"

main = do
    list <- getLines "mydata.txt"
    --putStr $ show $ parseCSV $ mystr  
    putStr $ show $ V.map filterLarger $ V.map parseVector $ V.fromList list



--show $ parseOnly parserInt $ T.pack "123"

Thanks to jamshidh and all the comments that pointed me to the right direction.

Here is the final solution. After switching to ByteString and Int in the code, it now runs twice as fast with a bit less memory consumption (time is now ~14 seconds).

{-# Language OverloadedStrings #-}


-- adapted from https://github.com/robinbb/attoparsec-csv

module Main
   ( 
   parseCSV, main
   ) where

import Data.Vector as V (Vector, fromList, map, head, filter)

import Prelude hiding (concat, takeWhile)
import Control.Applicative ((<$>), (<|>), (<*>), (<*), (*>), many)
import Control.Monad (void, liftM)


import Data.Attoparsec.Char8 

import qualified Data.ByteString.Char8 as B


lineEnd :: Parser ()
lineEnd =
   void (char '\n') <|> void (string "\r\n") <|> void (char '\r')
   <?> "end of line"

parserInt :: Parser Int
parserInt = skipSpace *> signed decimal


record :: Parser [Int]
record =
   parserInt `sepBy1` char ','
   <?> "record"

file :: Parser [[Int]]
file =
   (:) <$> record
       <*> manyTill (lineEnd *> record)
                    (endOfInput <|> lineEnd *> endOfInput)
   <?> "file"


parseCSV :: B.ByteString -> Either String [[Int]]
parseCSV = 
   parseOnly file


getValues :: Either String [[Int]] -> Vector Int
getValues (Right [x]) = V.fromList x
getValues _ = error "ERROR in getValues function!"



filterLarger :: Vector Int -> Vector Int
filterLarger = \x -> V.filter (>36) x


parseVector :: B.ByteString -> Vector Int
parseVector = (getValues . parseCSV)


-- MAIN
main = do

    fContent <- B.readFile "myfile.txt"
    putStr $ show $ V.map filterLarger $ V.map parseVector $ V.fromList $ B.lines fContent
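One further tweak I have not benchmarked: since the element type is now Int, the inner vectors could be unboxed (Data.Vector.Unboxed), which typically reduces memory and speeds up filter. A minimal sketch of the pieces that would change (filterLargerU and toRowU are hypothetical names):

```haskell
import qualified Data.Vector.Unboxed as VU

-- Unboxed variant of the row filter.
filterLargerU :: VU.Vector Int -> VU.Vector Int
filterLargerU = VU.filter (> 36)

-- getValues would build an unboxed vector from each parsed row instead.
toRowU :: [Int] -> VU.Vector Int
toRowU = VU.fromList
```

The outer structure would stay a boxed Vector (or list) of these unboxed rows, since Data.Vector.Unboxed cannot hold vectors as elements directly.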
mrsteve