Parsing multi line log with attoparsec

Question

I'm trying to parse a multiline log like this

[xxx] This is 1
[xxx] This is also 1
[yyy] This is 2

I have these types defined

{-# LANGUAGE OverloadedStrings #-}

module Parser where

import Prelude hiding(takeWhile)
import Data.Text
import Data.Word
import Data.Attoparsec.Text as T
import Data.Char
import Data.String

data ID    = ID String deriving (Eq, Show)
data Entry = Entry ID String deriving (Eq, Show)
data Block = Block ID [String]
data Log   = Log [Block]

And defined these parsers:

parseID :: Parser ID
parseID = do
  char '['
  id <- takeTill ( == ']' )
  char ']'
  return $ ID $ unpack id

parseEntry :: Parser Entry
parseEntry = do
  id <- parseID
  char ' '
  content <- takeTill isEndOfLine
  return $ Entry id (unpack content)

This works ok when I do stuff like parseOnly parseEntry entryString and I get back an Entry.

The problem is when I try to parse something like the log I added at the start. I would get a [Entry] but I would like to get [Block].

Also I want that when 2 or more consecutive lines have the same ID (like xxx) the should get stored into the same block, so for parsing the aforementioned log I'd like to get back

[block1, block2]
-- block1 == Block "xxx" ["This is 1", "This is also 1"]
-- block2 == Block "yyy" ["This is 2"]

How can I make the parser create new blocks or add into the last generated one depending on if the ID changes?

One obvious solution is to simply generate a [Entry] and then use a folding function to convert it to [Block] with the proper logic, but I'd be doing 2 passes, 1 over the log and another over the [Entry] which seems not only is not too performant for large logs but also feels like the wrong way to do it (from my very limited attoparsec knowledge)

Any other ideas?

EDIT

Bob Dalgleish solution essentially works (many thanks!!!), just needed a few tweaks to make it work. This is my final solution:

data ID    = ID String deriving (Eq, Show)
data Entry = Entry ID String deriving (Eq, Show)
data Block = Block ID [String] deriving (Eq, Show)
data Log   = Log [Block] deriving (Eq, Show)

parseID :: Parser ID
parseID = do
  char '['
  id <- takeTill ( == ']' )
  char ']'
  return $ ID $ unpack id

parseEntry :: Parser Entry
parseEntry = do
  id <- parseID
  char ' '
  content <- takeTill isEndOfLine
  return $ Entry id (unpack content)

parseEntryFor :: ID -> Parser Entry
parseEntryFor blockId = do
  id <- parseID
  if blockId == id
     then do
       char ' '
       content <- takeTill isEndOfLine
       endOfLine <|> endOfInput
       return $ Entry id (unpack content)
  else fail "nonmatching id"

parseBlock :: Parser Block
parseBlock = do
  (Entry entryId s) <- parseEntry
  let newBlock = Block entryId [s]
  endOfLine <|> endOfInput
  entries <- many' (parseEntryFor entryId)
  return $ Block entryId (s : Prelude.map (\(Entry _ s') -> s') entries)

The approach that I've taken in such situations is to introduce an intermediate parsing step that maintains state, namely a `Block` that might be merged with a new `Entry`. It would not emit a `Block` until it was sure that the `Block` could not be grown. The other approach is to use attoparsec's ability to do unbounded lookahead to determine if the following `Entry` has the same id. — Bob Dalgleish, Jan 31 '19 at 19:56

score 1 · Accepted Answer · answered Jan 31 '19 at 20:45

You need to have a parser for Blocks. It accepts an Entry, does a lookahead for an Entry with the same id; if not the same, it backtracks and returns what it has so far.

First, let's introduce a conditional Entry parser:

parseEntryFor :: ID -> Parser Entry
parseEntryFor blockId = do
  id <- parseEntry
  if blockId == id
  then do
         char ' '
         content <- takeTill isEndOfLine
         endOfLine
         return $ Entry id (unpack content)
  else fail "nonmatching id"

-- |A Block consists of one or more Entry's with the same ID
parseBlock :: Parser Block
parseBlock = do
  (Entry entryId s) <- parseEntry
  let newBlock = Block entryId [s]
  endOfLine
  entries <- many' (parseEntryFor entryId)
  return $ Block entryId s: (map (\(Entry _ s') -> x') entries)

(This code is not tested, as I have only ever used Parsec.)

Parsing multi line log with attoparsec

EDIT

1 Answers1