
I'm using pipes, attoparsec, and pipes-attoparsec to write a database dump file converter. The general format of the file is to have a create table command followed by an optional insert command. In addition to transforming the statements in place, the table definitions have to be held in memory until the very end for additional processing (indexes, constraints, etc.).

This works fine, but now I need to allow some of my internal parsers to have access to my Producer's State in order to determine which parser needs to be run while processing the values from the insert command.

I tried something like this:

-- IO
import qualified Data.ByteString.Char8 as BS (putStrLn)
import System.Exit (ExitCode (..), exitSuccess, exitFailure)
import System.IO (hPutStrLn, stderr)

-- Pipes
import Pipes (runEffect, for, liftIO, Producer, Effect)
import Pipes.Attoparsec (parsed, ParsingError)
import Pipes.Lift (runStateP)
import Pipes.Safe (runSafeT)
import qualified Pipes.ByteString as PBS (stdin)

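-- State
import Control.Monad.Trans.Class (lift)
import Control.Monad.Trans.State.Strict

-- Names used below but missing from the import list
import Control.Applicative ((<|>))
import Control.Monad ((<=<))
import Control.Monad.IO.Class (MonadIO)
import Data.Attoparsec.ByteString (Parser)
import Data.ByteString (ByteString)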

dump' :: StateT ParserState Parser Command
dump' = fmap Create createStatements' <|> fmap Insert justData'

doStuff :: MonadIO m => Effect m (Either (ParsingError, Producer ByteString (StateT ParserState m) ()) (), ParserState)
doStuff = runStateP defaultParserState theStuff

theStuff :: MonadIO m => Effect (StateT ParserState m) (Either (ParsingError, Producer ByteString (StateT ParserState m) ()) ())
theStuff = for runParser (liftIO . BS.putStrLn <=< lift . processCommand)

runParser :: MonadIO m => Producer Command (StateT ParserState m) (Either (ParsingError, Producer ByteString (StateT ParserState m) ()) ())
runParser = do
    s <- lift get
    liftIO $ putStrLn "runParser"
    liftIO $ putStrLn $ show s
    parsed (evalStateT dump' s) PBS.stdin

processCommand :: MonadIO m => Command -> StateT ParserState m ByteString
processCommand (Create xs) = do
    currentState <- get
    liftIO $ putStrLn "processCommand"
    liftIO $ putStrLn $ show currentState
    _ <- put (currentState { constructs = xs ++ (constructs currentState)})
    return $ P.firstPass $ P.transformConstructs xs
processCommand (Insert x) = return x

Complete source (including parsers): https://github.com/cimmanon/mysqlnothx/blob/parser-state/src/Main.hs

When I run it, I get a result that looks something like this:

runParser
ParserState {constructs = []}
processCommand
ParserState {constructs = []}
processCommand
ParserState {constructs = [ ... ]}
processCommand
ParserState {constructs = [ ..... ]}

I was expecting runParser (which would grab the latest contents from State) to be run every time processCommand runs, but that's clearly not the case based on the output. When I check the contents of State within the parser, it's always empty no matter how many commands are parsed.

How can I extend State from my Producers to my Parser (dump') so that they share the same State? If my Producer has 4 values in State, the Parser should also see those same 4 values.

cimmanon
  • From where are you getting your `Parser` type? – danidiaz May 01 '17 at 14:29
  • @danidiaz For [`pipes-attoparsec`](https://hackage.haskell.org/package/pipes-attoparsec-0.5.1.5/docs/Pipes-Attoparsec.html) it needs to be an attoparsec `ByteString` [`Parser`](https://hackage.haskell.org/package/attoparsec-0.13.1.0/docs/Data-Attoparsec-ByteString.html#t:Parser). I figured out which one by the argument to the `Producer` returned in the error of `runParser`. – Cirdec May 01 '17 at 14:33
  • @danidiaz It comes from attoparsec (Data.Attoparsec.ByteString). – cimmanon May 01 '17 at 15:58

1 Answer

I was expecting runParser (which would grab the latest contents from State) to be run every time processCommand runs, but that's clearly not the case.

Your main effect is for runParser (liftIO . BS.putStrLn <=< lift . processCommand). To understand what this effect does you need to understand what for does:

(for p body) loops over p replacing each yield with body

"Loops over p" is accurate if a bit confusing. It doesn't run p once for each value produced by p; that would explode! Instead for replaces every yield in p with body. By replacing yield with body it runs body once for every yielded value. Running the body once for each produced value is similar to how in other languages a for-loop over a list runs the body once for each value in the list.
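The substitution can be seen in a small, self-contained sketch. The names `source` and `runLog` are made up for illustration and are not from the question; a `Writer` stands in for `IO` so the order of events can be inspected as a list.

```haskell
import Control.Monad.Trans.Class (lift)
import Control.Monad.Trans.Writer.Strict (Writer, execWriter, tell)
import Pipes (Producer, each, for, runEffect)

-- Hypothetical producer: logs once when it starts, then yields three values.
source :: Producer Int (Writer [String]) ()
source = do
    lift (tell ["source started"])
    each [1, 2, 3]

-- `for` substitutes the body for every yield inside `source`, so the body
-- logs once per value while the start-up line of `source` is logged once.
runLog :: [String]
runLog = execWriter (runEffect (for source body))
  where
    body n = lift (tell ["body " ++ show n])
```

Evaluating `runLog` gives `["source started","body 1","body 2","body 3"]`: the producer's preamble appears once and the body appears once per yielded value.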

Your runParser is

runParser = do
    s <- lift get
    liftIO $ putStrLn "runParser"
    liftIO $ putStrLn $ show s
    parsed (evalStateT dump' s) PBS.stdin

It reads the state, outputs it, and produces the Commands parsed from stdin. pipes-attoparsec's parsed parses the source and yields once for each successfully parsed value. Your for then replaces each of parsed's yields with liftIO . BS.putStrLn <=< lift . processCommand. The complete effect runs runParser once and processCommand once for each yield, which is what you're observing in the output.
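If the goal is for the parser to see the state as it stands before each command, one possible approach is to stop using parsed and instead call pipes-attoparsec's parse once per value, seeding the parser with a freshly read state each time. The following is only a sketch under that assumption: `parsedWithState`, `tagged`, and `demo` are hypothetical names, and `tagged` is a toy stand-in for `dump'`, not the parser from the question.

```haskell
import Control.Monad (void)
import Control.Monad.Trans.Class (lift)
import Control.Monad.Trans.State.Strict
    (StateT, evalState, get, modify, put, runStateT)
import qualified Data.Attoparsec.ByteString.Char8 as A
import Data.ByteString (ByteString)
import qualified Data.ByteString.Char8 as BC
import Pipes (Producer, for, yield)
import qualified Pipes.Attoparsec as PA
import qualified Pipes.Prelude as P

-- Hypothetical helper: parse one value at a time, re-reading the shared
-- state before every parse and writing the parser's final state back.
parsedWithState
    :: Monad m
    => StateT s A.Parser a
    -> Producer ByteString (StateT s m) r
    -> Producer a (StateT s m) (Either PA.ParsingError ())
parsedWithState p = go
  where
    go input = do
        s <- lift get
        -- Outer runStateT drives the pipes-parse parser over the input;
        -- inner runStateT turns the stateful parser into a plain one.
        (result, rest) <- lift (runStateT (PA.parse (runStateT p s)) input)
        case result of
            Nothing          -> return (Right ())   -- input exhausted
            Just (Left err)  -> return (Left err)
            Just (Right (a, s')) -> do
                lift (put s')
                yield a
                go rest

-- Toy stand-in for dump': a number on its own line, paired with the
-- state the parser saw when it ran.
tagged :: StateT Int A.Parser (Int, Int)
tagged = do
    n <- lift (A.decimal <* A.endOfLine)
    seen <- get
    return (n, seen)

-- The `for` body updates the state, as processCommand does in the question.
demo :: [(Int, Int)]
demo = evalState (P.toListM (void (for parser step))) 0
  where
    parser = parsedWithState tagged (yield (BC.pack "1\n2\n3\n"))
    step (n, seen) = lift (modify (+ n)) >> yield (n, seen)
```

Here `demo` evaluates to `[(1,0),(2,1),(3,3)]`: each parse sees the total accumulated by the body for the earlier values. The cost of this design is that the attoparsec parser is restarted for every value, which is also what parsed does internally.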

Cirdec
  • That does explain the behavior I'm seeing, but is it possible to achieve the behavior I'm looking for? I'm prepared to rewrite my parsers to work differently if I have to, but I'm hoping to save myself several hours of work. – cimmanon May 01 '17 at 16:02
  • @cimmanon I'm not sure from your question what behaviour you're looking for. Perhaps it'd be easier to ask two questions: "why does my code do this?", focusing on the description of your existing code (like this one does), and "how can I make it do X?", focusing on the description of the X that you want to accomplish. – Cirdec May 01 '17 at 16:07
  • @cimmanon But here's a hint: if you want to do something else at the end after all of the `Command`s have been parsed, try adding some lines to the end of `runParser` (like `s' <- lift get; liftIO $ putStrLn "runParser end"; liftIO $ putStrLn $ show s'`) and see if that's what you want. Or at the end of `theStuff` or `doStuff` if that makes more sense. – Cirdec May 01 '17 at 16:11
  • It's a parser downstream on the Insert branch that needs access to the table definitions parsed so far so that I can run the correct parser for each column type (it's pretty much guaranteed to be create table, then insert for that table, but other create statements that I'm capturing can appear in between like indexes). The DB that generates the dumps I'm parsing produces invalid dates and binary types are in the wrong format... and they look indistinguishable from an ordinary quoted string. – cimmanon May 01 '17 at 16:52
  • If you couldn't tell if I was asking "Why does X do this?" or "How do I do X?", you should have asked for clarification rather than post an answer. Because my question was never "Why does X do this?", I could already infer the behavior based on the output. – cimmanon May 01 '17 at 17:07