3

So ... I messed up a recording in CSV format:

23,95489,0,20,9888

Due to language settings floating point numbers were written with commas as seperator ... in a comma separated value file ...

Problem is that the file does not have a nice formatting for every float. Some have no point at all and the number of numbers behind the point varies too.

My idea was to build a MegaParsec parser that would try to read every possible floating point formatting, move on and if back track if it finds an error.

Eg for the example above:

  1. read 23,95489 -> good
  2. read 0,20 -> good (so far)
  3. read 9888 -> error (because value is too high for column (checked by guard))
  4. (back tracking to 2.) read 0 -> good again
  5. read 20,9888 -> good
  6. done

I've implemented that as (pseudo code here):

floatP = try pointyFloatP <|> unpointyFloatP

lineP = (,,) <$> floatP <* comma <*> floatP <* comma <*> floatP <* comma

My problem is that apparently the try only works in the 'current' float. There is no backtracking to previous positions. Is this correct?

And if so ... how would I go about implementing further back tracking?

fho
  • 6,787
  • 26
  • 71
  • Please show the actual code you've tried this with. – leftaroundabout Jul 23 '18 at 16:33
  • 1
    If you want to be able to backtrack an arbitrary distance, you are creating a parser with O(2^N) runtime. Each comma in the source represents a choice (decimal point or field separator), and your parser's task is to search all possible choices for one that doesn't cause a parse failure. – amalloy Jul 23 '18 at 17:27

2 Answers2

3

How far does “try” back track?

The parser try p consumes exactly as much input as p if p parses successfully, otherwise it does not consume any input at all. So if you look at that in terms of backtracking, it backtracks to the point where you were when you invoked it.

My problem is that apparently the try only works in the 'current' float. There is no backtracking to previous positions. Is this correct?

Yes, try does not "unconsume" input. All it does is to recover from a failure in the parser you give it without consuming any input. It does not undo the effects of any parsers that you've applied previously, nor does it affect subsequent parsers that you apply after try p succeeded.

And if so ... how would I go about implementing further back tracking?

Basically what you want is to not only know whether pointyFloatP succeeds on the current input, but also whether the rest of your lineP would succeed after successfully pointyFloatP - and if it doesn't you want to backtrack back to before you applied pointyFloatP. So basically you want the parser for the whole remaining line in the try, not just the float parser.

To achieve that you can make floatP take the parser for the remaining line as an argument like this:

floatP restP = try (pointyFloatP <*> restP) <|> unpointyFloatP <*> restP

Note that this kind of backtracking isn't going to be very efficient (but I assume you knew that going in).

sepp2k
  • 363,768
  • 54
  • 674
  • 675
  • I don't even think `try` is enough. Doesn't this application require ambiguity detection? If there's more than one way to parse the file, that's pretty important to know! – dfeuer Jul 23 '18 at 19:48
  • @dfeuer From OP's description of the steps, I didn't get the impression that any ambiguity detection was supposed to take place. But you're right, that that's pretty dangerous. – sepp2k Jul 23 '18 at 20:09
  • @dfeuer I am hoping that the whole thing is unambiguous by adding range constraints. Not sure thou if that is really the case. – fho Jul 24 '18 at 07:52
2

Update: Include a custom monadic parser for more complex rows.

Using the List Monad for Simple Parsing

The list monad makes a better backtracking "parser" than Megaparsec. For example, to parse the cells:

row :: [String]
row = ["23", "95489", "0", "20", "9888"]

into exactly three columns of values satisfying a particular bound (e.g., less than 30), you can generate all possible parses with:

{-# OPTIONS_GHC -Wall #-}

import Control.Monad
import Control.Applicative

rowResults :: [String] -> [[Double]]
rowResults = cols 3
  where cols :: Int -> [String] -> [[Double]]

        cols 0 [] = pure []   -- good, finished on time
        cols 0 _  = empty     -- bad, didn't use all the data

        -- otherwise, parse exactly @n@ columns from cells @xs@
        cols n xs = do
          -- form @d@ from one or two cells
          (d, ys) <- num1 xs <|> num2 xs
          -- only accept @d < 30@
          guard $ d < 30
          ds <- cols (n-1) ys
          return $ d : ds

        -- read number from a single cell
        num1 (x:xs) | ok1 x = pure (read x, xs)
        num1 _ = empty

        -- read number from two cells
        num2 (x:y:zs) | ok1 x && ok2 y = pure (read (x ++ "." ++ y), zs)
        num2 _ = empty

        -- first cell: "0" is okay, but otherwise can't start with "0"
        ok1 "0" = True
        ok1 (c:_) | c /= '0' = True
        ok1 _ = False

        -- second cell: can't end with "0" (or *be* "0")
        ok2 xs = last xs /= '0'

The above list-based parser tries to reduce ambiguity by assuming that if "xxx,yyy" is a number, the "xxx" won't start with zeros (unless it's just "0"), and the "yyy" won't end with a zero (or, for that matter, be a single "0"). If this isn't right, just modify ok1 and ok2 as appropriate.

Applied to row, this gives the single unambiguous parse:

> rowResults row
[[23.95489,0.0,20.9888]]

Applied to an ambiguous row, it gives all parses:

> rowResults ["0", "12", "5", "0", "8601"]
[[0.0,12.5,0.8601],[0.0,12.5,0.8601],[0.12,5.0,0.8601]]

Anyway, I'd suggest using a standard CSV parser to parse your file into a matrix of String cells like so:

dat :: [[String]]
dat = [ ["23", "95489", "0", "20", "9888"]
      , ["0", "12", "5", "0", "8601"]
      , ["23", "2611", "2", "233", "14", "422"]
      ]

and then use rowResults above get the row numbers of rows that were ambiguous:

> map fst . filter ((>1) . snd) . zip [1..] . map (length . rowResults) $ dat
[2]
>

or unparsable:

> map fst . filter ((==0) . snd) . zip [1..] . map (length . rowResults) $ dat
[]
>

Assuming there are no unparsable rows, you can regenerate one possible fixed file, even if some rows are ambiguous, but just grabbing the first successful parse for each row:

> putStr $ unlines . map (intercalate "," . map show . head . rowResults) $ dat
23.95489,0.0,20.9888
0.0,12.5,0.8601
23.2611,2.233,14.422
>

Using a Custom Monad based on the List Monad for More Complex Parsing

For more complex parsing, for example if you wanted to parse a row like:

type Stream = [String]
row0 :: Stream
row0 = ["Apple", "15", "1", "5016", "2", "5", "3", "1801", "11/13/2018", "X101"]

with a mixture of strings and numbers, it's actually not that difficult to write a monadic parser, based on the list monad, that generates all possible parses.

The key idea is to define a parser as a function that takes a stream and generates a list of possible parses, with each possible parse represented as a tuple of the object successfully parsed from the beginning of the stream paired with the remainder of the stream. Wrapped in a newtype, our parallel parser would look like:

newtype PParser a = PParser (Stream -> [(a, Stream)]) deriving (Functor)

Note the similarity to the type ReadS from Text.ParserCombinators.ReadP, which is also technically an "all possible parses" parser (though you usually only expect one, unambiguous parse back from a reads call):

type ReadS a = String -> [(a, String)]

Anyway, we can define a Monad instance for PParser like so:

instance Applicative PParser where
  pure x = PParser (\s -> [(x, s)])
  (<*>) = ap
instance Monad PParser where
  PParser p >>= f = PParser $ \s1 -> do  -- in list monad
    (x, s2) <- p s1
    let PParser q = f x
    (y, s3) <- q s2
    return (y, s3)

There's nothing too tricky here: pure x returns a single possible parse, namely the result x with an unchanged stream s, while p >>= f applies the first parser p to generate a list of possible parses, takes them one by one within the list monad to calculate the next parser q to use (which, as per usual for a monadic operation, can depend on the result of the first parse), and generates a list of possible final parses that are returned.

The Alternative and MonadPlus instances are pretty straightforward -- they just lift emptiness and alternation from the list monad:

instance Alternative PParser where
  empty = PParser (const empty)
  PParser p <|> PParser q = PParser $ \s -> p s <|> q s
instance MonadPlus PParser where

To run our parser, we have:

parse :: PParser a -> Stream -> [a]
parse (PParser p) s = map fst (p s)

and now we can introduce primitives:

-- read a token as-is
token :: PParser String
token = PParser $ \s -> case s of
  (x:xs) -> pure (x, xs)
  _ -> empty

-- require an end of stream
eof :: PParser ()
eof = PParser $ \s -> case s of
  [] -> pure ((), s)
  _ -> empty

and combinators:

-- combinator to convert a String to any readable type
convert :: (Read a) => PParser String -> PParser a
convert (PParser p) = PParser $ \s1 -> do
  (x, s2) <- p s1     -- for each possible String
  (y, "") <- reads x  -- get each possible full read
                      -- (normally only one)
  return (y, s2)

and parsers for various "terms" in our CSV row:

-- read a string from a single cell
str :: PParser String
str = token

-- read an integer (any size) from a single cell
int :: PParser Int
int = convert (mfilter ok1 token)

-- read a double from one or two cells
dbl :: PParser Double
dbl = dbl1 <|> dbl2
  where dbl1 = convert (mfilter ok1 token)
        dbl2 = convert $ do
          t1 <- mfilter ok1 token
          t2 <- mfilter ok2 token
          return $ t1 ++ "." ++ t2

-- read a double that's < 30
dbl30 :: PParser Double
dbl30 = do
  x <- dbl
  guard $ x < 30
  return x

-- rules for first cell of numbers:
-- "0" is okay, but otherwise can't start with "0"
ok1 :: String -> Bool
ok1 "0" = True
ok1 (c:_) | c /= '0' = True
ok1 _ = False

-- rules for second cell of numbers:
-- can't be "0" or end in "0"
ok2 :: String -> Bool
ok2 xs = last xs /= '0'

Then, for a particular row schema, we can write a row parser as we normally would with a monadic parser:

-- a row
data Row = Row String Int Double Double Double
               Int String String deriving (Show)
rowResults :: PParser Row
rowResults = Row <$> str <*> int <*> dbl30 <*> dbl30 <*> dbl30
                 <*> int <*> str <*> str <* eof

and get all possible parses:

> parse rowResults row0
[Row "Apple" 15 1.5016 2.0 5.3 1801 "11/13/2018" "X101"
,Row "Apple" 15 1.5016 2.5 3.0 1801 "11/13/2018" "X101"]
>

The full program is:

{-# LANGUAGE DeriveFunctor #-}
{-# OPTIONS_GHC -Wall #-}

import Control.Monad
import Control.Applicative

type Stream = [String]

newtype PParser a = PParser (Stream -> [(a, Stream)]) deriving (Functor)
instance Applicative PParser where
  pure x = PParser (\s -> [(x, s)])
  (<*>) = ap
instance Monad PParser where
  PParser p >>= f = PParser $ \s1 -> do  -- in list monad
    (x, s2) <- p s1
    let PParser q = f x
    (y, s3) <- q s2
    return (y, s3)
instance Alternative PParser where
  empty = PParser (const empty)
  PParser p <|> PParser q = PParser $ \s -> p s <|> q s
instance MonadPlus PParser where

parse :: PParser a -> Stream -> [a]
parse (PParser p) s = map fst (p s)

-- read a token as-is
token :: PParser String
token = PParser $ \s -> case s of
  (x:xs) -> pure (x, xs)
  _ -> empty

-- require an end of stream
eof :: PParser ()
eof = PParser $ \s -> case s of
  [] -> pure ((), s)
  _ -> empty

-- combinator to convert a String to any readable type
convert :: (Read a) => PParser String -> PParser a
convert (PParser p) = PParser $ \s1 -> do
  (x, s2) <- p s1     -- for each possible String
  (y, "") <- reads x  -- get each possible full read
                      -- (normally only one)
  return (y, s2)

-- read a string from a single cell
str :: PParser String
str = token

-- read an integer (any size) from a single cell
int :: PParser Int
int = convert (mfilter ok1 token)

-- read a double from one or two cells
dbl :: PParser Double
dbl = dbl1 <|> dbl2
  where dbl1 = convert (mfilter ok1 token)
        dbl2 = convert $ do
          t1 <- mfilter ok1 token
          t2 <- mfilter ok2 token
          return $ t1 ++ "." ++ t2

-- read a double that's < 30
dbl30 :: PParser Double
dbl30 = do
  x <- dbl
  guard $ x < 30
  return x

-- rules for first cell of numbers:
-- "0" is okay, but otherwise can't start with "0"
ok1 :: String -> Bool
ok1 "0" = True
ok1 (c:_) | c /= '0' = True
ok1 _ = False

-- rules for second cell of numbers:
-- can't be "0" or end in "0"
ok2 :: String -> Bool
ok2 xs = last xs /= '0'

-- a row
data Row = Row String Int Double Double Double
               Int String String deriving (Show)
rowResults :: PParser Row
rowResults = Row <$> str <*> int <*> dbl30 <*> dbl30 <*> dbl30
                 <*> int <*> str <*> str <* eof

row0 :: Stream
row0 = ["Apple", "15", "1", "5016", "2", "5", "3", "1801", "11/13/2018", "X101"]

main = print $ parse rowResults row0

Off-the-shelf Solutions

I find it a little surprising I can't find an existing parser library out there that provides this kind of "all possible parses" parser. The stuff in Text.ParserCombinators.ReadP takes the right approach, but it assumes that you're parsing characters from a String rather than arbitrary tokens from some other stream (in our case, Strings from a [String]).

Maybe someone else can point out an off-the-shelf solution that would save you from having to role your own parser type, instances, and primitives.

K. A. Buhr
  • 45,621
  • 3
  • 45
  • 71
  • I guess I should have noted that the values are not only doubles. But yes I am looking for a "try all combinations" method. – fho Jul 24 '18 at 07:52
  • I added an example of a custom monadic parser for "all possible parses". This might help get you started. – K. A. Buhr Jul 24 '18 at 16:29