Parser: Enforcing unicity when order is not known

Question

Using Attoparsec, I am trying to match strings containing exactly 1 'x', 1 'y', and 1 'z', and any number of 'a', 'b', or 'c', without any constraint on the order of each char.

For instance, "abbbzacyaaaxcba" and "abbbzacxaaaycba" should be a match but the following should not :

"abbbzacyaaacba" (reason: no 'x')
"abbbzacyaaaxcbxa" (reason: duplicate 'x')

The best I've been able to do so far is this:

import qualified Data.Attoparsec.ByteString.Char8 as A8
import qualified Data.ByteString.Char8 as B8 (pack)

p ch = do
    abcs <- A8.many' (A8.choice [A8.char 'a', A8.char 'b', A8.char 'c'])
    x    <- A8.char ch
    return $ concat [[x],abcs]

parse = do
    xyz1 <- A8.choice [p 'x', p 'y', p 'z']
    xyz2 <- A8.choice [p 'x', p 'y', p 'z']
    xyz3 <- A8.choice [p 'x', p 'y', p 'z']
    final <- A8.manyTill (A8.choice [A8.char 'a', A8.char 'b', A8.char 'c']) $ A8.char '\n'
    return (xyz1, xyz2, xyz3, final)

(arbitrarily, I chose to stop the parsing with '\n', but that's just to pick a simple example).

then trying in ghci:

Prelude> A8.parseTest parse $ B8.pack "abbbzacyaaaxcba\n"
Done "" ("zabbb","yac","xaaa", "cba")
Prelude> A8.parseTest parse $ B8.pack "abbbzacyaaacba\n"
Fail "aaacba\n" [] "Failed reading: empty"
Prelude> A8.parseTest parse $ B8.pack "abbbzacyaaaxcbxa\n"
Fail "xa\n" [] "Failed reading: empty"

But it looks very clunky, and it does not scale easily to a list of unique characters (eg I am given a givenchars :: [Char] list without duplicates and I want to match any string made of all the givenchars and of any 'a','b','c's in between, in any order).

Is there a better, more elegant, and scalable way to do this?

PS: I am not looking for a regex solution as it would not apply to my real-life problem. I need to use a parser.

Using a parser is a *solution* to a problem; what is the *actual problem*? — chepner, Sep 14 '16 at 19:03

score 2 · Accepted Answer · answered Sep 14 '16 at 16:11

There are a couple of issues with your code:

Firstly, it doesn't enforce exactly one x, y and z in the string. For instance, it will accept xxx\n

Secondly, it's very inefficient. Consider how xyz1 is parsed when given the string aaaaaaaaaz:

We first try p 'x'. This starts by parsing all of the a characters and then finds a z. Since z is not x, the entire p 'x' parser fails.
Then we try p 'y'. This re-parses all of the a characters and then finds the z again. Since z is not y the entire p 'y' parser fails.
On the third try we succeed, only after re-parsing all of the a characters for the third time.

It would be better to write something like:

parse = do
   s <- A8.takeWhile (\x -> elem x "abcxyz")
   let xs = count 'x' s
       ys = count 'y' s
       zs = count 'z' s
   guard $ xs == 1 && ys == 1 && zs == 1
   return s

The function count is from Data.ByteString.Char8 and guard is from Control.Monad

Why not use something at least vaguely like `MaybeT (StateT (Bool, Bool, Bool) Parser)` to interleave counting with parsing and allow an early failure if too many are seen? — dfeuer, Sep 14 '16 at 16:25
You can do that easily with parsec, but as I recall implementing stateful parses is not that easy with attoparsec. And even if you did I'm sure this approach would be more performant. Of course, performance for any implementation will depend on what kind of bad inputs you expect. — ErikR, Sep 14 '16 at 16:32

Parser: Enforcing unicity when order is not known

1 Answers1