I'm trying to parse either of the following two strings with nom 5.0:

"A-Za-z0-9"

or

"A-Z|a-z|0-9"

I've tried the following, but to no avail. Version 1:

pub enum Node {
    Range(Vec<u8>),
}

fn compound_range(input: &[u8]) -> IResult<&[u8], Node> {
    map(
        separated_list(
            alt((tag("|"), tag(""))),
            tuple((take(1usize), tag("-"), take(1usize))),
        ),
        |v: Vec<(&[u8], _, &[u8])>| {
            Node::Range(
                v.iter()
                    .map(|(s, _, e)| (s[0]..=e[0]).collect::<Vec<_>>())
                    .flatten()
                    .collect(),
            )
        },
    )(input)
}

Version 2:

fn compound_range(input: &[u8]) -> IResult<&[u8], Node> {
    alt((
        map(
            separated_list(tag("|"), tuple((take(1usize), tag("-"), take(1usize)))),
            |v: Vec<(&[u8], _, &[u8])>| {
                Node::Range(
                    v.iter()
                        .map(|(s, _, e)| (s[0]..=e[0]).collect::<Vec<_>>())
                        .flatten()
                        .collect(),
                )
            },
        ),
        map(
            many1(tuple((take(1usize), tag("-"), take(1usize)))),
            |v: Vec<(&[u8], _, &[u8])>| {
                Node::Range(
                    v.iter()
                        .map(|(s, _, e)| (s[0]..=e[0]).collect::<Vec<_>>())
                        .flatten()
                        .collect(),
                )
            },
        ),
    ))(input)
}

#[test]
fn parse_compound() {
    println!("{:?}", compound_range(b"A-Za-z0-9"));
    println!("{:?}", compound_range(b"A-Z|a-z|0-9"));
}

I can get either the first or the second string to parse, but never both with the same parser. Is there a way to express this?

Delta_Fore

1 Answer

The problem is that nom always takes the first path that works at all, and a parser doesn't have to consume all of the input to count as working. So what you ideally want to do is branch after the first range ("a-z" or whatever) into one of two possible paths: either | is the separator, or there is no separator.

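This is because nom is a parser combinator library and doesn't work like a regex engine, which can backtrack as far as it needs to in order to find something that works. Once alt has accepted a branch, nom will not go back and try another one if parsing fails later on.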
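To make the "first path wins" behaviour concrete, here is a minimal std-only sketch of it. It does not use nom: `PResult`, `range`, `piped`, `adjacent` and `first_match` are names made up for this illustration, and they only roughly mimic what `separated_list`, `many1` and `alt` do in Version 2 of the question.

```rust
// Stand-in for nom's IResult, with () as a placeholder error type (an assumption).
type PResult<I, O> = Result<(I, O), ()>;

/// Parses one `x-y` range and returns its two endpoint bytes.
fn range(input: &[u8]) -> PResult<&[u8], (u8, u8)> {
    match input {
        [lo, b'-', hi, rest @ ..] => Ok((rest, (*lo, *hi))),
        _ => Err(()),
    }
}

/// Roughly `separated_list(tag("|"), range)`: ranges separated by `|`.
/// It stops, successfully, as soon as the next byte is not a separator.
fn piped(input: &[u8]) -> PResult<&[u8], Vec<(u8, u8)>> {
    let (mut rest, first) = range(input)?;
    let mut out = vec![first];
    while let [b'|', tail @ ..] = rest {
        match range(tail) {
            Ok((next, r)) => {
                out.push(r);
                rest = next;
            }
            Err(()) => break,
        }
    }
    Ok((rest, out))
}

/// Roughly `many1(range)`: one or more ranges with nothing in between.
fn adjacent(input: &[u8]) -> PResult<&[u8], Vec<(u8, u8)>> {
    let (mut rest, first) = range(input)?;
    let mut out = vec![first];
    while let Ok((next, r)) = range(rest) {
        out.push(r);
        rest = next;
    }
    Ok((rest, out))
}

/// Ordered choice, like `alt`: the second alternative is only tried
/// if the first one fails outright, not if it merely leaves input behind.
fn first_match(input: &[u8]) -> PResult<&[u8], Vec<(u8, u8)>> {
    piped(input).or_else(|_| adjacent(input))
}
```

With these stand-ins, `first_match(b"A-Za-z0-9")` succeeds after parsing only `A-Z` and hands back `a-z0-9` as leftover input, so the unseparated alternative is never tried. `first_match(b"A-Z|a-z|0-9")` consumes the whole input.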

Anyway, something like this should work:

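// Assumed imports for nom 5 (the "complete" parsers, not the streaming ones).
use nom::{
    branch::alt,
    bytes::complete::{tag, take},
    combinator::{map, opt},
    multi::{many0, separated_nonempty_list},
    sequence::{pair, preceded, separated_pair},
    IResult,
};

fn compound_range(input: &[u8]) -> IResult<&[u8], Node> {
    // One "x-y" range, returned as its two endpoint bytes.
    let single_range = |input| map(
        separated_pair(take(1usize), tag("-"), take(1usize)),
        |(l, r): (&[u8], &[u8])| (l[0], r[0])
    )(input);

    map(
        // No range at all yields an empty Node::Range instead of an error.
        opt(
            map(
                pair(
                    single_range,
                    // After the first range, pick one style:
                    // "|"-separated ranges, or ranges with no separator.
                    alt((
                        preceded(tag("|"), separated_nonempty_list(
                            tag("|"),
                            single_range,
                        )),
                        many0(single_range)
                    ))
                ),
                |(first, rest)| Node::Range(
                    // Inclusive range, so the upper bound (e.g. 'Z') is kept.
                    std::iter::once(first).chain(rest).flat_map(|(l, r)| l..=r).collect()
                )
            ),
        ),
        |o| o.unwrap_or_else(|| Node::Range(Vec::new()))
    )(input)
}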

Is there a better way? Probably. Given the specific task, it might actually make sense to write this part of your parser by hand. Does it work as written, though? Probably; I haven't tested it.
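For comparison, a hand-written version could look roughly like the sketch below. It is plain Rust without nom, so it returns a `(Node, remaining input)` tuple instead of an `IResult`. That return shape, the `single_range` helper and the derives on `Node` are assumptions made for this illustration. The idea is the same as above: parse the first range, then commit to one separator style based on the byte that follows it.

```rust
#[derive(Debug, PartialEq)]
pub enum Node {
    Range(Vec<u8>),
}

/// Splits one `x-y` range off the front of `input` (hypothetical helper).
fn single_range(input: &[u8]) -> Option<((u8, u8), &[u8])> {
    match input {
        [lo, b'-', hi, rest @ ..] => Some(((*lo, *hi), rest)),
        _ => None,
    }
}

/// Parses `A-Za-z0-9` or `A-Z|a-z|0-9` and returns the expanded bytes
/// together with whatever input was not consumed.
pub fn compound_range(input: &[u8]) -> (Node, &[u8]) {
    let mut bytes = Vec::new();
    let (first, mut rest) = match single_range(input) {
        Some(parsed) => parsed,
        // No range at all: empty result, nothing consumed.
        None => return (Node::Range(bytes), input),
    };
    bytes.extend(first.0..=first.1);

    // The separator style is decided once, right after the first range.
    let piped = rest.first() == Some(&b'|');
    loop {
        let candidate = if piped {
            match rest {
                [b'|', tail @ ..] => tail,
                _ => break,
            }
        } else {
            rest
        };
        match single_range(candidate) {
            Some(((lo, hi), next)) => {
                bytes.extend(lo..=hi);
                rest = next;
            }
            // A dangling `|` is left unconsumed because `rest` is not advanced.
            None => break,
        }
    }
    (Node::Range(bytes), rest)
}
```

Both `compound_range(b"A-Za-z0-9")` and `compound_range(b"A-Z|a-z|0-9")` produce the same 62 bytes with no leftover input. Mixing the two styles stops at the first range that does not match the chosen style and returns the rest to the caller.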

One more thing to keep in mind: this might consume too much if the input that follows the ranges also happens to fit the range pattern.

CodenameLambda
  • That solves the task at hand, but you are probably right that rewriting it manually might be a better option for this – Delta_Fore Aug 16 '19 at 17:27
  • @Delta_Fore Specifically, upon thinking about it, it's probably a good idea to just write your own combinator by hand for allowing different separators. If you need help with that, just tell me. I'll check SO later today again. – CodenameLambda Aug 16 '19 at 17:29