How can I parse this program correctly using Rust's parser library nom

Question

I want to parse a program represented by this BNF:

<log_and_expression> := <and_expression> (<space>* "&&" <space>* <and_expression>)*
<and_expression>     := <unary_expression> (<space>* "&" <space>* <unary_expression>)*
<unary_expression>   := ("&" <space>*)* <identifier>
<identifier>         := "a" | "b" | "c" | ...
<space>              := " " | "\t" | "\n" | "\r"

This BNF parses small program. log_and means Logical And.

and I wrote Rust program using parser library nom:

use nom::{
    branch::{alt, permutation},
    bytes::complete::{escaped_transform, is_not, tag, take_while_m_n},
    character::complete::{
        char, digit1, hex_digit1, line_ending, multispace0, multispace1, none_of, space0, space1,
    },
    combinator::{all_consuming, map, opt, value},
    error::VerboseError,
    multi::{many0, many1},
    number::complete::double,
    sequence::{delimited, tuple},
    IResult,
};

#[derive(Debug, PartialEq, Clone)]
pub struct LogAndExpression {
    pub and_expression: AndExpression,
    pub tail: Vec<AndExpression>,
}
#[derive(Debug, PartialEq, Clone)]
pub struct AndExpression {
    pub unary_expression: UnaryExpression,
    pub tail: Vec<UnaryExpression>,
}
#[derive(Debug, PartialEq, Clone)]
pub struct UnaryExpression {
    pub unary_operators: Vec<String>,
    pub identifier: String,
}

pub fn log_and_expression(s: &str) -> IResult<&str, LogAndExpression> {
    map(
        tuple((
            and_expression,
            many0(
                map(
                    tuple((
                        multispace0,
                        tag("&&"),
                        multispace0,
                        and_expression,
                    )),
                    |(_, _, _, expression)| expression,
                )
            ),
        )),
        |(expression, vec)| LogAndExpression {
            and_expression: expression,
            tail: vec,
        },
    )(s)
}

pub fn and_expression(s: &str) -> IResult<&str, AndExpression> {
    map(
        tuple((
            unary_expression,
            many0(
                map(
                    tuple((
                        multispace0,
                        tag("&"),
                        multispace0,
                        unary_expression,
                    )),
                    |(_, _, _, expression)| expression,
                )
            ),
        )),
        |(expression, vec)| AndExpression {
            unary_expression: expression,
            tail: vec,
        },
    )(s)
}

pub fn unary_expression(s: &str) -> IResult<&str, UnaryExpression> {
    map(
        tuple((
            many0(
                map(
                    tuple((
                        tag("&"),
                        multispace0
                    )),
                    |(opr, _): (&str, &str)| opr.to_owned(),
                )
            ),
            is_not(" &"),
        )),
        |(unary_operators, identifier)| UnaryExpression {
            unary_operators: unary_operators,
            identifier: identifier.to_owned(),
        },
    )(s)
}

fn main() {
    println!("{:?}", log_and_expression("a && b"));
}

This code outputs the result of parse a && b. log_and_expression parses program.

The result I want is a LogAnd b. But result is:

Ok(("", LogAndExpression { and_expression: AndExpression { unary_expression: UnaryExpression { unary_operators: [], identifier: "a" }, tail: [UnaryExpression { unary_operators: ["&"], identifier: "b" }] }, tail: [] }))

this means a And &b.

How can I parse this program correctly using nom?

Looks to me like a precedence issue, have tried to change the order of your `log_and_expression` rule with your `and_expression` rule? — Cryptjar, Jun 15 '21 at 23:01
How can I change the order? Do you mean to simply change the order of function application? — Kaoru, Jun 16 '21 at 02:01

transistor · Accepted Answer · 2021-06-16T03:23:21.893

The BNF rules as listed are ambiguous, since there's no easy way to distinguish between the correct way to parse a&&b. It could be parsed as a && b or a & &b, and since the latter option is the first thing the nom parser version will attempt to parse, it will return that option first.

In the log_and_expression function, the parser will call the and_expression function which will correctly match the a identifier followed by the first &, and then it will call unary_expression function to parse the remainder, which in this case can be correctly parsed as the unary & b expression, since that rule explicitly allows spaces between & and b. This gives you the output you get (which isn't what you want)

One way to fix this is to rewrite the parser so that it matches the smallest unary expression, followed by many0 of either a && follow by another unary expression, or & followed by a unary expression. Something like

<expr> = <unary_expression> (<space>* <operator_pair> <space>*)*
<operator_pair> = ("&&" <space>* <unary_expression>) | ("&" <space>* <unary_expression>)

If it tries to match && first, then that will be the favored way of parsing a&&b, and the alternative form can only be parsed if it's written with a space as in a& &b. This might require building a list of pairs of "operator" and "unary expression", which would then need to be reassembled into the correct tree-like format you're looking for. An advantage of doing this as a 2 stage process is that you can then correctly apply order of operations when supporting multiple operators with different precedence.

There's a simple example of this kind of approach with nom here

The actual program requires a more complex analysis. I want to parse `a & b && c | d` as `(a & b) && (c | d)`. can your opinion parse this program to appropriate structure? — Kaoru, Jun 16 '21 at 14:15
Yes, I've implemented this myself for an ML style syntax [here](https://github.com/transistorfet/molten/blob/master/src/parser.rs#L508). The `AST::fold_op` function takes an intermediate format parsed by nom and assembles a proper tree structure using the [shunting yard algorithm](https://en.wikipedia.org/wiki/Shunting-yard_algorithm) to take into account the order of operations. Without a two stage approach, this can be very hard for a parser to reliably and correctly parse on its own — transistor, Jun 16 '21 at 21:02

How can I parse this program correctly using Rust's parser library nom

1 Answers1