I want to parse a string of ASCII characters delimited by single quotes, where a single quote inside the string is escaped by doubling it ('').

'string value contained between single quotes -> '' and so on...'

which should result in:

string value contained between single quotes -> ' and so on...

use nom::{
    bytes::complete::{tag, take_while},
    error::{ErrorKind, ParseError},
    sequence::delimited,
    IResult,
};

fn main() {
    let res = string_value::<(&str, ErrorKind)>("'abc''def'");

    assert_eq!(res, Ok(("", "abc\'def")));
}

pub fn is_ascii_char(chr: char) -> bool {
    chr.is_ascii()
}

fn string_value<'a, E: ParseError<&'a str>>(i: &'a str) -> IResult<&'a str, &'a str, E> {
    delimited(tag("'"), take_while(is_ascii_char), tag("'"))(i)
}

How can I detect escaped quotes and not the end of the string?

mottosson

2 Answers

This is pretty tricky, but the following works:

//# nom = "5.0.1"
use nom::{
    bytes::complete::{escaped_transform, tag},
    character::complete::none_of,
    combinator::{recognize, map_parser},
    multi::{many0, separated_list},
    sequence::delimited,
    IResult,
};

fn main() {
    let (_, res) = parse_quoted("'abc''def'").unwrap();
    assert_eq!(res, "abc'def");
    let (_, res) = parse_quoted("'xy@$%!z'").unwrap();
    assert_eq!(res, "xy@$%!z");
    let (_, res) = parse_quoted("'single quotes -> '' and so on...'").unwrap();
    assert_eq!(res, "single quotes -> ' and so on...");
}

fn parse_quoted(input: &str) -> IResult<&str, String> {
    let seq = recognize(separated_list(tag("''"), many0(none_of("'"))));
    let unquote = escaped_transform(none_of("'"), '\'', tag("'"));
    let res = delimited(tag("'"), map_parser(seq, unquote), tag("'"))(input)?;

    Ok(res)
}

Some explanations:

  1. the parser seq recognizes any sequence that alternates between doubled single quotes ('') and runs of any other characters;
  2. unquote transforms each doubled single quote ('') into a single one (');
  3. map_parser then feeds the output of seq into unquote to produce the desired result.

Be aware that because escaped_transform builds a new string, the parse result is a String rather than a &str, which means an extra allocation.
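If the allocation is a concern, one possible way to limit it is to unescape in a separate step that only allocates when the text actually contains a doubled quote. The following is a minimal sketch using only the standard library, not a nom API; the name `unescape_quotes` is hypothetical and it assumes the input is the raw text between the outer quotes.

```rust
use std::borrow::Cow;

/// Hypothetical helper: turns each doubled quote `''` into a single `'`.
/// Borrows the input when there is nothing to unescape, so only strings
/// that actually contain an escape pay for an allocation.
fn unescape_quotes(body: &str) -> Cow<'_, str> {
    if body.contains("''") {
        Cow::Owned(body.replace("''", "'"))
    } else {
        Cow::Borrowed(body)
    }
}

fn main() {
    // No escape present: the original slice is returned as-is.
    let plain = unescape_quotes("xy@$%!z");
    assert!(matches!(plain, Cow::Borrowed(_)));

    // Escape present: a new String is built.
    let escaped = unescape_quotes("abc''def");
    assert_eq!(escaped, "abc'def");
    assert!(matches!(escaped, Cow::Owned(_)));
}
```

The trade-off is one extra pass over the text (`contains`) in exchange for skipping the allocation in the common case where no quote is escaped.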

edwardw
  • Thank you! But the allocation is a bit of a deal breaker for me since I will be parsing large amounts of text and want it to be as performant as possible! =) Isn't it possible to use the `bytes::complete::escaped` parser somehow? This `escaped(alphanumeric1, '\'', char('\''))(i)` does not work... =( – mottosson Oct 24 '19 at 06:25
  • @mottosson `escaped` and `escaped_transform` have different semantics: the former only checks whether the input conforms to the given escape rule. Replacing doubled quotes with a single one requires rewriting the inside of a potentially long string, so I don't see how to avoid the allocation, unless you can relax that requirement somehow. – edwardw Oct 24 '19 at 06:39
  • Hmm... you are correct. Maybe the transform can be made in a separate step in a later stage to circumvent the initial performance implications. So can I use `escaped` then in a similar way as my previous comment to make the parser understand that the string does not end with the first single quote it finds if the next char is also a single quote? – mottosson Oct 24 '19 at 08:16
  • @mottosson then you can get rid of `unquote` and `map_parser` altogether. `let res = delimited(tag("'"), seq, tag("'"))(input)?;` would be sufficient, and of course the return type is going to be `IResult<&str, &str>`. – edwardw Oct 24 '19 at 10:15
  • Just another question about `many0`. It says in the docs that many0 `Repeats the embedded parser until it fails and returns the results in a Vec`. Does this mean it will allocate memory even if it's wrapped in `recognize`? – mottosson Oct 29 '19 at 09:17
  • @mottosson yes, it should. But even so, that'll be `Vec<&str>` and doesn't consume too much memory. Plus, here it is wrapped in `recognize` so the result is never bound to anything or returned, it should be discarded immediately. – edwardw Oct 29 '19 at 09:51
  • @mottosson that being said, you can experiment with other combinators such as `take_while` here to see if that works even better. – edwardw Oct 29 '19 at 10:24
  • Great. Thanks for clarifying! =) – mottosson Oct 29 '19 at 11:33
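For comparison, the "find the end without unescaping" idea from the comments above can also be written as a plain loop that never builds a Vec. This is only a sketch with the standard library, not a nom combinator; `scan_quoted` is a hypothetical name, and the `(rest, body)` order simply mirrors nom's `(remaining, output)` convention.

```rust
/// Hypothetical helper: if `input` starts with a single-quoted string,
/// returns `(rest, body)`, where `body` is the raw text between the
/// delimiters (escapes still doubled) and `rest` is whatever follows the
/// closing quote. Both are slices of `input`; nothing is allocated.
fn scan_quoted(input: &str) -> Option<(&str, &str)> {
    let after_open = input.strip_prefix('\'')?;
    let bytes = after_open.as_bytes();
    let mut i = 0;
    while i < bytes.len() {
        if bytes[i] == b'\'' {
            if bytes.get(i + 1) == Some(&b'\'') {
                i += 2; // doubled quote: an escape, keep scanning
                continue;
            }
            // Lone quote: this is the closing delimiter.
            return Some((&after_open[i + 1..], &after_open[..i]));
        }
        i += 1;
    }
    None // ran out of input before finding a closing quote
}

fn main() {
    // The doubled quote does not end the string; trailing input is kept.
    assert_eq!(scan_quoted("'abc''def' tail"), Some((" tail", "abc''def")));
    // A missing closing quote is reported as a failure.
    assert_eq!(scan_quoted("'unterminated"), None);
}
```

Scanning bytes is safe here because `'` is ASCII and can never occur inside a multi-byte UTF-8 sequence, so both slice boundaries fall on character boundaries.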
I'm learning nom, and below is my attempt.

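use nom::{
    bytes::complete::{tag, take},
    sequence::delimited,
    IResult,
};

fn parser(input: &str) -> IResult<&str, &str> {
    // Assumes the whole input is one quoted string, so the body is
    // everything except the first and last character.
    let len = input.chars().count().saturating_sub(2);
    delimited(tag("'"), take(len), tag("'"))(input)
}

fn main() {
    let a = r###"'string value contained between single quotes -> '' and so on...'"###;

    let (remaining, matched) = parser(a).unwrap_or_default();

    // Unescape in a separate step: '' becomes '.
    let matched = matched.replace("''", "'");
    println!("remaining: {:#?}", remaining);
    println!("matched: {:#?}", matched);
}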

It prints this result:

remaining: ""
matched: "string value contained between single quotes -> ' and so on..."

My testing is based on nom 6.2.1.

Just a learner
  • Thanks, but this doesn't work in the real use case where the single quote string is part of a larger input string and doesn't end with '. There are more characters to parse after the last '. This could've been clearer in my question. – mottosson Aug 16 '21 at 11:08