
So I'm trying to port a parser written in JavaScript with parsimmon. The parser takes a string source as input and produces tokens. A token is simply an object with a type and a pair of offset values marking its location in the source:

enum TokenType {
    Foo,
    Bar
}

struct Token {
    token_type: TokenType,
    start_offset: usize,
    end_offset: usize,
}

Parsimmon provides a convenient node operator that wraps a parser or combinator and turns its output into a node, much like the Token struct described above. I'd like to recreate this behavior with nom; here is what I've got:

struct LexInput<'a> {
    source: &'a str,
    location: usize,
}

fn token<'a>(
    parser: impl Fn(&str) -> IResult<&'a str, &str>,
    token_type: TokenType,
) -> impl Fn(&LexInput) -> IResult<LexInput<'a>, Token> {
    move |input: &LexInput| {
        let start_offset = input.location;
        let (remaining_source, output) = parser(input.source)?;
        let end_offset = start_offset + output.len();
        let token = Token::new(token_type, start_offset, end_offset);
        let remaining = LexInput::new(remaining_source, end_offset);

        Ok((remaining, token))
    }
}

I'm still quite new to Rust, so it took me a while to get here, but the code looks promising. The problem is that I don't know how to use it. Instinctively I wrote:

let (remaining, token) = token(tag("|"), TokenType::Bar)(&LexInput::new("|foo", 0)).unwrap();
assert_eq!(remaining.source, "foo");

But of course that does not work. The error message is confusing as usual:

expected associated type `<impl Fn(&str) -> Result<(&str, &str), nom::Err<nom::error::Error<&str>>> as FnOnce<(&str,)>>::Output`
   found associated type `<impl Fn(&str) -> Result<(&str, &str), nom::Err<nom::error::Error<&str>>> as FnOnce<(&str,)>>::Output`

I mean, the "expected" and "found" types look exactly the same to me.

Could someone help me to figure out what's wrong here?

hillin
  • I'd expect to see the usual closure syntax here, like `fn(&str) -> IResult<...>` instead of the `impl Fn` form. Are you sure that's the right way to do it? – tadman Jul 19 '22 at 08:04
  • What does `tag` return? – tadman Jul 19 '22 at 08:05
  • @tadman here is the signature of `tag`: `pub fn tag<T, Input, Error: ParseError<Input>>(tag: T) -> impl Fn(Input) -> IResult<Input, Input, Error> where Input: InputTake + Compare<T>, T: InputLength + Clone`. That's why I use `impl Fn`. https://docs.rs/nom/latest/nom/bytes/complete/fn.tag.html – hillin Jul 19 '22 at 08:08
  • Can you post an executable example? https://stackoverflow.com/help/minimal-reproducible-example – Dogbert Jul 19 '22 at 08:09
  • @Dogbert I was trying to but it seems rust playground does not support nom anymore. I'll try to get a git repo together. – hillin Jul 19 '22 at 08:11
  • There's a good playground [here](https://www.rustexplorer.com/) that supports top 10k crates. Anyway you don't need a playground, but something we can run. – Chayim Friedman Jul 19 '22 at 08:22
  • I have a sneaking suspicion this is related to lifetimes that are omitted in the error response. `&'a str, &str` vs. `&str, &str`, for example. Given an input of `"|"` you're going to get `&'static str` for the first, presumably? I'd expect the input and output lifetimes to be the same, so potentially `fn(&'a str) -> ...<&'a str, &str>`. – tadman Jul 19 '22 at 08:35
  • @hillin don't link a git repository. What you do is: 1) create a new project with `cargo new`. 2) write your code in `main.rs` that reproduces your problem. No other files, just `main.rs`. If you need dependencies, like `nom`, it's of course fine to add them to `Cargo.toml`. Write your `main.rs` as short as possible to reproduce your problem, and try to omit everything that isn't related to it. 3) Copy and paste your `main.rs` code here. Bonus points if you also copy+paste the lines you modified in `Cargo.toml`. – Finomnis Jul 19 '22 at 09:53

1 Answer


Is this kind of what you were going for?

use nom::{bytes::complete::tag, IResult};

#[derive(Debug)]
pub enum TokenType {
    Foo,
    Bar,
}

#[derive(Debug)]
pub struct Token {
    pub token_type: TokenType,
    pub start_offset: usize,
    pub end_offset: usize,
}

#[derive(Debug)]
pub struct LexInput<'a> {
    source: &'a str,
    location: usize,
}

impl<'a> LexInput<'a> {
    fn new(source: &'a str, location: usize) -> Self {
        Self { source, location }
    }
}

impl Token {
    fn new(token_type: TokenType, start_offset: usize, end_offset: usize) -> Self {
        Self {
            token_type,
            start_offset,
            end_offset,
        }
    }
}

fn token<'a>(
    parser: impl Fn(&'a str) -> IResult<&'a str, &str>,
    token_type: TokenType,
) -> impl FnOnce(LexInput<'a>) -> IResult<LexInput<'a>, Token> {
    move |input: LexInput| {
        let start_offset = input.location;
        let (remaining_source, output) =
            parser(input.source).map_err(|e| e.map_input(|_| input))?;
        let end_offset = start_offset + output.len();
        let token = Token::new(token_type, start_offset, end_offset);
        let remaining = LexInput::new(remaining_source, end_offset);

        Ok((remaining, token))
    }
}

fn main() {
    let source = "|foo".to_string();
    let (remaining, token) = token(tag("|"), TokenType::Bar)(LexInput::new(&source, 0)).unwrap();
    println!("remaining: {:?}", remaining);
    println!("token: {:?}", token);
}
remaining: LexInput { source: "foo", location: 1 }
token: Token { token_type: Bar, start_offset: 0, end_offset: 1 }

Your main mistakes were lifetime-related. Everywhere you don't annotate a lifetime, a default lifetime is inferred, and that default does not fulfill `'a`.

fn token<'a>(
    // The result can't be `'a` if it refers to the input `&str`, the input also has to be `'a`.
    parser: impl Fn(&str) -> IResult<&'a str, &str>,
    token_type: TokenType,
// Same here, `&LexInput` needs to be `'a`. But as it has a lifetime attached, just use that one instead: `LexInput<'a>`.
) -> impl Fn(&LexInput) -> IResult<LexInput<'a>, Token> {
    // Same here, although here the anonymous lifetime is sufficient to figure it out
    move |input: &LexInput| {
        let start_offset = input.location;
        // Here, an error conversion is missing, because the error carries the
        // input and therefore can't just be raised directly; `parser` has `&str`
        // as input, while `token` has `LexInput` as input. Luckily, the
        // `map_input` method exists.
        let (remaining_source, output) = parser(input.source)?;
        let end_offset = start_offset + output.len();
        let token = Token::new(token_type, start_offset, end_offset);
        let remaining = LexInput::new(remaining_source, end_offset);

        Ok((remaining, token))
    }
}

Further remarks

There is already the nom_locate crate that does exactly what you are attempting to do here.

The big advantage of the nom_locate crate is that the LocatedSpan type can directly be used by nom's parsers. No need to convert back and forth between your type and &str. This makes the code a lot simpler.

use nom::{bytes::complete::tag, IResult};

use nom_locate::LocatedSpan;

type Span<'a> = LocatedSpan<&'a str>;

#[derive(Debug)]
pub enum TokenType {
    Foo,
    Bar,
}

#[derive(Debug)]
pub struct Token {
    pub token_type: TokenType,
    pub start_offset: usize,
    pub end_offset: usize,
}

impl Token {
    fn new(token_type: TokenType, start_offset: usize, end_offset: usize) -> Self {
        Self {
            token_type,
            start_offset,
            end_offset,
        }
    }
}

fn token<'a>(
    parser: impl Fn(Span<'a>) -> IResult<Span<'a>, Span<'a>>,
    token_type: TokenType,
) -> impl FnOnce(Span<'a>) -> IResult<Span<'a>, Token> {
    move |input: Span| {
        let start_offset = input.location_offset();
        let (remaining, _) = parser(input)?;
        let end_offset = remaining.location_offset();
        let token = Token::new(token_type, start_offset, end_offset);
        Ok((remaining, token))
    }
}

fn main() {
    let source = "|foo".to_string();
    let (remaining, token) = token(tag("|"), TokenType::Bar)(Span::new(&source)).unwrap();
    println!("remaining: {:?}", remaining);
    println!("token: {:?}", token);
}
remaining: LocatedSpan { offset: 1, line: 1, fragment: "foo", extra: () }
token: Token { token_type: Bar, start_offset: 0, end_offset: 1 }

With the help of nom::combinator::map and a little bit of restructuring, you can reduce it down even further:

use nom::{bytes::complete::tag, combinator::map, IResult};

use nom_locate::LocatedSpan;

type Span<'a> = LocatedSpan<&'a str>;

#[derive(Debug, Clone)]
pub enum TokenType {
    Foo,
    Bar,
}

#[derive(Debug)]
pub struct Token {
    pub token_type: TokenType,
    pub start_offset: usize,
    pub end_offset: usize,
}

impl Token {
    fn new(token_type: TokenType, start_offset: usize, end_offset: usize) -> Self {
        Self {
            token_type,
            start_offset,
            end_offset,
        }
    }
}

fn token<'a>(
    parser: impl Fn(Span<'a>) -> IResult<Span<'a>, Span<'a>>,
    token_type: TokenType,
) -> impl FnMut(Span<'a>) -> IResult<Span<'a>, Token> {
    map(parser, move |matched| {
        Token::new(
            token_type.clone(),
            matched.location_offset(),
            matched.location_offset() + matched.len(),
        )
    })
}

fn main() {
    let source = "|foo".to_string();
    let (remaining, token) = token(tag("|"), TokenType::Bar)(Span::new(&source)).unwrap();
    println!("remaining: {:?}", remaining);
    println!("token: {:?}", token);
}
remaining: LocatedSpan { offset: 1, line: 1, fragment: "foo", extra: () }
token: Token { token_type: Bar, start_offset: 0, end_offset: 1 }
Finomnis