
So I'm trying to port a parser written in JavaScript with parsimmon. The parser takes a string source as input and produces tokens. A token is simply an object with a type and a pair of offset values marking its location in the source:

enum TokenType {
    Foo,
    Bar
}

struct Token {
    token_type: TokenType,
    start_offset: usize,
    end_offset: usize,
}

Parsimmon provides a convenient node operator that wraps a parser or combinator and turns its output into a node, much like the Token struct described above. I'd like to recreate this behavior with nom; here is what I've got:

struct LexInput<'a> {
    source: &'a str,
    location: usize,
}

fn token<'a>(
    parser: impl Fn(&str) -> IResult<&'a str, &str>,
    token_type: TokenType,
) -> impl Fn(&LexInput) -> IResult<LexInput<'a>, Token> {
    move |input: &LexInput| {
        let start_offset = input.location;
        let (remaining_source, output) = parser(input.source)?;
        let end_offset = start_offset + output.len();
        let token = Token::new(token_type, start_offset, end_offset);
        let remaining = LexInput::new(remaining_source, end_offset);

        Ok((remaining, token))
    }
}

I'm still quite new to Rust, so it took me a while to get here, but the code looks promising. The problem is that I don't know how to use it. Instinctively I wrote:

let (remaining, token) = token(tag("|"), TokenType::Bar)(&LexInput::new("|foo", 0)).unwrap();
assert_eq!(remaining.source, "foo");

But of course that does not work. The error message is confusing as usual:

expected associated type `<impl Fn(&str) -> Result<(&str, &str), nom::Err<nom::error::Error<&str>>> as FnOnce<(&str,)>>::Output`
   found associated type `<impl Fn(&str) -> Result<(&str, &str), nom::Err<nom::error::Error<&str>>> as FnOnce<(&str,)>>::Output`

I mean, the "expected" and "found" types look exactly the same to me.

Could someone help me to figure out what's wrong here?

hillin
  • I'd expect to see the usual closure syntax here, like `fn(&str) -> IResult<...>` instead of the `impl Fn` form. Are you sure that's the right way to do it? – tadman Jul 19 '22 at 08:04
  • What does `tag` return? – tadman Jul 19 '22 at 08:05
  • @tadman here is the signature of `tag`: `pub fn tag<T, Input, Error: ParseError<Input>>(tag: T) -> impl Fn(Input) -> IResult<Input, Input, Error> where Input: InputTake + Compare<T>, T: InputLength + Clone`. That's why I use `impl Fn`. https://docs.rs/nom/latest/nom/bytes/complete/fn.tag.html – hillin Jul 19 '22 at 08:08
  • Can you post an executable example? https://stackoverflow.com/help/minimal-reproducible-example – Dogbert Jul 19 '22 at 08:09
  • @Dogbert I was trying to but it seems rust playground does not support nom anymore. I'll try to get a git repo together. – hillin Jul 19 '22 at 08:11
  • There's a good playground [here](https://www.rustexplorer.com/) that supports top 10k crates. Anyway you don't need a playground, but something we can run. – Chayim Friedman Jul 19 '22 at 08:22
  • I have a sneaking suspicion this is related to lifetimes that are omitted in the error response. `&'a str, &str` vs. `&str, &str`, for example. Given an input of `"|"` you're going to get `&'static str` for the first, presumably? I'd expect the input and output lifetimes to be the same, so potentially `fn(&'a str) -> ...<&'a str, &str>`. – tadman Jul 19 '22 at 08:35
  • @hillin don't link a git repository. What you do is: 1) create a new project with `cargo new`. 2) write your code in `main.rs` that reproduces your problem. No other files, just `main.rs`. If you need dependencies, like `nom`, it's of course fine to add them to `Cargo.toml`. Write your `main.rs` as short as possible to reproduce your problem, and try to omit everything that isn't related to it. 3) Copy and paste your `main.rs` code here. Bonus points if you also copy+paste the lines you modified in `Cargo.toml`. – Finomnis Jul 19 '22 at 09:53

1 Answer


Is this kind of what you were going for?

use nom::{bytes::complete::tag, IResult};

#[derive(Debug)]
pub enum TokenType {
    Foo,
    Bar,
}

#[derive(Debug)]
pub struct Token {
    pub token_type: TokenType,
    pub start_offset: usize,
    pub end_offset: usize,
}

#[derive(Debug)]
pub struct LexInput<'a> {
    source: &'a str,
    location: usize,
}

impl<'a> LexInput<'a> {
    fn new(source: &'a str, location: usize) -> Self {
        Self { source, location }
    }
}

impl Token {
    fn new(token_type: TokenType, start_offset: usize, end_offset: usize) -> Self {
        Self {
            token_type,
            start_offset,
            end_offset,
        }
    }
}

fn token<'a>(
    parser: impl Fn(&'a str) -> IResult<&'a str, &str>,
    token_type: TokenType,
) -> impl FnOnce(LexInput<'a>) -> IResult<LexInput<'a>, Token> {
    move |input: LexInput| {
        let start_offset = input.location;
        let (remaining_source, output) =
            parser(input.source).map_err(|e| e.map_input(|_| input))?;
        let end_offset = start_offset + output.len();
        let token = Token::new(token_type, start_offset, end_offset);
        let remaining = LexInput::new(remaining_source, end_offset);

        Ok((remaining, token))
    }
}

fn main() {
    let source = "|foo".to_string();
    let (remaining, token) = token(tag("|"), TokenType::Bar)(LexInput::new(&source, 0)).unwrap();
    println!("remaining: {:?}", remaining);
    println!("token: {:?}", token);
}
remaining: LexInput { source: "foo", location: 1 }
token: Token { token_type: Bar, start_offset: 0, end_offset: 1 }

Your main mistakes were lifetime-related. Everywhere you don't annotate a lifetime, a default lifetime is inferred, and that default does not fulfill `'a`.

fn token<'a>(
    // The result can't be `'a` if it refers to the input `&str`, the input also has to be `'a`.
    parser: impl Fn(&str) -> IResult<&'a str, &str>,
    token_type: TokenType,
// Same here, `&LexInput` needs to be `'a`. But as it has a lifetime attached, just use that one instead: `LexInput<'a>`.
) -> impl Fn(&LexInput) -> IResult<LexInput<'a>, Token> {
    // Same here, although here the anonymous lifetime is sufficient to figure it out
    move |input: &LexInput| {
        let start_offset = input.location;
        // Here, an error conversion is missing, because the error carries the
        // input and therefore can't just be raised directly; `parser` has `&str`
        // as input, while `token` has `LexInput` as input. Luckily, the
        // `map_input` method exists.
        let (remaining_source, output) = parser(input.source)?;
        let end_offset = start_offset + output.len();
        let token = Token::new(token_type, start_offset, end_offset);
        let remaining = LexInput::new(remaining_source, end_offset);

        Ok((remaining, token))
    }
}

Further remarks

There is already the nom_locate crate that does exactly what you are attempting to do here.

The big advantage of the nom_locate crate is that the LocatedSpan type can directly be used by nom's parsers. No need to convert back and forth between your type and &str. This makes the code a lot simpler.

use nom::{bytes::complete::tag, IResult};

use nom_locate::LocatedSpan;

type Span<'a> = LocatedSpan<&'a str>;

#[derive(Debug)]
pub enum TokenType {
    Foo,
    Bar,
}

#[derive(Debug)]
pub struct Token {
    pub token_type: TokenType,
    pub start_offset: usize,
    pub end_offset: usize,
}

impl Token {
    fn new(token_type: TokenType, start_offset: usize, end_offset: usize) -> Self {
        Self {
            token_type,
            start_offset,
            end_offset,
        }
    }
}

fn token<'a>(
    parser: impl Fn(Span<'a>) -> IResult<Span<'a>, Span<'a>>,
    token_type: TokenType,
) -> impl FnOnce(Span<'a>) -> IResult<Span<'a>, Token> {
    move |input: Span| {
        let start_offset = input.location_offset();
        let (remaining, _) = parser(input)?;
        let end_offset = remaining.location_offset();
        let token = Token::new(token_type, start_offset, end_offset);
        Ok((remaining, token))
    }
}

fn main() {
    let source = "|foo".to_string();
    let (remaining, token) = token(tag("|"), TokenType::Bar)(Span::new(&source)).unwrap();
    println!("remaining: {:?}", remaining);
    println!("token: {:?}", token);
}
remaining: LocatedSpan { offset: 1, line: 1, fragment: "foo", extra: () }
token: Token { token_type: Bar, start_offset: 0, end_offset: 1 }

With the help of nom::combinator::map and a little bit of restructuring, you can reduce it down even further:

use nom::{bytes::complete::tag, combinator::map, IResult};

use nom_locate::LocatedSpan;

type Span<'a> = LocatedSpan<&'a str>;

#[derive(Debug, Clone)]
pub enum TokenType {
    Foo,
    Bar,
}

#[derive(Debug)]
pub struct Token {
    pub token_type: TokenType,
    pub start_offset: usize,
    pub end_offset: usize,
}

impl Token {
    fn new(token_type: TokenType, start_offset: usize, end_offset: usize) -> Self {
        Self {
            token_type,
            start_offset,
            end_offset,
        }
    }
}

fn token<'a>(
    parser: impl Fn(Span<'a>) -> IResult<Span<'a>, Span<'a>>,
    token_type: TokenType,
) -> impl FnMut(Span<'a>) -> IResult<Span<'a>, Token> {
    map(parser, move |matched| {
        Token::new(
            token_type.clone(),
            matched.location_offset(),
            matched.location_offset() + matched.len(),
        )
    })
}

fn main() {
    let source = "|foo".to_string();
    let (remaining, token) = token(tag("|"), TokenType::Bar)(Span::new(&source)).unwrap();
    println!("remaining: {:?}", remaining);
    println!("token: {:?}", token);
}
remaining: LocatedSpan { offset: 1, line: 1, fragment: "foo", extra: () }
token: Token { token_type: Bar, start_offset: 0, end_offset: 1 }
Finomnis