Rust - How to parse UTF-8 alphabetical characters in nom?

Question

I am trying to parse sequences of alphabetical characters, including German umlauts (ä ö ü) and other alphabetical characters from the UTF-8 charset. This is the parser I tried first:

named!(
    parse(&'a str) -> Self,
    map!(
        alpha1,
        |s| Self { chars: s.into() }
    )
);

But it only works for ASCII alphabetical characters (a-zA-Z). I tried to perform the parsing char by char:

named!(
    parse(&str) -> Self,
    map!(
        take_while1!(nom::AsChar::is_alpha),
        |s| Self { chars: s.into() }
    )
);

But this won't even parse "hello"; instead it results in an Incomplete(Size(1)) error.

How do you parse UTF-8 alphabetical characters in nom? A snippet from my code:

extern crate nom;

#[derive(PartialEq, Debug, Eq, Clone, Hash, Ord, PartialOrd)]
pub struct Word {
    chars: String,
}

impl From<&str> for Word {
    fn from(s: &str) -> Self {
        Self {
            chars: s.into(),
        }
    }
}

use nom::*;
impl Word {
    named!(
        parse(&str) -> Self,
        map!(
            take_while1!(nom::AsChar::is_alpha),
            |s| Self { chars: s.into() }
        )
    );
}


#[test]
fn parse_word() {
    let words = vec![
        "hello",
        "Hi",
        "aha",
        "Mathematik",
        "mathematical",
        "erfüllen"
    ];
    for word in words {
        assert_eq!(Word::parse(word).unwrap().1, Word::from(word));
    }
}

When I run this test,

cargo test parse_word

I get:

thread panicked at 'called `Result::unwrap()` on an `Err` value: Incomplete(Size(1))', ...

I know that strings are already UTF-8 encoded in Rust (thank heavens, almighty), but it seems that the nom library is not behaving as I would expect. I am using nom 5.1.0.

score 2 · Answer 1 · answered Jan 11 '20 at 01:47

First, nom 5 uses functions for parsing. I advise using this form because the error messages are much better and the code is much cleaner.
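For context on the Incomplete(Size(1)) from the question: as far as I understand nom 5, the macro combinators such as take_while1! use streaming semantics on &str, so when every character of the input matches, the parser cannot tell whether more input would follow and reports Incomplete. The function combinators under nom::bytes::complete and nom::character::complete treat the end of the input as the end of the match. The following is a std-only model of that difference, not nom's actual code; the enum Outcome and the functions first_non_alpha, streaming_alpha1 and complete_alpha1 are hypothetical names made up for this sketch.

```rust
/// Simplified stand-in for a parser result (hypothetical, not nom's IResult).
#[derive(Debug, PartialEq)]
enum Outcome<'a> {
    Done { rest: &'a str, matched: &'a str },
    Incomplete,
    Error,
}

/// Byte offset of the first non-alphabetic char, if there is one.
fn first_non_alpha(input: &str) -> Option<usize> {
    input
        .char_indices()
        .find(|(_, c)| !c.is_alphabetic())
        .map(|(i, _)| i)
}

/// Streaming model: running out of input means "more data might follow".
fn streaming_alpha1(input: &str) -> Outcome<'_> {
    match first_non_alpha(input) {
        Some(0) => Outcome::Error,
        Some(i) => Outcome::Done { rest: &input[i..], matched: &input[..i] },
        None => Outcome::Incomplete,
    }
}

/// Complete model: running out of input simply ends the match.
fn complete_alpha1(input: &str) -> Outcome<'_> {
    match first_non_alpha(input) {
        Some(0) => Outcome::Error,
        Some(i) => Outcome::Done { rest: &input[i..], matched: &input[..i] },
        None if input.is_empty() => Outcome::Error,
        None => Outcome::Done { rest: &input[input.len()..], matched: input },
    }
}

#[test]
fn streaming_vs_complete() {
    // The whole input matches, so the streaming model cannot decide yet.
    assert_eq!(streaming_alpha1("hello"), Outcome::Incomplete);
    // A trailing non-alphabetic char lets the streaming model finish.
    assert_eq!(
        streaming_alpha1("hello "),
        Outcome::Done { rest: " ", matched: "hello" }
    );
    // The complete model accepts the end of the input as a terminator.
    assert_eq!(
        complete_alpha1("hello"),
        Outcome::Done { rest: "", matched: "hello" }
    );
}
```

If this model matches nom's behaviour, it would also explain why the tests in the question fail even for plain ASCII words: none of the inputs has a character after the word.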

Your requirement is odd: you could just take the full input, turn it into a String, and be done:

impl Word {
    fn parse(input: &str) -> IResult<&str, Self> {
        Ok((
            &input[input.len()..],
            Self {
                chars: input.to_string(),
            },
        ))
    }
}

But I guess your purpose is to parse a word, so here is an example of what you could do:

#[derive(PartialEq, Debug, Eq, Clone, Hash, Ord, PartialOrd)]
pub struct Word {
    chars: String,
}

impl From<&str> for Word {
    fn from(s: &str) -> Self {
        Self { chars: s.into() }
    }
}

use nom::{character::complete::*, combinator::*, multi::*, sequence::*, IResult};

impl Word {
    fn parse(input: &str) -> IResult<&str, Self> {
        let (input, word) =
            delimited(space0, recognize(many1_count(none_of(" \t"))), space0)(input)?;
        Ok((
            input,
            Self {
                chars: word.to_string(),
            },
        ))
    }
}

#[test]
fn parse_word() {
    let words = vec![
        "hello",
        " Hi",
        "aha ",
        " Mathematik ",
        "  mathematical",
        "erfüllen ",
    ];
    for word in words {
        assert_eq!(Word::parse(word).unwrap().1, Word::from(word.trim()));
    }
}

You could also write a custom function that uses is_alphabetic() instead of none_of(" \t"), but this requires defining a custom error for nom, which is currently, in my opinion, very annoying to do.
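To make the is_alphabetic() idea concrete, here is a minimal std-only sketch of the underlying logic. It is not written against nom's API: the function name take_alphabetic1 is hypothetical, and it returns an Option with a (rest, word) pair, in the same order as nom's results, instead of an IResult. I believe nom 5's nom::bytes::complete::take_while1 accepts a predicate over char for &str input, so the same predicate might plug in there without a custom error type, but treat that as an assumption to check against the nom docs.

```rust
/// Splits `input` into `(rest, word)`, where `word` is the longest non-empty
/// prefix consisting only of Unicode alphabetic characters.
/// Returns `None` if the input does not start with an alphabetic character.
/// (Hypothetical helper for illustration, not part of nom.)
fn take_alphabetic1(input: &str) -> Option<(&str, &str)> {
    // Byte offset of the first non-alphabetic char, or the end of the input.
    // `char_indices` yields offsets on char boundaries, so slicing is safe
    // even for multi-byte characters such as 'ü'.
    let end = input
        .char_indices()
        .find(|(_, c)| !c.is_alphabetic())
        .map(|(i, _)| i)
        .unwrap_or(input.len());

    if end == 0 {
        None
    } else {
        Some((&input[end..], &input[..end]))
    }
}

#[test]
fn take_alphabetic1_handles_umlauts() {
    assert_eq!(
        take_alphabetic1("erfüllen bitte"),
        Some((" bitte", "erfüllen"))
    );
    assert_eq!(take_alphabetic1("Mathematik"), Some(("", "Mathematik")));
    // Digits are not alphabetic, so nothing is consumed.
    assert_eq!(take_alphabetic1("123"), None);
}
```

Unlike none_of(" \t"), this stops at digits and punctuation as well as whitespace, which is closer to what the question asks for.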

score 0 · Accepted Answer · answered Mar 19 '20 at 17:14


In this GitHub issue, a fellow contributor quickly whipped up a library (nom-unicode) to handle this nicely:

use nom_unicode::complete::{alphanumeric1};

impl Word {
    named!(
        parse(&'a str) -> Self,
        map!(
            alphanumeric1,
            |w| Self::new(w)
        )
    );
}
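One thing to keep in mind: alphanumeric1 accepts digits as well as letters, which is broader than the "alphabetical characters" the question asks about. Assuming nom-unicode's classification follows std's char::is_alphabetic and char::is_alphanumeric (an assumption; check the crate's documentation), the difference can be sketched with std alone. The helper leading is a hypothetical name used only for this illustration.

```rust
/// Longest prefix of `input` whose chars all satisfy `pred`.
/// (Hypothetical helper for illustration, not part of nom or nom-unicode.)
fn leading<'a>(input: &'a str, pred: impl Fn(char) -> bool) -> &'a str {
    let end = input
        .char_indices()
        .find(|&(_, c)| !pred(c))
        .map(|(i, _)| i)
        .unwrap_or(input.len());
    &input[..end]
}

#[test]
fn alphabetic_vs_alphanumeric() {
    // Alphabetic only: stops at the first digit.
    assert_eq!(leading("Straße42 x", char::is_alphabetic), "Straße");
    // Alphanumeric: digits are consumed too, stops at the space.
    assert_eq!(leading("Straße42 x", char::is_alphanumeric), "Straße42");
}
```

If digits should not be part of a word, an alphabetic-only parser would match the question more closely than an alphanumeric one.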

answered Mar 19 '20 at 17:14

stimulate


You really should use the last version, it is much nicer – Boiethios Mar 19 '20 at 17:25
@Boiethios which version do you mean? – stimulate Mar 19 '20 at 17:28
The API of version 5, as suggested in the other answer. For example, you can replace the `map!` macro with https://docs.rs/nom/5.1.1/nom/combinator/fn.map.html, etc. – Boiethios Mar 19 '20 at 17:32
Ah so you mean I should use the function combinators instead of the macro ones? What do you say would be the point? I am quite happy with the macro syntax. – stimulate Mar 19 '20 at 17:49
