
I'm trying to get a Bison parser to handle UTF-8 input. I don't want the parser to actually interpret the Unicode character values; I just want it to parse the UTF-8 string as a sequence of bytes.

Right now, Bison generates the following code, which is problematic:

  if (yychar <= YYEOF)
    {
      yychar = yytoken = YYEOF;
      YYDPRINTF ((stderr, "Now at end of input.\n"));
    }

The problem is that many bytes of the UTF-8 string will have a negative value, and Bison interprets any negative token value as EOF and stops.
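For illustration, a minimal standalone snippet (assuming a platform where plain char is signed) showing how a UTF-8 byte turns into a negative token value that trips the yychar <= YYEOF test above, since Bison defines YYEOF as 0:

  #include <stdio.h>

  int main(void)
  {
      const char utf8[] = "\xC3\xA9";         /* "é" encoded as UTF-8 */
      int tok = utf8[0];                      /* sign-extends to -61 here */
      printf("%d\n", tok);                    /* -61 <= YYEOF (0): "end of input" */
      printf("%d\n", (unsigned char)utf8[0]); /* 195: a safe, positive value */
      return 0;
  }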

Is there a way around this?

Martin Cote

3 Answers


bison yes, flex no. The one time I needed a Bison parser to work with UTF-8 encoded files, I ended up writing my own yylex function.

Edit: To help, I used a lot of the Unicode operations available in glib (there's a gunichar type and some file/string manipulation functions that I found useful).
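For concreteness, here is a minimal sketch of a hand-written yylex along these lines; it is not eduffy's actual code, and the cursor global is an assumption. It uses glib's gunichar routines only to classify characters, while handing Bison the raw bytes cast through unsigned char so they never come out negative:

  #include <glib.h>

  /* Assumed: cursor points into a NUL-terminated, validated UTF-8 buffer. */
  static const gchar *cursor;

  int yylex(void)
  {
      /* Skip Unicode whitespace one whole character at a time via glib. */
      while (*cursor != '\0' && g_unichar_isspace(g_utf8_get_char(cursor)))
          cursor = g_utf8_next_char(cursor);

      if (*cursor == '\0')
          return 0;                /* 0 is Bison's end-of-input token */

      /* Hand the raw byte to Bison, forced through unsigned char so
       * bytes >= 0x80 never become negative. */
      return (unsigned char)*cursor++;
  }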

eduffy
  • Well, my lexer handles the UTF-8 chars just fine, but the Bison parser stops parsing as soon as it sees a negative value. Please advise. – Martin Cote Jun 01 '09 at 14:52
  • Are you reading your file 1 byte at a time, or 1 UTF-8 encoded character at a time? – eduffy Jun 01 '09 at 14:53
  • Then that's the problem. The bit that makes a 'char' negative is the same high bit that marks a UTF-8 byte as part of a multi-byte sequence (IIRC). You need to use something other than fgetc. – eduffy Jun 01 '09 at 15:15
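One refinement to that last comment: strictly speaking, fgetc itself is safe, since it returns each byte as an unsigned char converted to an int (or EOF); the negative values appear only once the byte is stored into a plain char. A minimal byte-at-a-time yylex sketch along these lines (yyin is assumed to be the input stream):

  #include <stdio.h>

  FILE *yyin;                    /* assumed: stream the lexer reads from */

  int yylex(void)
  {
      int c = fgetc(yyin);       /* always 0..255, or EOF (-1) */
      if (c == EOF)
          return 0;              /* Bison's end-of-input token */
      return c;                  /* never negative, never mistaken for YYEOF */
  }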

flex being the issue here, you might want to take a look at zlex.

chaos
  • That's an interesting project, but it wouldn't exactly solve the problem addressed in this question. 16-bit characters are different from UTF-8 encoded characters (for one thing, UTF-8 characters can be up to 4 bytes in length). – eduffy Jun 01 '09 at 15:21

This is a question from 4 years ago, but I'm facing the same issues and I'd like to share my ideas.

The problem is that in UTF-8 you don't know in advance how many bytes to read. As suggested above, you can use your own lexer and have it either read whole lines, or read 4 bytes at a time: extract one UTF-8 character from that buffer, then read more bytes to top it back up to 4.
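As a sketch of the "how many bytes" part (assuming well-formed input; real code should also validate that continuation bytes fall in 0x80–0xBF), the sequence length can be read off the lead byte:

  /* Number of bytes in a UTF-8 sequence, from its lead byte. */
  static int utf8_len(unsigned char lead)
  {
      if (lead < 0x80)           return 1;   /* 0xxxxxxx: ASCII    */
      if ((lead & 0xE0) == 0xC0) return 2;   /* 110xxxxx           */
      if ((lead & 0xF0) == 0xE0) return 3;   /* 1110xxxx           */
      if ((lead & 0xF8) == 0xF0) return 4;   /* 11110xxx           */
      return -1;                             /* invalid lead byte  */
  }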

  • Although you may not know how many bytes to read per character until you actually read them, you probably don't need to know. To properly tokenize the byte stream, all you really need to know is what byte patterns are significant as keywords, delimiters, etc. The lexer doesn't need to interpret anything else; it just collects byte sequences into tokens. Even if you want to report character-literal tokens back to the caller, it is possible to write lexical pattern rules that match valid UTF-8 code sequences, and to use those to scan the incoming multibyte characters correctly. – John Bollinger Sep 26 '13 at 17:00
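To make that suggestion concrete, here is a sketch of flex definitions (the names are mine, hypothetical) that match well-formed UTF-8 sequences, following the byte ranges in the Unicode standard's Table 3-7; {UTF8CHAR} can then be used in rules like any other pattern:

  ASCII     [\x00-\x7F]
  U2        [\xC2-\xDF][\x80-\xBF]
  U3        \xE0[\xA0-\xBF][\x80-\xBF]|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}|\xED[\x80-\x9F][\x80-\xBF]
  U4        \xF0[\x90-\xBF][\x80-\xBF]{2}|[\xF1-\xF3][\x80-\xBF]{3}|\xF4[\x80-\x8F][\x80-\xBF]{2}
  UTF8CHAR  {ASCII}|{U2}|{U3}|{U4}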