UTF-8 match position

Question

Is it somehow possible to get the character position of matched pattern in Ragel?

I know a match receives a pointer into the string (char *), i.e. the byte offset where the pattern was found inside the string. The problem is that UTF-8 is a variable-length encoding, so character offsets and byte offsets do not have to coincide.

For example, if I wanted to search for $ in €€$ I would like to get 2, instead of 6 ($ → 0x24, € → 0xE282AC).

ArtemGr · Accepted Answer · 2015-01-24T08:50:48.680

Ragel generates a tight piece of source code which is embedded into your favorite language. This code doesn't use any libraries, neither ones provided by Ragel nor the language's standard library. As such, it has no means to parse UTF-8 or calculate the length of a UTF-8 string.

What it can do, though, is give you pointers into the portion of the string you're interested in. Given those, you can calculate its UTF-8 length using your favorite language-specific tools. For example, in C++ you could use the cxxtools' Utf8Codec::do_length method (or any other library you can think of) to get the UTF-8 length of the €€ piece after the Ragel code returns it to you.
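If pulling in a library is not desirable, the same count can be done by hand. Here is a minimal sketch in C, assuming the input is valid UTF-8 and the range starts on a code point boundary. The name `utf8_count` is made up for this illustration and is not part of Ragel or cxxtools.

```c
#include <assert.h>
#include <stddef.h>

/* Count the code points in the byte range [start, end).
 * Assumption: the range holds valid UTF-8 and begins on a code point
 * boundary. Continuation bytes look like 10xxxxxx, so every byte that
 * does not match that pattern starts a new code point. */
size_t utf8_count(const char *start, const char *end)
{
    assert(start <= end);
    size_t count = 0;
    for (const char *p = start; p < end; ++p) {
        if (((unsigned char)*p & 0xC0) != 0x80)
            ++count;
    }
    return count;
}
```

With the buffer holding €€$ and the match pointer sitting on the $ (byte offset 6), `utf8_count(buffer, match)` yields 2, the character position asked for in the question.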

You can also tune Ragel to use 16-bit characters and feed UCS-2 to it, as discussed by Wil Macaulay and Wincent Colaiuta. 32-bit characters with UCS-4 should be even better.
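To feed 32-bit characters to the machine, the UTF-8 input has to be decoded first. The following is only a rough sketch of such a decoder in C. It assumes valid UTF-8 and does no error checking, and `utf8_to_ucs4` is a hypothetical helper name. On the Ragel side the wider alphabet would be selected with the `alphtype` statement, which I have not tested in this setup.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode UTF-8 from src[0..len) into out, which must have room for at
 * least len elements. Returns the number of code points written.
 * Assumption: the input is valid UTF-8. Malformed input is not detected
 * and simply produces meaningless code points. */
size_t utf8_to_ucs4(const unsigned char *src, size_t len, uint32_t *out)
{
    assert(src != NULL && out != NULL);
    size_t n = 0;
    size_t i = 0;
    while (i < len) {
        unsigned char b = src[i++];
        uint32_t cp;
        size_t extra; /* number of continuation bytes that follow */
        if (b < 0x80)      { cp = b;        extra = 0; }
        else if (b < 0xE0) { cp = b & 0x1F; extra = 1; }
        else if (b < 0xF0) { cp = b & 0x0F; extra = 2; }
        else               { cp = b & 0x07; extra = 3; }
        while (extra > 0 && i < len) {
            /* Each continuation byte contributes its low 6 bits. */
            cp = (cp << 6) | (src[i++] & 0x3Fu);
            --extra;
        }
        out[n++] = cp;
    }
    return n;
}
```

Once the machine runs over the decoded array, the match pointer minus the start of the array is already the character position, so no separate counting step is needed.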

Yet another angle could be to generate a state machine handling UTF-8 using the unicode2ragel.rb script and attempt to modify it to count the number of transitions. (I've no idea whether that'll work or not; I've never used that state machine myself.)

There is one thing which I don't understand. My question above, as it is written now, has the implicit assumption (sorry about that!) that I already have a machine which matches some patterns I'm interested in. How should I combine my machine with the machine for matching UTF-8, akin to what "unicode2ragel.rb" does? Should I count the code points both in each of my machine's rules and in each UTF-8 rule? - Ecir Hana, Feb 16 '15 at 12:47
As I've never worked with `unicode2ragel.rb`, I'd recommend creating a new, more specific question on Stack Overflow for somebody else to answer. Something along the lines of asking for an example of how to incorporate a `unicode2ragel.rb` machine into an existing one and how to use it to count the code points. - ArtemGr, Feb 16 '15 at 13:29
