Given a byte sequence that is valid UTF-8, I would like to iterate over each of the encoded characters. In my working environment (LuaTeX), only Lua 5.2 and LuaJIT 2.0.3 are available, so I cannot use any of the facilities provided by the UTF-8 library that comes with Lua 5.3.

Said library includes a pattern for the purpose stated, utf8.charpattern, which "matches exactly one UTF-8 byte sequence". I thought I could simply copy it and pass it to string.gmatch:

-- Taken from http://www.lua.org/manual/5.3/manual.html#pdf-utf8.charpattern
local utf8charpattern = "[\0-\x7F\xC2-\xF4][\x80-\xBF]*"

-- Correctly prints 14 in both cases (see below)
print( string.len(utf8charpattern) )

local function print_characters( s )
    local i = 1
    
    for c in s:gmatch( utf8charpattern ) do
        io.stdout:write( "#" , i , "\t" , c , "\n" )
        
        i = i + 1
    end
end

print_characters( "cántico" )

Under Lua 5.2 (standard luatex), this approach works fine. However, if I switch to the LuaJIT variant (luajittex), it fails with the following message:

malformed pattern (missing ']')

After struggling with the issue for a while, I realized that the problem was with the \0 byte that's at the beginning of the pattern. The size of the string (14 bytes) is correctly known by the Lua implementation in both cases, but it seems that the pattern matching functions (I also tried string.match) under LuaJIT stop reading the pattern after finding the infamous 0-byte.
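
A short probe along the following lines can confirm that diagnosis on a given interpreter. It is only a sketch: the names `probe`, `ok` and `result` are mine, and the code merely reports whether the pattern functions accept a 0-byte inside a character class.

```lua
-- A character class whose lower bound is a literal 0-byte.
-- \127 is the decimal form of \x7F.
local probe = "[\0-\127]"

-- pcall catches a "malformed pattern" error instead of aborting the script.
local ok, result = pcall(string.find, "a", probe)

if ok then
    print("embedded 0-byte accepted; match starts at", result)
else
    print("embedded 0-byte rejected:", result)
end
```

If the second branch runs, the failure comes from the 0-byte itself and not from the rest of the UTF-8 pattern.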

Questions

  • Is this behavior correct, or is it a bug in LuaJIT?
  • Be it correct behavior or a bug, how can it be worked around?
    I have already tried replacing \0 with %z, but it did not seem to work with the range syntax used in the pattern. While experimenting, I also replaced \0 with \001, and that seemed to work. Although I don't know much about the subject, I doubt that any of my strings contains a 0-byte, so perhaps that is an acceptable solution. If it is, please explain why.
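
For reference, the two workarounds mentioned above might be sketched as follows. This is a sketch under stated assumptions, not a confirmed fix: variant A assumes the input never contains a 0-byte, and variant B assumes the interpreter still accepts %z (Lua 5.2 deprecates it). In variant B, %z is listed next to the range instead of being used as a range endpoint. The names `pattern_no_nul`, `pattern_with_z` and `utf8_characters` are hypothetical.

```lua
-- Decimal escapes keep the pattern strings free of any 0-byte;
-- \127, \194, \244, \128 and \191 equal \x7F, \xC2, \xF4, \x80 and \xBF.

-- Variant A: lower bound raised from \0 to \1.
-- Assumption: the input never contains a 0-byte.
local pattern_no_nul = "[\1-\127\194-\244][\128-\191]*"

-- Variant B: %z (the class matching the 0-byte) placed beside the range.
-- Assumption: the interpreter still supports the deprecated %z class.
local pattern_with_z = "[%z\1-\127\194-\244][\128-\191]*"

-- Collects every match of the given pattern into a table.
local function utf8_characters(s, pattern)
    local chars = {}
    for c in s:gmatch(pattern) do
        chars[#chars + 1] = c
    end
    return chars
end

-- "cántico" with the á written as its two UTF-8 bytes (0xC3 0xA1).
local word = "c\195\161ntico"

print(#utf8_characters(word, pattern_no_nul))   -- 7
print(#utf8_characters("a\0b", pattern_with_z))
```

Variant A silently skips any 0-byte in the input, so it is only suitable when such bytes cannot occur.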
djsp