Replace Accented Characters in string to standard with LUA

Question

**Update: this error looks like a bug in Unity. The code seems to work fine outside of Tabletop Simulator (the game I am modding).**

**I'm marking this as solved but leaving it up for the mods to remove if needed, as the code might still be useful to other people googling.**

I'm trying to process a large string of a few lines, and would like to have all the accented characters it finds converted into standard characters. I have some code I got from the net for this, but there is a small bug in it and I do not understand how it works, so I need some help with this issue if you are able.

function stripChars(str)
    local tableAccents = {}
        tableAccents["à"] = "a"
        tableAccents["á"] = "a"
        tableAccents["â"] = "a"
        tableAccents["ã"] = "a"
        tableAccents["ä"] = "a"
        tableAccents["ç"] = "c"
        tableAccents["è"] = "e"
        tableAccents["é"] = "e"
        tableAccents["ê"] = "e"
        tableAccents["ë"] = "e"
        tableAccents["ì"] = "i"
        tableAccents["í"] = "i"
        tableAccents["î"] = "i"
        tableAccents["ï"] = "i"
        tableAccents["ñ"] = "n"
        tableAccents["ò"] = "o"
        tableAccents["ó"] = "o"
        tableAccents["ô"] = "o"
        tableAccents["õ"] = "o"
        tableAccents["ö"] = "o"
        tableAccents["ù"] = "u"
        tableAccents["ú"] = "u"
        tableAccents["û"] = "u"
        tableAccents["ü"] = "u"
        tableAccents["ý"] = "y"
        tableAccents["ÿ"] = "y"
        tableAccents["À"] = "A"
        tableAccents["Á"] = "A"
        tableAccents["Â"] = "A"
        tableAccents["Ã"] = "A"
        tableAccents["Ä"] = "A"
        tableAccents["Ç"] = "C"
        tableAccents["È"] = "E"
        tableAccents["É"] = "E"
        tableAccents["Ê"] = "E"
        tableAccents["Ë"] = "E"
        tableAccents["Ì"] = "I"
        tableAccents["Í"] = "I"
        tableAccents["Î"] = "I"
        tableAccents["Ï"] = "I"
        tableAccents["Ñ"] = "N"
        tableAccents["Ò"] = "O"
        tableAccents["Ó"] = "O"
        tableAccents["Ô"] = "O"
        tableAccents["Õ"] = "O"
        tableAccents["Ö"] = "O"
        tableAccents["Ù"] = "U"
        tableAccents["Ú"] = "U"
        tableAccents["Û"] = "U"
        tableAccents["Ü"] = "U"
        tableAccents["Ý"] = "Y"
    local normalizedString = ''

    for strChar in string.gmatch(str, "([%z\1-\127\194-\244][\128-\191]*)") do
        if tableAccents[strChar] ~= nil then
            normalizedString = normalizedString..tableAccents[strChar]
        else
            normalizedString = normalizedString..strChar
        end
    end
    return normalizedString
end

This code seems to work really well, but it doesn't work for the accented "u" characters:

local test = "ù, ú, û, ü"
print(stripChars(test)) -- Prints (,,,)
test = "à, á, â, ã, ä"
print(stripChars(test)) -- Prints (a, a, a, a, a)

Any ideas? I assume it has something to do with the pattern, but I do not see how exactly it works in the first place (see the `string.gmatch` call at the bottom of the code block, under the large table of characters).

I cannot reproduce the problem (I get plain `u`s out) when I copy+paste your code as-is in Lua 5.3. It's possible something in encoding is lost in Stackoverflow. Could you explain `test` as a sequence of bytes? For example, using `print(test:byte(1, #test))` — Curtis Fenner, May 22 '18 at 04:04

score 2 · Accepted Answer · answered May 22 '18 at 04:11

I don't know why the function would work on "à, á, â, ã, ä" but would delete characters when used on "ù, ú, û, ü". The function assumes that both strings are encoded in UTF-8. Perhaps it is an encoding issue, but then I would expect it to fail in both cases. For me, calling the function on "ù, ú, û, ü" gives "u, u, u, u", as expected.

As Curtis F says, it might help to call print(string.byte(test, 1, -1)) on the string that is failing to find out how it is being encoded. I have the file encoded in UTF-8, so the values printed are 195 185 44 32 195 186 44 32 195 187 44 32 195 188.
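To make the byte check concrete, here is a minimal sketch of a helper that dumps a string as decimal byte values. The name `dumpBytes` is made up for this example, and the sample strings are written with decimal escapes so the result does not depend on how this page or your script file is encoded.

```lua
-- Dump a string as decimal byte values to see how it is actually encoded.
-- dumpBytes is a hypothetical helper name, not part of the original code.
local function dumpBytes(s)
    local bytes = {}
    for i = 1, #s do
        bytes[#bytes + 1] = tostring(s:byte(i))
    end
    return table.concat(bytes, " ")
end

-- "\195\185" is the UTF-8 encoding of "ù" (U+00F9), "\195\186" is "ú".
local utf8Text = "\195\185, \195\186"
print(dumpBytes(utf8Text))   --> 195 185 44 32 195 186

-- The same text in Latin-1 (ISO-8859-1) uses one byte per accented letter.
local latin1Text = "\249, \250"
print(dumpBytes(latin1Text)) --> 249 44 32 250
```

Running this on the failing string inside the environment that misbehaves should show whether the accented letters arrive as the two-byte UTF-8 sequences the function expects.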

How the function works is that "[%z\1-\127\194-\244][\128-\191]*" is a pattern that matches a single character (codepoint) encoded in UTF-8. Each codepoint takes 1 to 4 bytes: the first set matches a lead byte (an ASCII byte or the first byte of a multi-byte sequence), and "[\128-\191]*" matches the continuation bytes that follow it. The pattern, for instance, matches the single byte used to encode the comma character ("," is "\44") or the two bytes that are used to encode the accented letters ("ù" is "\195\185"). The for-loop looks up each character in the tableAccents table, where the keys are accented letters and the values are the corresponding unaccented ones (tableAccents["ù"] → "u"). If the character is a key in the table, the value for that key is added to normalizedString. If the character is not a key in the table, it is added without being changed. Thus the accented letters are replaced with unaccented ones, while other characters are left alone.
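As a small illustration of the pattern described above, the following sketch splits a string into its UTF-8 characters and prints the length and byte values of each one. The names `UTF8_CHAR` and `utf8Chars` are invented for this example.

```lua
-- Pattern from the question: one lead byte, then any continuation bytes.
local UTF8_CHAR = "[%z\1-\127\194-\244][\128-\191]*"

-- Split a UTF-8 string into a list of characters (each a 1 to 4 byte string).
-- utf8Chars is a hypothetical helper name.
local function utf8Chars(s)
    local chars = {}
    for ch in s:gmatch(UTF8_CHAR) do
        chars[#chars + 1] = ch
    end
    return chars
end

local sample = "\195\185, u"  -- "ù, u" written with decimal escapes
for _, ch in ipairs(utf8Chars(sample)) do
    -- byte count first, then the byte values of this character
    print(#ch, ch:byte(1, -1))
end
-- prints (tab-separated):
-- 2  195  185
-- 1  44
-- 1  32
-- 1  117
```

The accented letter comes out as one two-byte item, which is why it can be used directly as a key into `tableAccents`.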

This is just a code cleanup suggestion: the for-loop could be simplified by using string.gsub:

local normalizedString = str:gsub("[%z\1-\127\194-\244][\128-\191]*", tableAccents)
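For reference, here is a minimal, self-contained sketch of what the `gsub` version could look like as a whole function. The table is cut down to four entries for brevity, and the names `ACCENTS` and `stripCharsGsub` are assumptions made for this example, not part of the original code.

```lua
local UTF8_CHAR = "[%z\1-\127\194-\244][\128-\191]*"

-- Shortened mapping for illustration; keys are UTF-8 byte sequences.
local ACCENTS = {
    ["\195\160"] = "a",  -- à
    ["\195\169"] = "e",  -- é
    ["\195\185"] = "u",  -- ù
    ["\195\188"] = "u",  -- ü
}

local function stripCharsGsub(str)
    -- gsub returns the new string AND the number of matches; the extra
    -- parentheses keep only the string. Characters with no entry in the
    -- table are left as they are.
    return (str:gsub(UTF8_CHAR, ACCENTS))
end

print(stripCharsGsub("\195\185, \195\188, \195\169"))  --> u, u, e
```

One behavioural difference is worth knowing: bytes that the pattern cannot match at all (for example a stray "\249") are dropped by the `gmatch` loop, because only matched pieces are appended to the result, whereas `gsub` leaves unmatched bytes in place.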

score 1 · Answer 2 · answered May 23 '19 at 12:37

Just in case anyone needs a more complete list, I thought I'd add it here. Thanks for the help with this!

function stripChars(str)
  local tableAccents = {}
    tableAccents["À"] = "A"
    tableAccents["Á"] = "A"
    tableAccents["Â"] = "A"
    tableAccents["Ã"] = "A"
    tableAccents["Ä"] = "A"
    tableAccents["Å"] = "A"
    tableAccents["Æ"] = "AE"
    tableAccents["Ç"] = "C"
    tableAccents["È"] = "E"
    tableAccents["É"] = "E"
    tableAccents["Ê"] = "E"
    tableAccents["Ë"] = "E"
    tableAccents["Ì"] = "I"
    tableAccents["Í"] = "I"
    tableAccents["Î"] = "I"
    tableAccents["Ï"] = "I"
    tableAccents["Ð"] = "D"
    tableAccents["Ñ"] = "N"
    tableAccents["Ò"] = "O"
    tableAccents["Ó"] = "O"
    tableAccents["Ô"] = "O"
    tableAccents["Õ"] = "O"
    tableAccents["Ö"] = "O"
    tableAccents["Ø"] = "O"
    tableAccents["Ù"] = "U"
    tableAccents["Ú"] = "U"
    tableAccents["Û"] = "U"
    tableAccents["Ü"] = "U"
    tableAccents["Ý"] = "Y"
    tableAccents["Þ"] = "P"
    tableAccents["ß"] = "s"
    tableAccents["à"] = "a"
    tableAccents["á"] = "a"
    tableAccents["â"] = "a"
    tableAccents["ã"] = "a"
    tableAccents["ä"] = "a"
    tableAccents["å"] = "a"
    tableAccents["æ"] = "ae"
    tableAccents["ç"] = "c"
    tableAccents["è"] = "e"
    tableAccents["é"] = "e"
    tableAccents["ê"] = "e"
    tableAccents["ë"] = "e"
    tableAccents["ì"] = "i"
    tableAccents["í"] = "i"
    tableAccents["î"] = "i"
    tableAccents["ï"] = "i"
    tableAccents["ð"] = "d" -- lowercase eth, consistent with "Ð" = "D" above
    tableAccents["ñ"] = "n"
    tableAccents["ò"] = "o"
    tableAccents["ó"] = "o"
    tableAccents["ô"] = "o"
    tableAccents["õ"] = "o"
    tableAccents["ö"] = "o"
    tableAccents["ø"] = "o"
    tableAccents["ù"] = "u"
    tableAccents["ú"] = "u"
    tableAccents["û"] = "u"
    tableAccents["ü"] = "u"
    tableAccents["ý"] = "y"
    tableAccents["þ"] = "p"
    tableAccents["ÿ"] = "y"

  -- gsub returns the new string and a match count; only the string is kept here
  local normalisedString = str:gsub("[%z\1-\127\194-\244][\128-\191]*", tableAccents)

  return normalisedString

end

artyfox · Answer 3 · 2022-08-09T03:22:06.397

I just thought I'd submit a modified function that uses a string map (instead of one table entry per character), which makes it easier to add characters.

function normalizeLatin (str,ind)
    local unimask = "[%z\1-\127\194-\244][\128-\191]*"
    return str:gsub(unimask, function(unichar) 
        local charmap = --"Basic Latin".."Latin-1 Supplement".."Latin Extended-A".."Latin Extended-B"..
        "A".."ÀÁÂÃÄÅ".."ĀĂĄ".."ǍǞǠǺȀȂȦȺ"..
            "AE".."Æ".."".."ǢǼ"..
        "B".."ß".."".."ƁƂƄɃ"..
        "C".."Ç".."ĆĈĊČ".."ƆƇȻ"..
        "D".."Ð".."ĎĐ".."ƉƊ"..
            "DZ".."".."".."ƻǄǱ"..
            "Dz".."".."".."ǅǲ"..
        "E".."ÈÉÊË".."ĒĔĖĘĚ".."ƎƏƐȄȆȨɆ"..
        "F".."".."".."Ƒ"..
        "G".."".."ĜĞĠĢ".."ƓǤǦǴ"..
        "H".."".."ĤĦ".."Ȟ"..
            "Hv".."".."".."Ƕ"..
        "I".."ÌÍÎÏ".."ĨĪĬĮİ".."ƖƗǏȈȊ"..
            "IJ".."".."Ĳ"..""..
        "J".."".."Ĵ".."Ɉ"..
        "K".."".."Ķ".."ƘǨ"..
        "L".."".."ĹĻĽĿŁ".."Ƚ"..
            "LJ".."".."".."Ǉ"..
            "Lj".."".."".."ǈ"..
        "N".."Ñ".."ŃŅŇŊ".."ƝǸȠ"..
            "NJ".."".."".."Ǌ"..
            "Nj".."".."".."ǋ"..
        "O".."ÒÓÔÕÖØ".."ŌŎŐ".."ƟƠǑǪǬǾȌȎȪȬȮȰ"..
            "OE".."".."Œ"..
            "OI".."".."".."Ƣ"..
            "OU".."".."".."Ȣ"..
        "P".."Þ".."".."ƤǷ"..
        "Q".."".."".."Ɋ"..
        "R".."".."ŔŖŘ".."ȐȒɌ"..
        "S".."".."ŚŜŞŠ".."ƧƩƪƼȘ"..
        "T".."".."ŢŤŦ".."ƬƮȚȾ"..
        "U".."ÙÚÛÜ".."ŨŪŬŮŰŲ".."ƯƱƲȔȖɄǓǕǗǙǛ"..
        "V".."".."".."Ʌ"..
        "W".."".."Ŵ".."Ɯ"..
        "Y".."Ý".."ŶŸ".."ƳȜȲɎ"..
        "Z".."".."ŹŻŽ".."ƵƷƸǮȤ"..
        "a".."àáâãäå".."āăą".."ǎǟǡǻȁȃȧ"..
            "ae".."æ".."".."ǣǽ"..
        "b".."".."".."ƀƃƅ"..
        "c".."ç".."ćĉċč".."ƈȼ"..
        "d".."ð".."".."ƌƋƍȡďđ"..
            "db".."".."".."ȸ"..
            "dz".."".."".."ǆǳ"..    
        "e".."èéêë".."ēĕėęě".."ǝȅȇȩɇ"..
        "f".."".."".."ƒ"..
        "g".."".."ĝğġģ".."Ɣǥǧǵ"..
        "h".."".."ĥħ".."ȟ"..
            "hv".."".."".."ƕ"..
        "i".."ìíîï".."ĩīĭįı".."ǐȉȋ"..
            "ij".."".."ĳ"..""..
        "j".."".."ĵ".."ǰȷɉ"..
        "k".."".."ķĸ".."ƙǩ"..
        "l".."".."ĺļľŀł".."ƚƛȴ"..
            "lj".."".."".."ǉ"..
        "n".."ñ".."ńņňŉŋ".."ƞǹȵ"..
            "nj".."".."".."ǌ"..
        "o".."òóôõöø".."ōŏő".."ơǒǫǭǿȍȏȫȭȯȱ"..
            "oe".."".."œ"..""..
            "oi".."".."".."ƣ"..
            "ou".."".."".."ȣ"..
        "p".."þ".."".."ƥƿ"..
        "q".."".."".."ɋ"..
            "qp".."".."".."ȹ"..
        "r".."".."ŕŗř".."Ʀȑȓɍ"..
        "s".."".."śŝşšſ".."ƨƽșȿ"..
        "t".."".."ţťŧ".."ƫƭțȶ"..
            "ts".."".."".."ƾ"..
        "u".."ùúûü".."ũūŭůűų".."ưǔǖǘǚǜȕȗ"..
        "w".."".."ŵ"..""..
        "y".."ýÿ".."ŷ".."ƴȝȳɏ"..
        "z".."".."źżž".."ƶƹƺǯȥɀ"..
        ""
        unichar = unichar:gsub('[%(%)%.%%%+%-%*%?%[%^%$]','%%%0') --escape magic characters
        return unichar:match("%a") or charmap:match("(%a+)[^%a]-"..unichar)
    end, ind)
end

It covers the letters in Unicode blocks Latin-1 Supplement, Latin Extended-A and Latin Extended-B (U+00C0–U+024F, except "×÷ǀǁǂǃɁɂ"), but can be easily extended to more blocks.
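If the per-character `charmap:match` lookup ever becomes a bottleneck, one possible variation (a different technique from the function above, sketched here with only a handful of letters) is to expand a compact string map into an ordinary lookup table once and then let `gsub` do the replacement. The `GROUPS` format, `LOOKUP`, and `normalizeWithLookup` are all hypothetical names chosen for this example.

```lua
local UTF8_CHAR = "[%z\1-\127\194-\244][\128-\191]*"

-- Hypothetical compact map: each entry is "<replacement>=<accented letters>".
-- Only a few Latin-1 letters are listed, written as UTF-8 byte escapes.
local GROUPS = {
    "A=\195\128\195\129",   -- À Á
    "AE=\195\134",          -- Æ
    "e=\195\168\195\169",   -- è é
    "u=\195\185\195\188",   -- ù ü
}

-- Expand the groups into a plain lookup table, once, at load time.
local LOOKUP = {}
for _, group in ipairs(GROUPS) do
    local replacement, letters = group:match("^(%a+)=(.+)$")
    for ch in letters:gmatch(UTF8_CHAR) do
        LOOKUP[ch] = replacement
    end
end

local function normalizeWithLookup(str)
    -- Parentheses drop gsub's second return value (the match count).
    return (str:gsub(UTF8_CHAR, LOOKUP))
end

print(normalizeWithLookup("\195\128 la carte, caf\195\169"))  --> A la carte, cafe
```

Adding a character is still a one-line edit to a string, but each lookup at run time is a single table index rather than a pattern search through the whole map.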

In general, the mapping converts each character to its unaccented, unmirrored, uncombined, and unstylized form using this article as a reference, with the special cases of Yogh mapped to 'y' instead of 'z',

yogh (ȝogh) (Ȝ ȝ; Scots: yoch; Middle English: ȝogh) was used in Middle English and Older Scots, representing y (/j/) and various velar phonemes.

and Hwair (ƕ) mapped to 'hv' instead of 'hu'.

The Gothic letter is transliterated with the Latin ligature of the same name, ƕ, which was introduced by philologists around 1900 to replace the digraph hv

Here are some sample inputs/outputs:

input: "ƮĤË ɊǕĨȻǨ ɃȐØŴǸ ƑÕX ĴŲMǷŞ ƠɅƐȐ ƬĤȨ ĿǺƵȲ ÐƟĞ."
output: "THE QUICK BROWN FOX JUMPS OVER THE LAZY DOG."

input: "țĥé ɋŭíçĸ ƀŗôŵñ ƒøx ĵûmþś ôvǝȑ ȶħȩ ȴãƺɏ ƌȭğ."
output: "the quick brown fox jumps over the lazy dog."

Hopefully this helps somebody, somewhere!
