I'm looking for an implementation or library (ideally in Java) that will transform Unicode text such as the following into the corresponding ASCII English characters:

ʀᴇɢɪꜱᴛʀᴀᴛɪᴏɴ

The text above should be converted to:

REGISTRATION

Note, however, that there are other possible characters to be converted, such as in "cσdє".

The final goal is to do a phonetic/fuzzy match; however, I believe that becomes easy once the characters are actual ASCII English.

abdelrahman-sinno
  • Well, the big question is that you need to know what to map to what other character. Actually doing the replacing is a matter of just calling `replaceAll`. – Ben Jun 04 '18 at 11:49
  • @Ben I have a sample of about 60 strings, and I could work on implementing a character mapping covering the whole 'known' set. However, since there are so many characters, I am checking whether anyone has already put work into this. – abdelrahman-sinno Jun 04 '18 at 11:56
  • There's a relevant post here: https://security.stackexchange.com/questions/128286/list-of-visually-similar-characters-for-detecting-spoofing-and-social-engineeri – Graham Asher Jun 07 '18 at 18:02

1 Answer

It turns out these lookalike characters are called homoglyphs, so what we're trying to protect against is a homoglyph/homograph attack.

I have found this library, Homoglyph Detection, to be a good starting point for a solution; it provides good mappings, though they are too incomplete to really stop spam.

It would be nice if such Unicode-to-Latin mapping files were shared and completed by the community.
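To illustrate the mapping approach, here is a minimal sketch of a table-driven fold in plain Java. It does not use the Homoglyph Detection library's API; the class name `HomoglyphFolder`, the method `fold`, and the tiny hand-written table are all hypothetical and cover only the two sample strings from the question. A real solution would need to load a far larger mapping file.

```java
import java.util.HashMap;
import java.util.Map;

public class HomoglyphFolder {
    // Hypothetical, hand-written table: Unicode code point -> ASCII lookalike.
    // It only covers the characters in the question's two samples.
    private static final Map<Integer, String> TO_ASCII = new HashMap<>();
    static {
        TO_ASCII.put(0x0280, "R"); // ʀ LATIN LETTER SMALL CAPITAL R
        TO_ASCII.put(0x1D07, "E"); // ᴇ LATIN LETTER SMALL CAPITAL E
        TO_ASCII.put(0x0262, "G"); // ɢ LATIN LETTER SMALL CAPITAL G
        TO_ASCII.put(0x026A, "I"); // ɪ LATIN LETTER SMALL CAPITAL I
        TO_ASCII.put(0xA731, "S"); // ꜱ LATIN LETTER SMALL CAPITAL S
        TO_ASCII.put(0x1D1B, "T"); // ᴛ LATIN LETTER SMALL CAPITAL T
        TO_ASCII.put(0x1D00, "A"); // ᴀ LATIN LETTER SMALL CAPITAL A
        TO_ASCII.put(0x1D0F, "O"); // ᴏ LATIN LETTER SMALL CAPITAL O
        TO_ASCII.put(0x0274, "N"); // ɴ LATIN LETTER SMALL CAPITAL N
        TO_ASCII.put(0x03C3, "o"); // σ GREEK SMALL LETTER SIGMA
        TO_ASCII.put(0x0454, "e"); // є CYRILLIC SMALL LETTER UKRAINIAN IE
    }

    // Replaces every mapped code point with its ASCII lookalike and
    // passes unmapped code points through unchanged.
    public static String fold(String input) {
        StringBuilder out = new StringBuilder(input.length());
        input.codePoints().forEach(cp -> {
            String mapped = TO_ASCII.get(cp);
            if (mapped != null) {
                out.append(mapped);
            } else {
                out.appendCodePoint(cp);
            }
        });
        return out.toString();
    }
}
```

With this table, `HomoglyphFolder.fold("ʀᴇɢɪꜱᴛʀᴀᴛɪᴏɴ")` returns `"REGISTRATION"` and `HomoglyphFolder.fold("cσdє")` returns `"code"`. Iterating by code point rather than by `char` keeps the approach usable for lookalikes outside the Basic Multilingual Plane, should the table grow to include them.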

abdelrahman-sinno