I'm trying to map UTF-8 characters to their "similar" ISO-8859-1 representation: removing diacritics, but also replacing characters like Ł with L or ı with i.

Example: José Kakışır should become Jose Kakisir.

I'm aware that removing diacritics can be done this way:

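// (From https://blog.golang.org/normalization#TOC_10.)
package main

import (
    "fmt"
    "unicode"

    "golang.org/x/text/transform"
    "golang.org/x/text/unicode/norm"
)

func main() {
    // isMn reports whether r is a nonspacing mark (Unicode category Mn).
    isMn := func(r rune) bool {
        return unicode.Is(unicode.Mn, r)
    }
    // Decompose (NFD), drop the combining marks, then recompose (NFC).
    // transform.RemoveFunc is deprecated; runes.Remove(runes.In(unicode.Mn)) is the current equivalent.
    t := transform.Chain(norm.NFD, transform.RemoveFunc(isMn), norm.NFC)
    result, _, err := transform.String(t, "José Kakışır")
    if err != nil {
        panic(err)
    }
    fmt.Println(result)
}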

Which prints out Jose Kakısır: ş is replaced with s, but ı is not replaced with i.

What's the best way to achieve that in Go?

derFunk

2 Answers

There are two ideas from the Unicode spec that might be used to identify "similar" characters.

The first is the decompositions of characters into a base character + a combining mark. Your code takes advantage of this: doing the decomposition and then removing the combining mark, leaving the base character.

But unfortunately the "i" character for some reason does not decompose into a dotless "ı" plus a combining dot (if anybody understands why this decision was made, please comment!). This fact is also discussed here: Why do LATIN SMALL LETTER DOTLESS I, COMBINING DOT ABOVE not get normalized to "i" in NFC form?

The second is the mapping of characters to "confusable" characters as defined in Unicode TR39. For example, you will find the following line in http://www.unicode.org/Public/security/latest/confusables.txt

0131 ; 0069 ; MA # ( ı → i ) LATIN SMALL LETTER DOTLESS I → LATIN SMALL LETTER I #

This mapping exists to identify strings that could be "confused" for other strings for security purposes (e.g. spoofing domains). It allows you to convert a string to its "skeleton": two strings with the same skeleton are potentially visibly confusable. For example, the skeleton of "𝔭𝒶ỿ𝕡𝕒ℓ" is "paypal", and the skeleton of "José Kakışır" is "José Kakișir". You could try this for your purposes, but this is not recommended per the spec:

A skeleton is intended only for internal use for testing confusability of strings; the resulting text is not suitable for display to users, because it will appear to be a hodgepodge of different scripts. In particular, the result of mapping an identifier will not necessary be an identifier. Thus the confusability mappings can be used to test whether two identifiers are confusable (if their skeletons are the same), but should definitely not be used as a "normalization" of identifiers.

If you do choose to try this, here is a Go package: https://github.com/mtibben/confusables
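If you would rather look at the raw data than pull in a package, a line of confusables.txt like the one quoted above can be parsed with the standard library alone. This is only a sketch: `parseConfusableLine` is a name I made up, and it assumes the `source ; target ; type # comment` layout shown in that line, where the target may be several space-separated code points.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseConfusableLine (hypothetical helper) parses one data line of
// confusables.txt, assuming the "source ; target ; type # comment" layout.
// It returns the source rune and the target string it maps to.
func parseConfusableLine(line string) (rune, string, error) {
	// Drop the trailing "# ..." comment, if any.
	if i := strings.IndexByte(line, '#'); i >= 0 {
		line = line[:i]
	}
	fields := strings.Split(line, ";")
	if len(fields) < 2 {
		return 0, "", fmt.Errorf("malformed line: %q", line)
	}
	// The source is a single hexadecimal code point.
	src, err := strconv.ParseUint(strings.TrimSpace(fields[0]), 16, 32)
	if err != nil {
		return 0, "", err
	}
	// The target is one or more hexadecimal code points.
	var target strings.Builder
	for _, hex := range strings.Fields(fields[1]) {
		cp, err := strconv.ParseUint(hex, 16, 32)
		if err != nil {
			return 0, "", err
		}
		target.WriteRune(rune(cp))
	}
	return rune(src), target.String(), nil
}

func main() {
	src, target, err := parseConfusableLine("0131 ; 0069 ; MA # ( ı → i )")
	if err != nil {
		panic(err)
	}
	fmt.Printf("%q -> %q\n", src, target) // prints: 'ı' -> "i"
}
```

The caveat from the spec still applies: these mappings are meant for confusability tests, so picking entries from them for display text is a judgment call.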

Another option is a custom mapping of characters to logically similar characters suitable for your application, based on some knowledgeable person's judgment about "similarity". I am not aware of any such mappings. Depending on your application you might try to do this manually.

Also note: "é" and many other accented characters are supported by the ISO-8859-1 character set, so removing the accent is not necessary. Whatever you end up implementing, your code should first determine whether the rune is supported by the encoding before attempting to map it to a similar character.
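A minimal sketch of that last approach, using only the standard library: keep every rune that ISO-8859-1 can represent, and consult a hand-written table only for the rest. The table contents, the `'?'` fallback, and the names `similar`, `inLatin1` and `foldToLatin1` are all my own assumptions, not an established mapping. It also assumes the input is already in composed (NFC) form.

```go
package main

import (
	"fmt"
	"strings"
)

// similar is a hypothetical, hand-picked table of look-alike replacements.
// It is not taken from any standard; extend it for your own data.
var similar = map[rune]string{
	'ı': "i", 'İ': "I",
	'Ł': "L", 'ł': "l",
	'ş': "s", 'Ş': "S",
	'ğ': "g", 'Ğ': "G",
}

// inLatin1 reports whether r is encodable in ISO-8859-1, whose 256 code
// points coincide with U+0000..U+00FF.
func inLatin1(r rune) bool {
	return r >= 0 && r <= 0xFF
}

// foldToLatin1 (hypothetical name) keeps runes that ISO-8859-1 supports,
// replaces the others via the table, and falls back to '?' otherwise.
// The result is still a UTF-8 Go string; encoding to bytes is a separate step.
func foldToLatin1(s string) string {
	var b strings.Builder
	for _, r := range s {
		if inLatin1(r) {
			b.WriteRune(r)
		} else if repl, ok := similar[r]; ok {
			b.WriteString(repl)
		} else {
			b.WriteByte('?')
		}
	}
	return b.String()
}

func main() {
	fmt.Println(foldToLatin1("José Kakışır")) // prints: José Kakisir
}
```

Note that "é" survives here because the encoding supports it. If you want plain "Jose" as in the question, run the diacritic-stripping transform first and apply the table afterwards.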

Jonathan Warden

I believe the charmap package does what you want with a charmap.ISO8859_1.NewEncoder()

Edit: never mind, that will barf on unsupported runes. Sorry. It may be worth looking into this package some more though.

Ultimately, it feels like you will need to find (or create) a mapping from UTF-8 to ISO-8859-1. I don't think you'll find a "standard" one out there though; the mapping is too arbitrary.

Marc
  • Thanks, yes, this will barf with `encoding: rune not supported by encoding.`. For the mapping I'm thinking there must be something like a transliteration map, but I haven't had any success finding one so far. The problem is that there are so many languages which would need to be properly transliterated, but I'd be okay with a simple subset of common "European" characters already. – derFunk Dec 05 '17 at 19:07
  • even iconv is unhappy: `echo "José Kakısır" | iconv -f UTF8 -t ISO_8859-1` fails with: `Jos� Kakiconv: illegal input sequence at position 9` – Marc Dec 05 '17 at 19:13
  • And so am I - :) – derFunk Dec 05 '17 at 19:39
  • `iconv` has a `-c` command-line option which should make it happy ;-) – kostix Dec 06 '17 at 07:24
  • I think one may try to wrap an encoder for `ISO-8859-1` providing a custom implementation for the [`encoding.Encoder.ReplaceUnsupported` function](https://godoc.org/golang.org/x/text/encoding#ReplaceUnsupported). – kostix Dec 06 '17 at 07:26
  • That's not the type of happy I want to achieve: `-c When this option is given, characters that cannot be converted are silently discarded, instead of leading to a conversion error.` :-) – derFunk Dec 06 '17 at 13:40