How to convert a utf8 string to ISO-8859-1 in golang

I have tried searching, but I can only find conversions in the other direction, and the few solutions I found didn't work.

I need to convert strings with special Danish characters like æ, ø and å.

Ã¸ => ø etc.

clarkk

1 Answer

Keep in mind that ISO-8859-1 only supports a tiny subset of characters compared to Unicode. If you know for certain that your UTF-8 encoded string only contains characters covered by ISO-8859-1, you can use the following code.

package main

import (
    "fmt"

    "golang.org/x/text/encoding/charmap"
)

func main() {
    str := "Räv"

    encoder := charmap.ISO8859_1.NewEncoder()
    out, err := encoder.Bytes([]byte(str))
    if err != nil {
        panic(err)
    }

    fmt.Printf("%x\n", out)
}

The above prints:

52e476

So 0x52, 0xE4, 0x76, which looks correct as per https://en.wikipedia.org/wiki/ISO/IEC_8859-1 - in particular the second character is of note, since it would be encoded as 0xC3, 0xA4 in UTF-8.
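To make the difference between the two encodings visible without any third-party package, here is a minimal sketch that prints both byte sequences side by side. It assumes only that ISO-8859-1 byte values coincide with the first 256 Unicode code points; `latin1Bytes` is a hypothetical helper written for this illustration and is not part of `golang.org/x/text`.

```go
package main

import "fmt"

// latin1Bytes is a hypothetical helper (not from x/text). It relies on
// ISO-8859-1 byte values being identical to Unicode code points U+0000..U+00FF.
func latin1Bytes(s string) ([]byte, error) {
	out := make([]byte, 0, len(s))
	for _, r := range s {
		// Invalid UTF-8 decodes to U+FFFD, which also fails this check.
		if r > 0xFF {
			return nil, fmt.Errorf("rune %q (U+%04X) is not in ISO-8859-1", r, r)
		}
		out = append(out, byte(r))
	}
	return out, nil
}

func main() {
	str := "R\u00e4v" // "Räv", written with an escape to avoid source-encoding surprises

	fmt.Printf("UTF-8:      % x\n", []byte(str)) // 52 c3 a4 76

	out, err := latin1Bytes(str)
	if err != nil {
		panic(err)
	}
	fmt.Printf("ISO-8859-1: % x\n", out) // 52 e4 76
}
```

For real code the `charmap` encoder is probably still the better choice, since the same pattern then works for other single-byte encodings as well.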

If the string contains characters that aren't supported, e.g. we change str to be "Räv🦊v", then an error is returned by encoder.Bytes([]byte(str)):

panic: encoding: rune not supported by encoding.

goroutine 1 [running]:
main.main()
/Users/nj/Dev/scratch/main.go:15 +0x109

If you wish to address that by accepting loss of unconvertible characters, a simple solution might be to leverage EncodeRune, which returns a boolean to indicate if the rune is in the charmap's repertoire.

package main

import (
    "fmt"

    "golang.org/x/text/encoding/charmap"
)

func main() {
    str := "Räv🦊v"
    out := make([]byte, 0)

    for _, r := range str {
        if e, ok := charmap.ISO8859_1.EncodeRune(r); ok {
            out = append(out, e)
        }
    }

    fmt.Printf("%x\n", out)
}

The above prints:

52e47676

i.e. the emoji has been stripped.
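If silently dropping characters is too risky, one possible middle ground is to substitute a visible placeholder and report how many runes were replaced, so the caller can decide whether the result is acceptable. The sketch below is only an illustration under that assumption: `latin1Lossy` and the choice of `?` as the substitute are hypothetical, and it uses the standard library alone rather than `charmap`.

```go
package main

import "fmt"

// latin1Lossy is a hypothetical helper: it converts s to ISO-8859-1 and
// replaces every rune outside U+0000..U+00FF with sub, counting replacements.
func latin1Lossy(s string, sub byte) ([]byte, int) {
	out := make([]byte, 0, len(s))
	replaced := 0
	for _, r := range s {
		if r <= 0xFF {
			out = append(out, byte(r))
			continue
		}
		out = append(out, sub)
		replaced++
	}
	return out, replaced
}

func main() {
	// "Räv", a fox emoji (U+1F98A), then "v".
	out, n := latin1Lossy("R\u00e4v\U0001F98Av", '?')
	fmt.Printf("%x (%d replaced)\n", out, n) // 52e4763f76 (1 replaced)
}
```

A non-zero count can then be logged or turned into an error, which keeps the lossy behaviour an explicit decision rather than a silent default.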

nj_
  • Stripping of unsupported characters is a horrible "solution" which produces complete garbage in many situations. Please don't do that unless you have a good understanding and complete control of the entire data pipeline; or perhaps expose the option to the end user, but default to raising an error rather than outputting corrupted results. – tripleee Oct 22 '22 at 12:36
  • The very nature of the question suggests a likely scenario whereby this is going to fail. As I said, `ISO-8859-1` only covers a tiny subset of characters catered for by Unicode. The very nature of converting from `UTF-8` to `ISO-8859-1` either implies loss or failure. OP didn't indicate the scenario where they are making this conversion, so I've provided both options - either attempt and error, or accept that certain characters cannot be converted. – nj_ Oct 22 '22 at 13:17
  • @nj_ how do I install `golang.org/x/text/encoding/charmap`? I'm fairly new to golang.. It will not fail because it's for pure text/file names etc – clarkk Oct 22 '22 at 15:03
  • ahh.. haha.. instructions in the compiler error :) – clarkk Oct 22 '22 at 15:07
  • your solution is stripping valid danish chars like `ø` which in utf8 is `Ã¸` – clarkk Oct 22 '22 at 15:16
  • If I set `str` to `"ø"` then the supplied code will output `f8` - which is the correct way to encode `ø` in `ISO-8859-1`. Refer to the table under https://en.wikipedia.org/wiki/ISO/IEC_8859-1#Code_page_layout – nj_ Oct 22 '22 at 15:25
  • I don't disagree, I just wished the answer was more explicit about the drawbacks. It seems, every minute there's a new visitor to Stack Overflow who thinks ruining their output is better than explicit failure, and then we are left to clean up the mess, sometimes years later. Good job security for those with the patience and the stamina, but frustrating for users and bad for efficiency and ultimately the planet. – tripleee Oct 22 '22 at 16:00
  • @tripleee ok?? then tell me how to add filenames to a tar.gz with utf8 chars? ;) – clarkk Oct 23 '22 at 15:47
  • That depends on the underlying file system's support. The tar file format itself accepts arbitrary bytes in file names, with the exception of slashes (which are reserved as directory separators) and null bytes. – tripleee Oct 23 '22 at 16:52
  • @tripleee then try reading this... https://superuser.com/questions/60379/how-can-i-create-a-zip-tgz-in-linux-such-that-windows-has-proper-filenames/60591#60591 Just to sum it up: `Basically it's a horrible mess and if you can avoid distributing archives containing filenames with non-ASCII characters you'll be much better off.` – clarkk Oct 24 '22 at 08:37
  • That doesn't contradict what I said. If you have a problem, it's because your file system doesn't know what to do with UTF-8. I can't help but think switching to Latin-1 is objectively worse in every possible way (though you'd need to use an obscure Windows legacy code page like 437 to really reach the inner circles of the inferno if that's your actual goal). To actually meaningfully solve the problem in this situation, you'd have to convert to pure 7-bit ASCII instead, perhaps then with some sort of escaping convention to represent nonprintable characters losslessly. – tripleee Oct 24 '22 at 09:23