How to detect when bytes can't be converted to string in Go?

Question

There are invalid byte sequences that can't be converted to Unicode strings. How do I detect that when converting []byte to string in Go?

If your encoding is UTF-8: [`unicode/utf8.Valid`](https://godoc.org/unicode/utf8#Valid) — , Jan 18 '16 at 18:26
Read http://blog.golang.org/strings -- the `string` conversion is just a byte-for-byte copy, and you'll run into the invalid UTF-8 when you `range` over the code points (or convert to a `[]rune`) and get 0xFFFD, the Unicode replacement character (spec'd [here](https://golang.org/ref/spec#For_statements)). None of that crashes, so you only need to check validity if it's independently important for your app (or if you plan to do something else that _will_ crash on bad utf8). — twotwotwo, Jan 18 '16 at 18:44
Here's a sample with invalid UTF-8: http://play.golang.org/p/6yHonH0Mae — twotwotwo, Jan 18 '16 at 18:47

twotwotwo · Accepted Answer · 2021-10-06T19:38:32.127

You can, as Tim Cooper noted, test UTF-8 validity with utf8.Valid.
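As a minimal sketch of that check, assuming you want the conversion to fail on bad input: the helper `bytesToString` and the sentinel `errInvalidUTF8` below are names made up for illustration; only `utf8.Valid` comes from the standard library.

```go
package main

import (
	"errors"
	"fmt"
	"unicode/utf8"
)

// errInvalidUTF8 is a hypothetical sentinel error for this sketch.
var errInvalidUTF8 = errors.New("input is not valid UTF-8")

// bytesToString (hypothetical helper) converts b to a string only if b is
// entirely valid UTF-8, and returns an error otherwise.
func bytesToString(b []byte) (string, error) {
	if !utf8.Valid(b) {
		return "", errInvalidUTF8
	}
	return string(b), nil
}

func main() {
	inputs := [][]byte{[]byte("héllo"), {0xff, 'a'}}
	for _, in := range inputs {
		s, err := bytesToString(in)
		if err != nil {
			fmt.Printf("%v: %v\n", in, err)
			continue
		}
		fmt.Printf("%q\n", s)
	}
}
```

If you already hold a string rather than a []byte, `utf8.ValidString` performs the same check without a conversion.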

But! You might be thinking that converting non-UTF-8 bytes to a Go string is impossible. In fact, "In Go, a string is in effect a read-only slice of bytes"; it can contain bytes that aren't valid UTF-8 which you can print, access via indexing, pass to WriteString methods, or even round-trip back to a []byte (to Write, say).
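To make the round trip concrete, here is a small sketch; `roundTrips` is a hypothetical helper written for this example, not a library function. It shows that the invalid bytes survive conversion in both directions and that indexing returns the raw byte.

```go
package main

import (
	"bytes"
	"fmt"
)

// roundTrips (hypothetical helper) reports whether b is unchanged after
// being converted to a string and back to a []byte.
func roundTrips(b []byte) bool {
	s := string(b)
	return bytes.Equal([]byte(s), b)
}

func main() {
	raw := []byte{'o', 'k', 0xff, 0xfe} // the last two bytes are not valid UTF-8
	s := string(raw)

	fmt.Println(len(s))          // len counts bytes, not runes: 4
	fmt.Printf("%x\n", s[2])     // indexing yields the raw byte: ff
	fmt.Println(roundTrips(raw)) // true
}
```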

There are two places in the language where Go does UTF-8 decoding of strings for you:

- when you do `for i, r := range s`, the `r` is a Unicode code point as a value of type `rune`
- when you do the conversion `[]rune(s)`, Go decodes the whole string to runes

(Note that rune is an alias for int32, not a completely different type.)

In both these instances, each invalid UTF-8 byte is replaced with U+FFFD, the replacement character reserved for uses like this. More is in the spec sections on for statements and conversions between strings and other types. These conversions never crash, so you only need to actively check for UTF-8 validity if it's relevant to your application, for example if you can't accept the U+FFFD replacement and need to return an error on mis-encoded input.

Since that behavior's baked into the language, you can expect it from libraries, too. U+FFFD is available as the constant utf8.RuneError, and the decoding functions in unicode/utf8 return it when they hit an invalid encoding.
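If you need to know where the bad bytes are, or to tell a genuine U+FFFD in the input apart from a decoding failure, one possible approach is to decode manually. This is only a sketch: `firstInvalid` is a hypothetical helper, and it relies on the documented behavior that `utf8.DecodeRuneInString` returns `utf8.RuneError` with a width of 1 for an invalid encoding, whereas a real U+FFFD decodes with a width of 3.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// firstInvalid (hypothetical helper) returns the byte offset of the first
// invalid UTF-8 sequence in s, or -1 if s is entirely valid.
func firstInvalid(s string) int {
	for i := 0; i < len(s); {
		r, size := utf8.DecodeRuneInString(s[i:])
		// RuneError with size 1 means the bytes at i are not valid UTF-8.
		// A correctly encoded U+FFFD also yields RuneError, but with size 3.
		if r == utf8.RuneError && size == 1 {
			return i
		}
		i += size
	}
	return -1
}

func main() {
	fmt.Println(firstInvalid("ok\ufffd"))   // a real U+FFFD is valid: -1
	fmt.Println(firstInvalid("ok\xffmore")) // bad byte at offset 2
}
```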

Here's a sample program showing what Go does with a []byte holding invalid UTF-8:

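package main

import "fmt"

func main() {
    // 0xff can never appear in valid UTF-8.
    a := []byte{0xff}
    // The conversion copies the bytes as-is; nothing is validated or replaced.
    s := string(a)
    fmt.Println(s)
    // range decodes UTF-8, so the invalid byte comes out as U+FFFD (65533).
    for _, r := range s {
        fmt.Println(r)
    }
    // Converting to []rune decodes the whole string the same way.
    rs := []rune(s)
    fmt.Println(rs)
}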

Output will look different in different environments, but in the Playground it looks like this:

�
65533
[65533]
