There are invalid byte sequences that can't be converted to Unicode strings. How do I detect that when converting `[]byte` to `string` in Go?
- If your encoding is UTF-8: [`unicode/utf8.Valid`](https://godoc.org/unicode/utf8#Valid) – Jan 18 '16 at 18:26
- Read http://blog.golang.org/strings -- the `string` conversion is just a byte-for-byte copy, and you'll run into the invalid UTF-8 when you `range` over the code points (or convert to a `[]rune`) and get 0xFFFD, the Unicode replacement character (spec'd [here](https://golang.org/ref/spec#For_statements)). None of that crashes, so you only need to check validity if it's independently important for your app (or if you plan to do something else that _will_ crash on bad UTF-8). – twotwotwo Jan 18 '16 at 18:44
- Here's a sample with invalid UTF-8: http://play.golang.org/p/6yHonH0Mae – twotwotwo Jan 18 '16 at 18:47
1 Answer
You can, as Tim Cooper noted, test UTF-8 validity with `utf8.Valid`.
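As a rough sketch of that check, the conversion can be wrapped in a function that refuses invalid input. The names `toValidString` and `errInvalidUTF8` are made up for this example; only `utf8.Valid` comes from the standard library.

```go
package main

import (
	"errors"
	"fmt"
	"unicode/utf8"
)

// errInvalidUTF8 is a hypothetical sentinel error used only in this sketch.
var errInvalidUTF8 = errors.New("input is not valid UTF-8")

// toValidString converts b to a string only if b is well-formed UTF-8.
func toValidString(b []byte) (string, error) {
	if !utf8.Valid(b) {
		return "", errInvalidUTF8
	}
	return string(b), nil
}

func main() {
	inputs := [][]byte{[]byte("héllo"), {0xff, 'a'}}
	for _, in := range inputs {
		s, err := toValidString(in)
		fmt.Printf("%q %v\n", s, err)
	}
}
```

If the data is already a `string`, `utf8.ValidString` does the same check without a conversion.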
But! You might be thinking that converting non-UTF-8 bytes to a Go `string` is impossible. In fact, "In Go, a string is in effect a read-only slice of bytes"; it can contain bytes that aren't valid UTF-8, which you can print, access via indexing, pass to `WriteString` methods, or even round-trip back to a `[]byte` (to `Write`, say).
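A small sketch of that round trip, assuming nothing beyond the standard library (`roundTrip` is a hypothetical helper name): the invalid byte survives the conversion in both directions, and indexing returns it unchanged.

```go
package main

import (
	"bytes"
	"fmt"
)

// roundTrip converts b to a string and back to a fresh []byte.
// Neither conversion decodes or rewrites the bytes.
func roundTrip(b []byte) []byte {
	s := string(b)
	return []byte(s)
}

func main() {
	in := []byte{'a', 0xff, 'b'}
	s := string(in)

	fmt.Println(len(s))                         // 3: len counts bytes, not code points
	fmt.Printf("%#x\n", s[1])                   // 0xff: indexing yields the raw byte
	fmt.Println(bytes.Equal(in, roundTrip(in))) // true
}
```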
There are two places in the language where Go does do UTF-8 decoding of `string`s for you:
- when you do `for i, r := range s`, the `r` is a Unicode code point as a value of type `rune`
- when you do the conversion `[]rune(s)`, Go decodes the whole string to runes.
(Note that `rune` is an alias for `int32`, not a completely different type.)
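The two decoding sites and the alias can be sketched together as follows. `asInt32s` is a hypothetical helper; it compiles only because `[]rune` and `[]int32` are the same type.

```go
package main

import "fmt"

// asInt32s returns the code points decoded from s. Returning a []rune where
// a []int32 is declared needs no conversion, since rune is an alias for int32.
func asInt32s(s string) []int32 {
	return []rune(s)
}

func main() {
	s := "a\xffb" // the middle byte is not valid UTF-8

	// range decodes as it goes; i is the byte offset where each rune starts.
	for i, r := range s {
		fmt.Printf("%d %U\n", i, r)
	}
	// Prints:
	// 0 U+0061
	// 1 U+FFFD
	// 2 U+0062

	fmt.Println(asInt32s(s)) // [97 65533 98]
}
```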
In both these instances invalid UTF-8 is replaced with `U+FFFD`, the replacement character reserved for uses like this. More is in the spec sections on `for` statements and conversions between `string`s and other types. These conversions never crash, so you only need to actively check for UTF-8 validity if it's relevant to your application, like if you can't accept the `U+FFFD` replacement and need to throw an error on mis-encoded input.
Since that behavior's baked into the language, you can expect it from libraries, too. `U+FFFD` is `utf8.RuneError`, and functions in `utf8` return it when they hit invalid input.
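One way this might be used, as a sketch: `utf8.DecodeRune` reports an invalid encoding as `utf8.RuneError` with a size of 1, whereas a correctly encoded U+FFFD decodes with a size of 3. That difference lets you locate bad bytes without mistaking a genuine replacement character for an error. `firstInvalid` is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// firstInvalid returns the byte offset of the first invalid UTF-8 sequence
// in b, or -1 if b is entirely valid.
func firstInvalid(b []byte) int {
	for i := 0; i < len(b); {
		r, size := utf8.DecodeRune(b[i:])
		// RuneError with size 1 signals a decoding failure; a real U+FFFD
		// in the input decodes with size 3 and is skipped like any rune.
		if r == utf8.RuneError && size == 1 {
			return i
		}
		i += size
	}
	return -1
}

func main() {
	fmt.Println(firstInvalid([]byte("ok")))           // -1
	fmt.Println(firstInvalid([]byte("a\uFFFDb")))     // -1
	fmt.Println(firstInvalid([]byte{'a', 'b', 0xff})) // 2
}
```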
Here's a sample program showing what Go does with a `[]byte` holding invalid UTF-8:
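
    package main

    import "fmt"

    func main() {
    	// 0xff never appears in well-formed UTF-8.
    	a := []byte{0xff}

    	// The conversion copies the bytes as-is; nothing is validated.
    	s := string(a)
    	fmt.Println(s)

    	// range decodes UTF-8 and yields U+FFFD for the invalid byte.
    	for _, r := range s {
    		fmt.Println(r)
    	}

    	// []rune(s) decodes the whole string the same way.
    	rs := []rune(s)
    	fmt.Println(rs)
    }
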
Output will look different in different environments, but in the Playground it looks like:

    �
    65533
    [65533]
