Remove invalid UTF-8 characters from a string

Question

I get this on json.Marshal of a list of strings:

json: invalid UTF-8 in string: "...ole\xc5\"

The reason is obvious, but how can I delete or replace such characters in Go? I've been reading the docs on the unicode and unicode/utf8 packages, and there seems to be no obvious or quick way to do it.

In Python, for example, there are methods for this: invalid characters can be deleted, replaced by a specified character, or handled by a strict setting that raises an exception on invalid characters. How can I do the equivalent in Go?

UPDATE: By "the reason" I meant the reason for getting an exception (panic?): an illegal character in what json.Marshal expects to be a valid UTF-8 string.

(How the illegal byte sequence got into that string is not important; it got there in one of the usual ways: bugs, file corruption, other programs that do not conform to Unicode, etc.)

How is the reason obvious? I'd guess you have a latin1 (or some other variant of ISO8859) string there in which case you don't want a function to swallow these characters but instead convert them to UTF-8 before continuing ... — filmor, Dec 05 '13 at 14:28
In Go 1.2, the json parser will accept malformed UTF-8. It will replace malformed bytes with a replacement glyph. — fuz, Dec 07 '13 at 19:02

Inanc Gumus · Answer 1 · 2020-06-02T15:16:07.667

26

In Go 1.13+, you can do this:

strings.ToValidUTF8("a\xc5z", "")

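In Go 1.11+, it's also very easy to do the same using strings.Map and utf8.RuneError. Note that this variant also removes any correctly encoded U+FFFD replacement characters, because strings.Map cannot tell them apart from decoding errors: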

fixUtf := func(r rune) rune {
    if r == utf8.RuneError {
        return -1
    }
    return r
}

fmt.Println(strings.Map(fixUtf, "a\xc5z"))
fmt.Println(strings.Map(fixUtf, "posic�o"))

Output:

az
posico
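
For comparison with the three Python modes mentioned in the question (ignore, replace, strict), here is a minimal sketch, assuming Go 1.13+. The helper names dropInvalid, replaceInvalid and strict, and the errInvalidUTF8 value, are hypothetical and not part of the standard library; only strings.ToValidUTF8 and utf8.ValidString are.

```go
package main

import (
    "errors"
    "fmt"
    "strings"
    "unicode/utf8"
)

// errInvalidUTF8 is a hypothetical sentinel error used by the strict mode.
var errInvalidUTF8 = errors.New("invalid UTF-8")

// dropInvalid removes every run of invalid bytes.
func dropInvalid(s string) string {
    return strings.ToValidUTF8(s, "")
}

// replaceInvalid substitutes repl for every run of invalid bytes.
func replaceInvalid(s, repl string) string {
    return strings.ToValidUTF8(s, repl)
}

// strict leaves the string alone and reports an error if it is not valid UTF-8.
func strict(s string) (string, error) {
    if !utf8.ValidString(s) {
        return "", errInvalidUTF8
    }
    return s, nil
}

func main() {
    s := "...ole\xc5"
    fmt.Printf("%q\n", dropInvalid(s))         // "...ole"
    fmt.Printf("%q\n", replaceInvalid(s, "?")) // "...ole?"
    _, err := strict(s)
    fmt.Println(err) // invalid UTF-8
}
```

Unlike the strings.Map approach, strings.ToValidUTF8 leaves a correctly encoded U+FFFD in place and only touches bytes that fail to decode.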

edited Jun 02 '20 at 15:16

answered Oct 12 '18 at 17:56


FYI, `strings.ToValidUTF8` didn't make it into Go 1.12, but looks like it is planned for Go 1.13: https://github.com/golang/go/issues/25805 – Jerry Clinesmith Jul 08 '19 at 19:26
NB: `strings.ToValidUTF8` doesn't touch the `\x00` char, so it won't help with, e.g., `invalid byte sequence for encoding "UTF8": 0x00` from Postgres. You need to handle that case explicitly: `strings.ToValidUTF8(strings.ReplaceAll(s, "\x00", ""), "")` – ejoubaud Feb 14 '22 at 07:34

peterSO · Accepted Answer · 2013-12-08T06:31:58.873

23

For example,

package main

import (
    "fmt"
    "unicode/utf8"
)

func main() {
    s := "a\xc5z"
    fmt.Printf("%q\n", s)
    if !utf8.ValidString(s) {
        v := make([]rune, 0, len(s))
        for i, r := range s {
            if r == utf8.RuneError {
                _, size := utf8.DecodeRuneInString(s[i:])
                if size == 1 {
                    continue
                }
            }
            v = append(v, r)
        }
        s = string(v)
    }
    fmt.Printf("%q\n", s)
}

Output:

"a\xc5z"
"az"

Unicode Standard

FAQ - UTF-8, UTF-16, UTF-32 & BOM

Q: Are there any byte sequences that are not generated by a UTF? How should I interpret them?

A: None of the UTFs can generate every arbitrary byte sequence. For example, in UTF-8 every byte of the form 110xxxxx₂ must be followed with a byte of the form 10xxxxxx₂. A sequence such as <110xxxxx₂ 0xxxxxxx₂> is illegal, and must never be generated. When faced with this illegal byte sequence while transforming or interpreting, a UTF-8 conformant process must treat the first byte 110xxxxx₂ as an illegal termination error: for example, either signaling an error, filtering the byte out, or representing the byte with a marker such as FFFD (REPLACEMENT CHARACTER). In the latter two cases, it will continue processing at the second byte 0xxxxxxx₂.

A conformant process must not interpret illegal or ill-formed byte sequences as characters, however, it may take error recovery actions. No conformant process may use irregular byte sequences to encode out-of-band information.
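
The loop above, which takes the "filtering the byte out" option from the FAQ, could be wrapped in a reusable function. This is only a sketch: stripInvalid is a hypothetical name, and it builds the result with strings.Builder instead of a []rune, which is a design choice of this example rather than something taken from the answer.

```go
package main

import (
    "fmt"
    "strings"
    "unicode/utf8"
)

// stripInvalid (hypothetical helper) returns s with every byte that does not
// decode as UTF-8 removed. A correctly encoded U+FFFD is kept.
func stripInvalid(s string) string {
    if utf8.ValidString(s) {
        return s // nothing to do, avoid allocating
    }
    var b strings.Builder
    b.Grow(len(s))
    for i := 0; i < len(s); {
        r, size := utf8.DecodeRuneInString(s[i:])
        if r == utf8.RuneError && size == 1 {
            i++ // a decoding error consumes exactly one byte; skip it
            continue
        }
        b.WriteString(s[i : i+size]) // copy the valid encoding unchanged
        i += size
    }
    return b.String()
}

func main() {
    fmt.Printf("%q\n", stripInvalid("a\xc5z")) // "az"
    fmt.Printf("%+q\n", stripInvalid("a\xc5\uFFFDz"))
}
```

The size == 1 check is what separates a decoding error from a real U+FFFD in the input, whose encoding is three bytes long.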

edited Dec 08 '13 at 06:31

answered Dec 05 '13 at 14:56


While most likely completely irrelevant, your example might strip out completely correct encoded *Unicode Replacement Characters* (`"\xef\xbf\xbd"`) if the string also contains broken UTF8 sequences. – ANisus Dec 05 '13 at 15:14
@ANisus: The assumption is that people have read the Unicode Standard. – peterSO Dec 05 '13 at 15:24
My comment was just meant as trivia. My function would also strip away the replacement characters together with the illegal sequences (it is after all my +1 ;) ). I just said that the legal byte sequence of "\xef\xbf\xbd", which json.Marshal will accept, will also be stripped away. Not sure how the Unicode Standard would disagree with that. – ANisus Dec 05 '13 at 15:35
@ANisus: If you want to, you can keep any replacement characters. See my revised answer. – peterSO Dec 08 '13 at 06:41
@peterSO question: In the above context, does `unicode.ReplacementChar` bear a similarity to `utf8.RuneError` ? – Roy Lee Feb 05 '16 at 03:30
@Roylee: Same thing, different names: [`unicode.ReplacementChar = '\uFFFD'`](https://golang.org/pkg/unicode/#pkg-constants) and [`utf8.RuneError = '\uFFFD'`](https://golang.org/pkg/unicode/utf8/#pkg-constants). – peterSO Feb 05 '16 at 03:35

score 1 · Answer 3 · answered Oct 05 '21 at 23:48

Another way to do this, according to this answer, could be

s = string([]rune(s))

Example:

package main

import (
    "fmt"
    "unicode/utf8"
)

func main() {
    s := "...ole\xc5"
    fmt.Println(s, utf8.ValidString(s))
    // Output: ...ole� false

    s = string([]rune(s))
    fmt.Println(s, utf8.ValidString(s))
    // Output: ...ole� true
}

The result may not look "pretty", because each invalid byte is replaced with the Unicode replacement character (U+FFFD) rather than removed, but the string is now valid UTF-8.
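
To make the difference between replacing and removing concrete, here is a small sketch that compares the rune-slice conversion with strings.ToValidUTF8 (Go 1.13+). The function names replaceWithFFFD and removeInvalid are hypothetical wrappers added for illustration.

```go
package main

import (
    "fmt"
    "strings"
)

// replaceWithFFFD (hypothetical name) round-trips through []rune, which turns
// each invalid byte into one U+FFFD replacement character.
func replaceWithFFFD(s string) string {
    return string([]rune(s))
}

// removeInvalid (hypothetical name) drops invalid bytes instead.
func removeInvalid(s string) string {
    return strings.ToValidUTF8(s, "")
}

func main() {
    s := "a\xc5\xc5z" // two invalid bytes in a row

    // %+q escapes non-ASCII so the replacement characters are visible.
    fmt.Printf("%+q\n", replaceWithFFFD(s)) // "a\ufffd\ufffdz"
    fmt.Printf("%+q\n", removeInvalid(s))   // "az"
}
```

Which one fits depends on whether the consumer should see that something was lost (replacement) or get a shorter clean string (removal).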
