bytes to string conversion with invalid characters

Question

I need to parse UDP packets which can be invalid or contain some errors. I would like to replace invalid characters with . after a bytes to string conversion, in order to display the content of the packets.

How can I do it? This is my code:

func main() {
   a := []byte{'a', 0xff, 0xaf, 'b', 0xbf}
   s := string(a)
   s = strings.Replace(s, string(0xFFFD), ".", 0)

   fmt.Println("s: ", s) // I would like to display "a..b."
   for _, r := range s {
      fmt.Println("r: ", r)
   }
   rs := []rune(s)
   fmt.Println("rs: ", rs)
}

kostix · Accepted Answer · 2022-01-11T10:24:20.190

The root problem with your approach is that the result of type-converting []byte to string does not contain any U+FFFD: the conversion only copies bytes from the source to the destination, verbatim.
Just like byte slices, strings in Go are not required to contain UTF-8-encoded text; they can hold any data, including opaque binary data that has nothing to do with text.

But some operations on strings (namely type-converting them to []rune and iterating over them using range) do interpret strings as UTF-8-encoded text. That is precisely where you got tripped up: your range debugging loop attempted to interpret the string, and each time decoding a properly encoded code point failed, range yielded the replacement character, U+FFFD.
To reiterate, the string obtained by the type conversion does not contain the characters you wanted your strings.Replace call to replace. (Separately, passing 0 as the last argument of strings.Replace means "perform no replacements"; pass -1 or use strings.ReplaceAll to replace every occurrence.)
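
To make the distinction concrete, here is a minimal sketch using the byte slice from the question. It checks whether the converted string actually stores an encoded U+FFFD and then shows what range reports while decoding. The helper name hasEncodedFFFD is made up for this illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// hasEncodedFFFD reports whether s contains the three bytes EF BF BD,
// i.e. an actual UTF-8-encoded U+FFFD. (Hypothetical helper name.)
func hasEncodedFFFD(s string) bool {
	return strings.Contains(s, "\uFFFD")
}

func main() {
	a := []byte{'a', 0xff, 0xaf, 'b', 0xbf}
	s := string(a)

	fmt.Println(len(s))            // 5: the bytes were copied verbatim
	fmt.Println(hasEncodedFFFD(s)) // false: no U+FFFD is stored in s

	// range decodes s as UTF-8 and yields U+FFFD for each undecodable byte.
	for i, r := range s {
		fmt.Printf("byte offset %d: %U\n", i, r)
	}
}
```

The sketch deliberately uses a substring search: strings.IndexRune (and therefore strings.ContainsRune) is documented to treat utf8.RuneError specially and match any invalid byte sequence, which would hide the point being made here.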

As to how to actually make a valid UTF-8-encoded string out of your data, you might employ a two-step process:

1. Type-convert your byte slice to a string, as you already do.
2. Iterate over that string using any means that interprets it as UTF-8, replacing each U+FFFD that appears during decoding.

Something like this:

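// replaceInvalid returns b as a string, with each invalid UTF-8 byte replaced by '.'.
// Note: a validly encoded U+FFFD in the input is replaced as well.
func replaceInvalid(b []byte) string {
  var sb strings.Builder
  for _, c := range string(b) {
    if c == '\uFFFD' {
      sb.WriteByte('.')
    } else {
      sb.WriteRune(c)
    }
  }
  return sb.String()
}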

A note on performance: since type-converting a []byte to string copies memory—because strings are immutable while slices are not—the first step with type-conversion might be a waste of resources for code dealing with large chunks of data and/or working in tight processing loops.
In this case, it may be worth using the DecodeRune function of the encoding/utf8 package which works on byte slices. An example from its docs can be easily adapted to work with the loop above.
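
A minimal sketch of that adaptation follows, assuming one dot per undecodable byte is the desired output. The function name replaceInvalidBytes is invented for this example; utf8.DecodeRune and utf8.RuneError are the real package API. DecodeRune reports a decoding failure as (utf8.RuneError, 1), so checking the size as well keeps a genuinely encoded U+FFFD (three bytes) intact.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// replaceInvalidBytes decodes b directly, without converting it to a string
// first, and replaces every byte that is not part of valid UTF-8 with '.'.
// (Hypothetical helper name, not part of any library.)
func replaceInvalidBytes(b []byte) string {
	out := make([]byte, 0, len(b))
	for len(b) > 0 {
		r, size := utf8.DecodeRune(b)
		if r == utf8.RuneError && size == 1 {
			// Decoding failed: exactly one byte is consumed.
			out = append(out, '.')
		} else {
			// Valid encoding: copy its bytes unchanged.
			out = append(out, b[:size]...)
		}
		b = b[size:]
	}
	return string(out)
}

func main() {
	fmt.Println(replaceInvalidBytes([]byte{'a', 0xff, 0xaf, 'b', 0xbf})) // prints "a..b."
}
```

The final string(out) conversion still copies once; if the result is only going to be written out, returning the []byte instead would avoid that.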

See also: Remove invalid UTF-8 characters from a string

The value `U+FFFD` is the value of the constant `utf8.RuneError`, which in turn is the value the package uses to tell the caller that the input is not properly encoded UTF-8. https://pkg.go.dev/unicode/utf8#pkg-constants — , Jan 11 '22 at 10:30

LeGEC · Answer 2 · 2022-01-11T15:55:17.860

@kostix's answer is correct and explains very clearly the issue with scanning Unicode runes from a string.

Just adding the following remark: if your intention is to view only characters in the printable ASCII range (codes 32 to 126) and you don't really care about other Unicode code points, you can be more blunt:

// s is the raw packet data; the bytes from the question are used as sample input
s := string([]byte{'a', 0xff, 0xaf, 'b', 0xbf})

// create a byte slice with the same byte length as s
bs := make([]byte, len(s))

// scan s byte by byte:
for i := 0; i < len(s); i++ {
    switch {
    case 32 <= s[i] && s[i] <= 126: // printable ASCII
        bs[i] = s[i]

    // depending on your needs, you may also keep characters in the 0..31 range,
    // like 'tab' (9), 'linefeed' (10) or 'carriage return' (13):
    // case s[i] == 9, s[i] == 10, s[i] == 13:
    //   bs[i] = s[i]

    default:
        bs[i] = '.'
    }
}

fmt.Printf("rs: %s\n", bs) // prints "rs: a..b."

This function will give you something close to the "text" part of hexdump -C.

Good perspective, actually; +1. — kostix, Jan 11 '22 at 09:52

icza · Answer 3 · 2022-01-11T15:56:09.093

You may want to use strings.ToValidUTF8() for this:

ToValidUTF8 returns a copy of the string s with each run of invalid UTF-8 byte sequences replaced by the replacement string, which may be empty.

It "seemingly" does exactly what you need. Testing it:

a := []byte{'a', 0xff, 0xaf, 'b', 0xbf}
s := strings.ToValidUTF8(string(a), ".")
fmt.Println(s)

Output (try it on the Go Playground):

a.b.

I wrote "seemingly" because, as you can see, there is only a single dot between a and b: the two bytes 0xff and 0xaf form one run of invalid bytes, and each run is replaced only once.

Note that you may avoid the []byte => string conversion, because there's a bytes.ToValidUTF8() equivalent that operates on and returns a []byte:

a := []byte{'a', 0xff, 0xaf, 'b', 0xbf}
a = bytes.ToValidUTF8(a, []byte{'.'})
fmt.Println(string(a))

Output will be the same. Try this one on the Go Playground.

If it bothers you that multiple bytes of an invalid sequence may be collapsed into a single dot, read on.

Also note that to inspect arbitrary byte slices that may or may not contain text, you may simply use hex.Dump(), which generates output like this:

a := []byte{'a', 0xff, 0xaf, 'b', 0xbf}
fmt.Println(hex.Dump(a))

Output:

00000000  61 ff af 62 bf                                    |a..b.|

There's your expected output a..b. with other (useful) data like the hex offset and hex representation of bytes.

To get a "better" picture of the output, try it with a little longer input:

a = []byte{'a', 0xff, 0xaf, 'b', 0xbf, 50: 0xff}
fmt.Println(hex.Dump(a))

00000000  61 ff af 62 bf 00 00 00  00 00 00 00 00 00 00 00  |a..b............|
00000010  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000020  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000030  00 00 ff                                          |...|

Try it on the Go Playground.

There's only one "dot" between `a` and `b`, which is misleading if one wants to see precise byte offsets. I think the OP really just wants printable (& unprintable) ASCII bytes. Using UTF-8 tricks over-complicates things. — colm.anseo, Jan 11 '22 at 13:20
@colm.anseo Yes, maybe. But the character `á` is printable and has code `225`: it even fits into a single byte! Yet in UTF-8 it occupies 2 bytes. If you print those 2 bytes for a single char, that also might be misleading. For accurate offsets, `hex.Dump()` should be preferred. — icza, Jan 11 '22 at 13:24
Agreed. The OP's goal is UDP packet inspection, so they just want signal vs. marker bytes to be clearly delineated. — colm.anseo, Jan 11 '22 at 13:36
