8

I am trying to write a function in Go to truncate strings that contain special characters. One example is below:

"H㐀〾▓朗퐭텟şüöžåйкл¤"

However, I am truncating based on the allowed number of characters, and the cut can land in the middle of a character. This results in the data getting corrupted.

The result comes out like

H㐀〾▓朗퐭텟şüöžå�...

The `�` should not be there. How can I detect these multi-byte characters and cut the string at a character boundary rather than in the middle of one?

package main

import (
    "fmt"
    "regexp"
)

var reNameBlacklist = regexp.MustCompile(`(&|>|<|\/|:|\n|\r)*`)
var maxFileNameLength = 30

// SanitizeName sanitizes user names in an email
func SanitizeName(name string, limit int) string {

    result := name
    reNameBlacklist.ReplaceAllString(result, "")
    if len(result) > limit {
        result = result[:limit] + "..."
    }
    return result
}



func main() {
    str := "H㐀〾▓朗퐭텟şüöžåйкл¤"
    fmt.Println(str)

    strsan := SanitizeName(str, maxFileNameLength)
    fmt.Println(strsan)

}
– Milo Christiansen
– Sakib

4 Answers

11

Slicing a string slices its underlying byte array; the slice operator works on byte indexes, not rune indexes (a rune can span multiple bytes). However, `range` over a string iterates over runes, while the index it yields is the byte offset of each rune. This makes it fairly straightforward to do what you're looking for:

func SanitizeName(name string, limit int) string {
    name = reNameBlacklist.ReplaceAllString(name, "")
    result := name
    chars := 0
    for i := range name { // i is the byte offset of the current rune
        if chars >= limit {
            result = name[:i] // cut at a rune boundary
            break
        }
        chars++
    }
    return result
}

This is explained in further detail on the Go blog.
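For example (a minimal usage sketch, reusing the `reNameBlacklist` regexp and the example string from the question; the expected-output comments are mine), truncating at 10 runes keeps every rune intact:

str := "H㐀〾▓朗퐭텟şüöžåйкл¤"
out := SanitizeName(str, 10)
fmt.Println(out)                   // H㐀〾▓朗퐭텟şüö
fmt.Println(utf8.ValidString(out)) // true: no rune was split (needs "unicode/utf8")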


Update:

As commenters below suggest, you can normalize arbitrary UTF-8 to NFC (Normalization Form Canonical Composition), which combines multi-rune sequences such as letter-plus-diacritic into single-rune forms where possible. This adds a single step using `golang.org/x/text/unicode/norm`. Playground example of this here: https://play.golang.org/p/93qxI11km2f

func SanitizeName(name string, limit int) string {
    name = norm.NFC.String(name) // normalize to NFC first
    name = reNameBlacklist.ReplaceAllString(name, "")
    result := name
    chars := 0
    for i := range name { // i is the byte offset of the current rune
        if chars >= limit {
            result = name[:i] // cut at a rune boundary
            break
        }
        chars++
    }
    return result
}
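To see what NFC buys you, here is a small sketch (my own illustration, using `golang.org/x/text/unicode/norm` and `unicode/utf8`): a decomposed "é" (an "e" followed by a combining acute accent, two runes) collapses into the single precomposed rune, so rune counting and truncation match what a reader perceives:

decomposed := "cafe\u0301" // "café" with a combining acute accent
composed := norm.NFC.String(decomposed)
fmt.Println(utf8.RuneCountInString(decomposed)) // 5
fmt.Println(utf8.RuneCountInString(composed))   // 4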
– Adrian
  • The one difference from the question's code is the "..." when the limit kicks in. I was tempted to strip blacklisted chars from the _shortened_ string, but then you either change the meaning (`sanitize(">>>abc", 3)` becomes `"..."` instead of `"abc..."`) or have to complicate the code. – twotwotwo Sep 26 '17 at 01:06
  • Our current logic strips the string first which is why I kept the truncating afterwards – Sakib Sep 26 '17 at 01:17
  • This answer is great. The only problem is, it doesn't take Unicode grapheme clusters into account (a grapheme cluster is the combination of a base rune like `A` and any extra runes that supply things like diacritics on top of it). It might be useful to normalize the string to NFC before truncating it, so that as many characters as possible are converted into their single-rune variants. – Lassi Dec 08 '18 at 13:00
  • The code above also uses `ReplaceAllString` but does not use the returned string and the `name` variable is never updated. So none of the characters in the regex are replaced. The same issue is present in the answer by @Topo – Uberswe Feb 11 '20 at 12:54
  • Good catch, my bad for copy & pasting from the question. Fixed. – Adrian Feb 11 '20 at 14:41
  • Normalization of the string to NFC as recommended by @Lassi can be done with `norm.NFC.String(string)` in `golang.org/x/text/unicode/norm` package. – jarnoan May 18 '21 at 10:13
3

The reason your data is getting corrupted is that some characters use more than one byte and you are splitting them mid-character. To avoid this, Go has the `rune` type, which represents a single Unicode code point. You can convert the string to a `[]rune` like this:

func SanitizeName(name string, limit int) string {
    name = reNameBlacklist.ReplaceAllString(name, "") // use the returned string
    result := []rune(name)
    if len(result) > limit { // guard against strings shorter than the limit
        result = result[:limit]
    }
    return string(result)
}

This leaves at most the first `limit` runes.
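For instance (a small usage sketch with the question's string; the counts in the comments are mine):

str := "H㐀〾▓朗퐭텟şüöžåйкл¤"
runes := []rune(str)
fmt.Println(len(str))          // 37 (bytes)
fmt.Println(len(runes))        // 16 (runes)
fmt.Println(string(runes[:5])) // H㐀〾▓朗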

– Topo
  • Adrian's approach avoids allocating the four bytes per Unicode code point, and does less work when the input string is long, so I'd go with that. – twotwotwo Sep 26 '17 at 00:49
  • This is by far the simplest way, but it does have some drawbacks. For short strings however, the drawbacks are a minor issue at worst. – Milo Christiansen Sep 26 '17 at 07:27
  • This implementation breaks the string on codepoint boundaries (which is still better than breaking on byte boundaries), but this is not enough to break on character boundaries, as a character may be represented by several codepoints. – dolmen Feb 21 '19 at 10:27
  • One optimization: the cost of conversion to runes can be avoided for short strings whose number of bytes (`len`) is lower than the limit; see the sketch below. – dolmen Feb 21 '19 at 10:38
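A sketch of that optimization (my illustration; `truncateRunes` is a hypothetical name). The fast path is safe because a string can never contain more runes than bytes:

func truncateRunes(s string, limit int) string {
    if len(s) <= limit { // byte length <= limit implies rune count <= limit
        return s
    }
    runes := []rune(s)
    if len(runes) <= limit {
        return s
    }
    return string(runes[:limit])
}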
2

Another option is the utf8string package:

package main

import "golang.org/x/exp/utf8string"

func main() {
    s := utf8string.NewString("H㐀〾▓朗퐭텟şüöžåйкл¤")
    t := s.Slice(0, 2) // slice by rune positions, not byte offsets
    println(t == "H㐀") // true
}

https://pkg.go.dev/golang.org/x/exp/utf8string
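Note that `Slice` panics if the rune indexes are out of range, so for arbitrary input you may want to clamp the limit first. A hedged sketch (`truncate` is a hypothetical helper built on this package):

func truncate(s string, limit int) string {
    u := utf8string.NewString(s)
    if u.RuneCount() <= limit { // nothing to cut
        return s
    }
    return u.Slice(0, limit)
}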

– Zombo
2

Below is a function that truncates a UTF-8 encoded string to a maximum number of bytes, without corrupting the last rune.

// TruncateUTF8String truncates s to n bytes or less. If len(s) is more than n,
// it truncates before the start of the first rune that doesn't fit. s should
// consist of valid UTF-8.
func TruncateUTF8String(s string, n int) string {
    if len(s) <= n {
        return s
    }
    for n > 0 && !utf8.RuneStart(s[n]) { // back up to a rune boundary
        n--
    }
    return s[:n]
}

I assume the goal is to limit either the number of bytes to store or the number of characters to display. This function limits the number of bytes before storing. It is faster than counting runes or characters, since it only inspects the last few bytes.

Note: Multi-rune characters are not considered, so this might cut a character that is a combination of multiple runes. Example: café can become cafe (when the é is stored as an e plus a combining accent). I don't know if this problem can be completely avoided, but it can be reduced by performing Unicode normalization before truncation.
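A sketch of that combination (my illustration, assuming `golang.org/x/text/unicode/norm` and the `TruncateUTF8String` function above):

// Normalize to NFC first so multi-rune sequences collapse into single
// runes where possible, then truncate on a rune boundary as above.
func TruncateUTF8StringNFC(s string, n int) string {
    return TruncateUTF8String(norm.NFC.String(s), n)
}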

– Lars Christian Jensen