
Example string:

"\u0410\u043b\u0435\u043a\u0441\u0430\u043d\u0434\u0440\u044b! \n\u0421\u043f\u0430\u0441\u0438\u0431\u043e \ud83d\udcf8 link.ru \u0437\u0430 \n#hashtag  Русское слово, an English word"

Without the \ud83d\udcf8 sequence, my function works fine:

func convertUnicode(text string) string {
    s, err := strconv.Unquote(`"` + text + `"`)
    if err != nil {
        // Error.Printf("can't convert: %s | err: %s\n", text, err)
        return text
    }
    return s
}

How can I detect that the text contains this kind of sequence? And how can I convert it to an emoji, or remove it from the text? Thanks.

nobilik
  • Possible duplicate of [How to detect when bytes can't be converted to string in Go?](https://stackoverflow.com/questions/34861479/how-to-detect-when-bytes-cant-be-converted-to-string-in-go) – RayfenWindspear Oct 18 '18 at 17:55

1 Answer


Well, it's probably not that simple, since neither \ud83d nor \udcf8 is a valid code point on its own; together they form a surrogate pair, which UTF-16 uses to encode \U0001F4F8. strconv.Unquote does not combine the two surrogate halves for you, so you have to combine them yourself.

  1. Use strconv.Unquote to unquote as you did.
  2. Convert to []rune for convenience.
  3. Find surrogate pairs with unicode/utf16.IsSurrogate.
  4. Combine surrogate pairs with unicode/utf16.DecodeRune.
  5. Convert back to string.
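A minimal sketch of this idea follows, with one deliberate change in order: since the question reports that strconv.Unquote fails on input containing these escapes, the sketch combines the escaped pairs before unquoting rather than after. The names surrogatePair, hasSurrogatePair, combineSurrogates and stripSurrogates are invented for this example, and the regular expression is an assumption about how the escapes appear in the input. It does not handle an escaped backslash immediately before the `u` (as in `\\ud83d`).

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"unicode/utf16"
)

// surrogatePair matches an escaped high surrogate (\ud800-\udbff) directly
// followed by an escaped low surrogate (\udc00-\udfff).
var surrogatePair = regexp.MustCompile(
	`\\u([dD][89abAB][0-9a-fA-F]{2})\\u([dD][c-fC-F][0-9a-fA-F]{2})`)

// hasSurrogatePair reports whether text contains an escaped surrogate pair.
func hasSurrogatePair(text string) bool {
	return surrogatePair.MatchString(text)
}

// combineSurrogates replaces each escaped pair with the character it encodes.
func combineSurrogates(text string) string {
	return surrogatePair.ReplaceAllStringFunc(text, func(m string) string {
		parts := surrogatePair.FindStringSubmatch(m)
		// The regexp guarantees four hex digits, so parse errors are ignored.
		hi, _ := strconv.ParseUint(parts[1], 16, 32)
		lo, _ := strconv.ParseUint(parts[2], 16, 32)
		return string(utf16.DecodeRune(rune(hi), rune(lo)))
	})
}

// stripSurrogates removes escaped pairs instead of decoding them.
func stripSurrogates(text string) string {
	return surrogatePair.ReplaceAllString(text, "")
}

// convertUnicode mirrors the question's function but combines pairs first.
func convertUnicode(text string) string {
	s, err := strconv.Unquote(`"` + combineSurrogates(text) + `"`)
	if err != nil {
		return text
	}
	return s
}

func main() {
	in := `\u0421\u043f\u0430\u0441\u0438\u0431\u043e \ud83d\udcf8 link.ru`
	fmt.Println(hasSurrogatePair(in))                // detection
	fmt.Println(convertUnicode(in))                  // pair decoded to an emoji
	fmt.Println(convertUnicode(stripSurrogates(in))) // pair removed
}
```

hasSurrogatePair covers the detection part of the question, while combineSurrogates and stripSurrogates cover conversion and removal respectively. If the input is actually JSON, decoding it with encoding/json may be a simpler alternative, since JSON string decoding combines surrogate pairs itself.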
Volker