
I have a Text object containing some number of Latin characters that need to be converted to Unicode escape sequences of the format \u####, where each # is a hex digit.

As described here, Haskell easily converts strings to escape sequences and vice versa. However, it only produces the decimal representation. For example:

> let s = "Ñ"
> s
"\209"

Is there a way to specify the escape sequence encoding so that the output comes out in the desired format? I.e.:

> let s = encodeUnicode16 "Ñ"
> s
"\u00d1"
ljedrz
jkeuhlen
  • Trying to understand what you're asking for here, you're looking for 1) a function `Text -> Text` that 2) you are assuming will only be passed a value containing `Data.Char.isLatin1` matching `Char`s in order to 3) replace `not . Data.Char.isAscii` `Char`s with `Text` strings containing that `Char`'s c-style four digit form backslash-u encoding representation? E.g., `f "El Niño" = "El Ni\\u00f1o"`, nb the escaped backslash. – R B Aug 30 '16 at 20:48
  • @RowanBlush That is an excellent summary, yes. – jkeuhlen Aug 30 '16 at 21:37

1 Answer

How about this:

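import Text.Printf (printf)

-- Escape every character outside the range ' '..'z' as \uXXXX.
encodeUnicode16 :: String -> String
encodeUnicode16 = concatMap escapeChar
  where
    escapeChar c
        -- Characters from space up to 'z' are kept unchanged.
        | ' ' <= c && c <= 'z' = [c]
        -- Everything else becomes its code point as at least four hex digits.
        | otherwise =
            printf "\\u%04x" (fromEnum c)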

In GHCi, you can use it as follows:

> putStrLn $ encodeUnicode16 "Ñ"
\u00d1

Note that if you don't use putStrLn, the backslash will be escaped a second time:

> encodeUnicode16 "Ñ"
"\\u00d1"

This is because GHCi implicitly applies print to the expression, and print goes through show, which adds the quotes and escapes the backslash.
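To make the difference concrete, here is a small sketch of what happens to the already-escaped string. The names escaped, asShownByGhci and demo are made up for this illustration and are not part of the answer's code.

```haskell
-- The escaped form of "Ñ": six characters, one of which is a single backslash.
escaped :: String
escaped = "\\u00d1"

-- What GHCi displays for a bare String expression: the result of `show`,
-- which adds the surrounding quotes and escapes the backslash again.
asShownByGhci :: String
asShownByGhci = show escaped

-- Print both forms for comparison.
demo :: IO ()
demo = do
  putStrLn escaped        -- prints \u00d1
  putStrLn asShownByGhci  -- prints "\\u00d1"
```

The value itself contains only one backslash; the second one exists only in the text that show produces for display.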

Edit: I missed the part where you have a Text and not a String. Here's the same code for Text:

import Data.Text (Text)
import qualified Data.Text as T
import qualified Data.Text.IO as T
import Text.Printf (printf)

encodeUnicode16 :: Text -> Text
encodeUnicode16 = T.concatMap escapeChar
  where
    escapeChar c
        | ' ' <= c && c <= 'z' = T.singleton c
        | otherwise =
            T.pack $ printf "\\u%04x" (fromEnum c)

Again, use T.putStrLn to print the result; otherwise the backslashes will be escaped a second time.
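One caveat: for characters above U+FFFF, "%04x" produces more than four hex digits, so the result no longer matches \u####. The question only mentions Latin characters, so this may never matter. If it does, the following is a rough sketch of a variant, assuming the consumer expects UTF-16 surrogate pairs (as JSON and Java do). The name encodeUtf16Escapes and the helper unit are hypothetical. It also keeps all printable ASCII as-is (including '{' through '~'), which differs slightly from the code above.

```haskell
import Data.Char (isAscii, isPrint, ord)
import Data.Text (Text)
import qualified Data.Text as T
import Text.Printf (printf)

-- Hypothetical variant of encodeUnicode16; not part of the original answer.
-- Assumes the consumer wants UTF-16 code units, so characters above U+FFFF
-- are written as a surrogate pair of two \uXXXX escapes.
encodeUtf16Escapes :: Text -> Text
encodeUtf16Escapes = T.concatMap escapeChar
  where
    escapeChar c
      -- Printable ASCII (space through '~') is kept unchanged.
      | isAscii c && isPrint c = T.singleton c
      -- Other characters in the Basic Multilingual Plane fit in one unit.
      | n < 0x10000 = unit n
      -- Everything else is split into a high and a low surrogate.
      | otherwise = unit hi <> unit lo
      where
        n = ord c
        m = n - 0x10000
        hi = 0xD800 + m `div` 0x400
        lo = 0xDC00 + m `mod` 0x400

    -- Render one 16-bit code unit as \uXXXX with lowercase hex digits.
    unit :: Int -> Text
    unit = T.pack . printf "\\u%04x"
```

Like the original, this passes a literal backslash through unchanged, so the output is not unambiguously reversible if the input can itself contain backslashes.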

redneb