
Consider the following examples (λ> = ghci, $ = shell):

λ> writeFile "d" $ show "d"
$ cat d
"d"

λ> writeFile "d" "d"
$ cat d
d

λ> writeFile "backslash" $ show "\\"
$ cat backslash
"\\"

λ> writeFile "backslash" "\\"
$ cat backslash
\

λ> writeFile "cat" $ show "🐈" -- U+1F408
$ cat cat
"\128008"

λ> writeFile "cat" "🐈"
$ cat cat
🐈

I understand that "\128008" is just another way of representing "🐈" in Haskell source code. My question is: why does the "🐈" example behave like the backslash instead of like "d"? Since it is a printable character, shouldn't it behave like a letter?

More generally, what is the rule to determine whether the character will be shown as a printable character or as an escape code? I looked at Section 6.3 in the Haskell 2010 Language report but it doesn't specify the exact behaviour.

typesanitizer
  • Related: [How to hack GHCi (or Hugs) so that it prints Unicode chars unescaped?](https://stackoverflow.com/q/5535512/2682729). – typesanitizer Feb 17 '18 at 22:23

1 Answer


TL;DR: Printable characters in the ASCII range (0-127) are shown as themselves.* Everything else is escaped.

* Except for double quotes (as they're used for string delimiters) and backslashes (because they're needed for escaping).
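To make the rule concrete, here is a small sketch that runs show over a few one-character strings. The name samples is only illustrative, and the non-ASCII characters are written as numeric escapes so the source file itself stays ASCII.

```haskell
-- A handful of one-character strings, one per case of the rule.
samples :: [String]
samples =
  [ "d"        -- printable ASCII: shown as itself
  , "\\"       -- backslash: escaped
  , "\""       -- double quote: escaped inside a string
  , "\DEL"     -- code point 127: escaped by name
  , "\233"     -- U+00E9, outside ASCII: numeric escape
  , "\128008"  -- U+1F408, outside ASCII: numeric escape
  ]

main :: IO ()
main = mapM_ (putStrLn . show) samples
```

Every line this prints is pure ASCII, even for the last two inputs.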

Let's have a look at the source code to figure this one out!

Since we have String = [Char], we should hunt for instance Show Char in the source. It can be found here. It is defined as:

-- | @since 2.01
instance  Show Char  where
    showsPrec _ '\'' = showString "'\\''"
    showsPrec _ c    = showChar '\'' . showLitChar c . showChar '\''

    showList cs = showChar '"' . showLitString cs . showChar '"'

So showing a String (using showList) is basically a wrapper around showLitString, and showing a Char is a wrapper around showLitChar. Let's look at those functions.
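A quick way to see the two code paths side by side is to show the same character once as a Char and once as a one-element String. The helper charVsString below is a hypothetical name introduced just for this sketch.

```haskell
-- Show a character both ways: via showsPrec (Char) and via showList (String).
charVsString :: Char -> (String, String)
charVsString c = (show c, show [c])

main :: IO ()
main = mapM_ (print . charVsString) "d'\"\128008"
```

The single quote is escaped only in the Char form and the double quote only in the String form, which matches the special cases in the instance and in showLitString.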

showLitString :: String -> ShowS
-- | Same as 'showLitChar', but for strings
-- It converts the string to a string using Haskell escape conventions
-- for non-printable characters. Does not add double-quotes around the
-- whole thing; the caller should do that.
-- The main difference from showLitChar (apart from the fact that the
-- argument is a string not a list) is that we must escape double-quotes
showLitString []         s = s
showLitString ('"' : cs) s = showString "\\\"" (showLitString cs s)
showLitString (c   : cs) s = showLitChar c (showLitString cs s)
   -- [explanatory comments ...]

As you might've expected, showLitString is mostly a wrapper around showLitChar. [Note: If you're unfamiliar with the ShowS type, this is a good answer to understand why it might be useful.] Not quite what we were looking for, so let us go to showLitChar (I've omitted parts of the definition which aren't relevant to the question).

-- | Convert a character to a string using only printable characters,
-- using Haskell source-language escape conventions.  For example:
-- [...]
showLitChar                :: Char -> ShowS
showLitChar c s | c > '\DEL' =  showChar '\\' (protectEsc isDec (shows (ord c)) s)
-- ^ Pattern matched for cat
showLitChar '\DEL'         s =  showString "\\DEL" s
showLitChar '\\'           s =  showString "\\\\" s
-- ^ Pattern matched for backslash
showLitChar c s | c >= ' '   =  showChar c s
-- ^ Pattern matched for d
-- Some more escape codes
showLitChar '\a'           s =  showString "\\a" s
-- similarly for '\b', '\f', '\n', '\r', '\t', '\v' etc.
-- showLitChar ... = ...

Now you see where the problem is. ord c is an Int, and the first equation matches every character above '\DEL' (ord '\DEL' == 127), that is, every non-ASCII character. For characters in the ASCII range, the printable ones are printed as they are and the rest are escaped. For characters outside it, all of them are escaped, printable or not.
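As a sanity check on this reading of the source, the rule can be written down as a predicate and compared against show for every Char. This is only a sketch: predictedLiteral and shownLiterally are names made up here, not anything from base.

```haskell
-- The rule as read off the equations above: a character inside a string
-- is shown as itself iff it lies in ' ' .. '~' and is neither '"' nor '\\'.
predictedLiteral :: Char -> Bool
predictedLiteral c = c >= ' ' && c < '\DEL' && c /= '"' && c /= '\\'

-- What show actually does to the one-character string [c].
shownLiterally :: Char -> Bool
shownLiterally c = show [c] == ['"', c, '"']

main :: IO ()
main = do
  let mismatches = filter (\c -> predictedLiteral c /= shownLiterally c)
                          [minBound .. maxBound]
  -- Number of characters where the predicted rule disagrees with show.
  print (length mismatches)
```

If the predicate captures the rule, the printed count is 0.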

The code doesn't answer the "why" part of the question. The answer to that (I think) is in the very first comment that we saw:

-- | @since 2.01
instance  Show Char  where

My first guess was that this behaviour has been kept around to maintain backwards compatibility. There is no need to guess, though: see the comments for some good answers to this.

Bonus

We can do a git blame online using GHC's GitHub mirror ;). Let's see when this code was written (blame link). The relevant commit is 15 years old (!). However, it does mention Unicode.

The functionality to distinguish between different types of Unicode characters is present in the Data.Char module. Looking at the source:

isPrint    c = iswprint (ord c) /= 0

foreign import ccall unsafe "u_iswprint"
  iswprint :: Int -> Int

If you trace the commit which introduced iswprint, you'll land up here. That commit was made 13 years ago. Maybe there was sufficient code written in those two years which they didn't want to break? I don't know. If some GHC developer could shed more light on this, that'd be awesome :). Daniel Wagner and Paul Johnson in the comments have pointed out a very good reason for this - operating with non-Unicode systems must've been a high priority (~15 years ago) as Unicode was relatively new back then.
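For comparison, here is a minimal sketch of what an isPrint-based variant could look like. showPrintable is a hypothetical helper, not something in base; it keeps any character that isPrint accepts and falls back to showLitChar for the rest. Which characters count as printable depends on the Unicode tables shipped with your version of base, and the output is no longer guaranteed to be ASCII.

```haskell
import Data.Char (isPrint, showLitChar)
import System.IO (hSetEncoding, stdout, utf8)

-- Hypothetical alternative to show for strings: printable characters are
-- kept as they are, everything else is escaped exactly as showLitChar does.
showPrintable :: String -> String
showPrintable str = '"' : foldr step "\"" str
  where
    step :: Char -> String -> String
    step '"'  rest = '\\' : '"' : rest    -- still the string delimiter
    step '\\' rest = '\\' : '\\' : rest   -- still the escape character
    step c rest
      | isPrint c = c : rest
      | otherwise = showLitChar c rest    -- control characters and the like

main :: IO ()
main = do
  -- Assumes a terminal that can display UTF-8 output.
  hSetEncoding stdout utf8
  let s = "caf\233 \128008\n"
  putStrLn (show s)           -- every non-ASCII character escaped
  putStrLn (showPrintable s)  -- only the newline escaped
```

Passing the rest of the output to showLitChar lets it handle cases such as a numeric escape followed by a digit, the same way the real instance does.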

typesanitizer
  • The output of `Show` is 7-bit clean, a nice property for inclusion in/attachment to email because that means it doesn't need an extra round of escaping, nice because ASCII is available almost everywhere, and nice because many popular encodings agree on what bytes to use for 7-bit clean text. – Daniel Wagner Feb 17 '18 at 22:38
  • Is that just a useful side effect of the current implementation or one of the actual justifications for having it the way it is? – typesanitizer Feb 17 '18 at 23:01
  • That is the folklore justification that I have heard repeated often. I don't have a canonical source to support it, though. – Daniel Wagner Feb 18 '18 at 00:42
  • 15 years ago being compatible with non-unicode systems was more important than it is now. However there is also the bonus that when you read an escaped string you can see exactly what is there, including control characters. That is useful because different strings can wind up looking the same when printed. The escaped version lets you know what is really going on. – Paul Johnson Feb 18 '18 at 10:25
  • @Paul Johnson, good point! Having never really written code that needs to interop with old code, I didn't think of that. – typesanitizer Feb 18 '18 at 19:30