
While answering haskell-convert-unicode-sequence-to-utf-8, I came upon some strange behaviour of ByteString.putStrLn:

{-# LANGUAGE OverloadedStrings #-}

module Main where

import           Data.Text (Text)
import           Data.ByteString (ByteString)
import qualified Data.ByteString.Char8 as B


inputB, inputB' :: ByteString
inputB = "ДЕЖЗИЙКЛМНОПРСТУФ"
inputB' = "test"


main :: IO ()
main = do putStr "B.putStrLn inputB: "; B.putStrLn inputB
          putStr "print inputB: "; print inputB
          putStr "B.putStrLn inputB': "; B.putStrLn inputB'
          putStr "print inputB': "; print inputB'

which yields

B.putStrLn inputB:
rint inputB: "\DC4\NAK\SYN\ETB\CAN\EM\SUB\ESC\FS\GS\RS\US !\"#$"
B.putStrLn inputB': test
print inputB': "test"

What I do not understand is why the first output line is missing, and why the p in print is missing from the second line.

My guess is that this has something to do with the Russian letters leading to malformed input, because the simple case of "test" just works.


Edit

  • Platform: Linux Mint 17.3
  • file-encoding: UTF-8
  • terminal: gnome-terminal/tmux/zsh
  • ghc: 7.10.3
  • stack: 1.0.4

xxd output

> stack exec -- unicode | xxd
00000000: 422e 7075 7453 7472 4c6e 2069 6e70 7574  B.putStrLn input
00000010: 423a 2014 1516 1718 191a 1b1c 1d1e 1f20  B: ............
00000020: 2122 2324 0a70 7269 6e74 2069 6e70 7574  !"#$.print input
00000030: 423a 2022 5c44 4334 5c4e 414b 5c53 594e  B: "\DC4\NAK\SYN
00000040: 5c45 5442 5c43 414e 5c45 4d5c 5355 425c  \ETB\CAN\EM\SUB\
00000050: 4553 435c 4653 5c47 535c 5253 5c55 5320  ESC\FS\GS\RS\US
00000060: 215c 2223 2422 0a42 2e70 7574 5374 724c  !\"#$".B.putStrL
00000070: 6e20 696e 7075 7442 273a 2074 6573 740a  n inputB': test.
00000080: 7072 696e 7420 696e 7075 7442 273a 2022  print inputB': "
00000090: 7465 7374 220a                           test".

libraries

> stack exec -- ghc-pkg list
/opt/ghc/7.10.3/lib/ghc-7.10.3/package.conf.d
   Cabal-1.22.5.0
   array-0.5.1.0
   base-4.8.2.0
   bin-package-db-0.0.0.0
   binary-0.7.5.0
   bytestring-0.10.6.0
   containers-0.5.6.2
   deepseq-1.4.1.1
   directory-1.2.2.0
   filepath-1.4.0.0
   ghc-7.10.3
   ghc-prim-0.4.0.0
   haskeline-0.7.2.1
   hoopl-3.10.0.2
   hpc-0.6.0.2
   integer-gmp-1.0.0.0
   pretty-1.1.2.0
   process-1.2.3.0
   rts-1.0
   template-haskell-2.10.0.0
   terminfo-0.4.0.1
   time-1.5.0.1
   transformers-0.4.2.0
   unix-2.7.1.0
   xhtml-3000.2.1
/home/epsilonhalbe/.stack/snapshots/x86_64-linux/lts-5.5/7.10.3/pkgdb
   text-1.2.2.0
/home/epsilonhalbe/programming/unicode/.stack-work/install/x86_64-linux/lts-5.5/7.10.3/pkgdb

and the locale

> locale
LANG=en_US.UTF-8
LANGUAGE=
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC=de_AT.UTF-8
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY=de_AT.UTF-8
LC_MESSAGES="en_US.UTF-8"
LC_PAPER=de_AT.UTF-8
LC_NAME=de_AT.UTF-8
LC_ADDRESS=de_AT.UTF-8
LC_TELEPHONE=de_AT.UTF-8
LC_MEASUREMENT=de_AT.UTF-8
LC_IDENTIFICATION=de_AT.UTF-8
LC_ALL=
epsilonhalbe

2 Answers


It is not a terminal problem; rather, the problem occurs earlier, in the conversion to ByteString. Remember that, because you used OverloadedStrings,

inputB = "ДЕЖЗИЙКЛМНОПРСТУФ"

is really shorthand for

inputB = fromString "ДЕЖЗИЙКЛМНОПРСТУФ"::ByteString

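which does not encode the string as UTF-8. Instead, the IsString instance for ByteString truncates every Char to its low 8 bits, so Д (U+0414) becomes the single byte 0x14 (DC4), which is exactly what print shows.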
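The truncation can be checked directly. This is a minimal sketch, assuming only the `bytestring` package and a UTF-8 encoded source file; the helper names `truncateChar` and `packedBytes` are made up for illustration, and `B.pack` stands in for the overloaded literal because, as far as I know, both go through the same truncating conversion.

```haskell
import qualified Data.ByteString       as BS
import qualified Data.ByteString.Char8 as B
import           Data.Char             (ord)
import           Data.Word             (Word8)

-- Hypothetical helper: keep only the low 8 bits of a Char's code point.
-- fromIntegral to Word8 wraps modulo 256.
truncateChar :: Char -> Word8
truncateChar = fromIntegral . ord

-- Hypothetical helper: the bytes that Char8.pack actually stores.
packedBytes :: String -> [Word8]
packedBytes = BS.unpack . B.pack

main :: IO ()
main = do
  print (packedBytes "ДЕЖ")       -- prints [20,21,22]
  print (map truncateChar "ДЕЖ")  -- prints [20,21,22]
```

Both lines agree: U+0414, U+0415 and U+0416 end up as the bytes 0x14, 0x15 and 0x16, matching the start of the xxd output in the question.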

If, instead, you want the ByteString to contain UTF-8 encoded characters, use Data.ByteString.UTF8 (from the utf8-string package):

import qualified Data.ByteString.UTF8 as BU

inputB = BU.fromString "ДЕЖЗИЙКЛМНОПРСТУФ"

then this will work

B.putStrLn inputB

Why is the "p" on line two missing?

I won't go into every detail (I don't know them all), but the behavior is expected: the bytes that were written are not the UTF-8 encoding of the Russian string, and your terminal interprets whatever it actually receives.

As the xxd output shows, the string became the bytes 0x14 through 0x24, most of which are ASCII control characters. One of them is 0x1b (ESC), which starts a terminal escape sequence. The terminal most likely treated the bytes that follow, up to and including the "p", as part of that sequence rather than as printable text. Your terminal seems to silently drop what it can't display (mine prints garbage), so both the mangled string and the next character were lost.

Note that the "p" is present in the xxd output (byte 0x70 at offset 0x25). The program wrote it; the terminal just consumed it instead of printing it.
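The same check that xxd performs can be done from inside the program, which takes the terminal out of the picture. The following is a rough sketch using only `bytestring` and `base`; `hexByte`, `hexDump` and `containsEsc` are hypothetical helper names, not library functions. Byte 0x1b is ESC, which terminals treat as the start of an escape sequence, so its presence is a good hint that the output will not render as plain text.

```haskell
import qualified Data.ByteString       as BS
import qualified Data.ByteString.Char8 as B
import           Data.Word             (Word8)
import           Numeric               (showHex)

-- Hypothetical helper: one byte as two lowercase hex digits.
hexByte :: Word8 -> String
hexByte w = (if w < 0x10 then "0" else "") ++ showHex w ""

-- Hypothetical helper: space-separated hex dump of a ByteString.
hexDump :: BS.ByteString -> String
hexDump = unwords . map hexByte . BS.unpack

-- Hypothetical helper: does the ByteString contain an ESC byte?
containsEsc :: BS.ByteString -> Bool
containsEsc = BS.elem 0x1b

main :: IO ()
main = do
  let bs = B.pack "ДЕЖЗИЙКЛМНОПРСТУФ"  -- truncated to 8 bits, as in the question
  putStrLn (hexDump bs)
  -- prints 14 15 16 17 18 19 1a 1b 1c 1d 1e 1f 20 21 22 23 24
  print (containsEsc bs)                -- prints True
```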

jamshidh
  • This only answers part of my question: why is the *p* on the next line not printed? But thanks for enlightening me on the behavior of `fromString`; I kind of suspected something like this. – epsilonhalbe Feb 29 '16 at 00:41
  • "It is not a terminal problem" -- yes it is. His program is printing all the bytes he expected it to print (namely, the bytes of the first line and the `p` of the second line), but the terminal isn't showing him those bytes. – Daniel Wagner Feb 29 '16 at 18:46
  • @DanielWagner - I am assuming that he wants the output to be utf8, and that the terminal is correct.... But you are correct, it could be the other way around, the program had the correct encoding, and he could set the terminal to that encoding (not sure what default encoding it was using). At any rate, I think my assumption was correct, because converting to utf8 fixed his problem. – jamshidh Feb 29 '16 at 18:50
  • @jamshidh Yeah, perhaps it's just best to say the two things (program+terminal) didn't agree in the expected way, and leave it at that without trying to sort out which one to blame. =P – Daniel Wagner Feb 29 '16 at 18:55

Quoting from the documentation of Data.ByteString.Char8 (emphasis mine)

Manipulate ByteStrings using Char operations. All Chars will be truncated to 8 bits. It can be expected that these functions will run at identical speeds to their Word8 equivalents in Data.ByteString.

More specifically these byte strings are taken to be in the subset of Unicode covered by code points 0-255. This covers Unicode Basic Latin, Latin-1 Supplement and C0+C1 Controls.

Cyrillic is not allocated in code points 0x00-0xFF, so encoding issues are to be expected.

I would recommend against Data.ByteString.Char8 unless you are dealing with plain ASCII. Even if Latin-1 encoded text may work in certain environments, the Latin-1 encoding is obsolescent and should die.

For handling general strings, use Data.Text instead. Conversion functions from ByteStrings to Text, and vice versa, are provided. Of course, these functions have to depend on some encoding.
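As an illustration of that route, here is a small sketch assuming the `text` and `bytestring` packages and a UTF-8 encoded source file. The names `inputT` and `inputUtf8` are invented for this example; `encodeUtf8` and `decodeUtf8` come from `Data.Text.Encoding`, and the encoding is chosen explicitly rather than left to the string literal.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import qualified Data.ByteString    as BS
import           Data.Text          (Text)
import qualified Data.Text.Encoding as TE
import qualified Data.Text.IO       as TIO

-- Hypothetical name: the string kept as Text, so no code point is truncated.
inputT :: Text
inputT = "ДЕЖЗИЙКЛМНОПРСТУФ"

-- Hypothetical name: the same string explicitly encoded as UTF-8 bytes.
inputUtf8 :: BS.ByteString
inputUtf8 = TE.encodeUtf8 inputT

main :: IO ()
main = do
  TIO.putStrLn inputT                        -- encoded according to the handle's encoding
  BS.putStr inputUtf8 >> BS.putStr "\n"      -- raw UTF-8 bytes, written as-is
  print (BS.length inputUtf8)                -- prints 34 (17 characters, 2 bytes each)
  print (TE.decodeUtf8 inputUtf8 == inputT)  -- prints True
```

Keeping the value as Text until the moment it is written makes the encoding step visible in the code, instead of hiding it in the overloaded literal.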

chi