
I have a problem with character encoding in Haskell. This simple program prints the wrong results. What I am really interested in here is the encode function, which forces me to use ByteString. The program is:

import Data.ByteString.Char8 (unpack, pack)
import Data.ByteString.Lazy (toStrict)
import Data.Csv (encode) -- cabal install cassava

main = do
    -- (the middle character is a Polish letter with a diacritic)
    putStrLn $ unpack $ pack "aća"
    putStrLn $ unpack $ toStrict $ encode ["aća"]

It should print

aća
a,ć,a

but instead it writes

aa
a,Ä,a

This breaks my application, which encodes CSV. It happens on Linux regardless of my locale settings:

$ locale
LANG=pl_PL.UTF-8
LC_CTYPE="pl_PL.UTF-8"
LC_NUMERIC="pl_PL.UTF-8"
LC_TIME="pl_PL.UTF-8"
LC_COLLATE="pl_PL.UTF-8"
LC_MONETARY="pl_PL.UTF-8"
LC_MESSAGES="pl_PL.UTF-8"
LC_PAPER="pl_PL.UTF-8"
LC_NAME="pl_PL.UTF-8"
LC_ADDRESS="pl_PL.UTF-8"
LC_TELEPHONE="pl_PL.UTF-8"
LC_MEASUREMENT="pl_PL.UTF-8"
LC_IDENTIFICATION="pl_PL.UTF-8"
LC_ALL=pl_PL.UTF-8

or

$ locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

What I want to know is how to convert the output of encode (a Data.ByteString.Lazy.ByteString) to a String so that I can write it to a file using, e.g., the writeFile function.

Trismegistos
  • `ByteString` doesn't care about encoding, it just reads bytes. Have you tried using `Data.Text` instead? – bheklilr May 01 '15 at 14:47
  • @bheklilr I have added the import for toStrict. – Trismegistos May 01 '15 at 14:48
  • You want to use `Text` instead. `Data.ByteString` will truncate any `Char` to a `Char8`. – Zeta May 01 '15 at 14:49
  • Also, know that in the second case you are calling `encode` on a list of `Char`s, not on a list of `ByteString`s, so they aren't really equivalent. I would highly recommend using `Data.Text` instead of `Data.ByteString` for this application. – bheklilr May 01 '15 at 14:50
  • @bheklilr So how should I modify this file so that the encoding works correctly? I am interested in the encode function, but since it uses ByteString I have added sample code showing ByteString dropping letters. – Trismegistos May 01 '15 at 14:50
  • More to the point, `ByteString` outputs everything as Latin-1, which will result in certain characters not being valid UTF-8. – MathematicalOrchid May 01 '15 at 14:50
  • @Zeta I do not have much choice about data type. It is enforced by encode function which I use to encode to csv. – Trismegistos May 01 '15 at 14:51
  • @Trismegistos Look at [Data.Text.Encoding](http://hackage.haskell.org/package/text-1.1.1.3/docs/Data-Text-Encoding.html) for functions to convert from `ByteString` to a proper `Text` encoding. – bheklilr May 01 '15 at 15:09
  • @bheklilr When I tried to use Data.Text, Haskell said 'GHCi runtime linker: fatal error: I found a duplicate definition for symbol _hs_text_memcpy whilst processing object file /home/wolk/.cabal/lib/text-1.2.0.4/ghc-7.6.3/libHStext-1.2.0.4.a'. This library collides with almost every other library I use, e.g. Data.String.Utils, cassava. – Trismegistos May 01 '15 at 15:36
  • @Trismegistos Looks like you might have `text` and `bytestring` installed improperly (welcome to [cabal hell](http://stackoverflow.com/questions/25869041/whats-the-reason-behind-cabal-dependency-hell)). Try installing them both in a [sandbox](https://www.haskell.org/cabal/users-guide/installing-packages.html#developing-with-sandboxes) and compiling with those versions. – bheklilr May 01 '15 at 15:39
  • @bheklilr What does "installed improperly" mean? How could that happen when I haven't done anything except run cabal install libname? – Trismegistos May 01 '15 at 15:41

1 Answer


You should simply use Data.ByteString.Lazy.putStr rather than putStrLn . unpack . toStrict. No need to go through Text.

Data.ByteString.Char8.unpack converts the byte with value n to the Unicode code point with value n. Don't use it on (non-ASCII) UTF-8 encoded text!
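
To make the two failure modes in the question concrete, here is a minimal sketch that needs only the bytestring package. The names roundTripCodePoints, utf8Bytes and misreadCodePoints are made up for this illustration. It relies on Data.ByteString.Char8.pack keeping only the low 8 bits of each Char, which is why 'ć' (U+0107) collapses to the control byte 0x07 in the first line of the question's output.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as BC
import Data.Char (ord)

-- Code points of each Char after a round trip through Char8.pack/unpack.
-- Char8.pack keeps only the low 8 bits of every Char, so anything above
-- U+00FF is silently changed.
roundTripCodePoints :: String -> [Int]
roundTripCodePoints = map ord . BC.unpack . BC.pack

-- The UTF-8 encoding of "aća": 'ć' (U+0107) is the two bytes 0xC4 0x87.
utf8Bytes :: B.ByteString
utf8Bytes = B.pack [0x61, 0xC4, 0x87, 0x61]

-- Char8.unpack turns every byte into its own Char, so the two-byte
-- sequence for 'ć' comes back as two separate characters (U+00C4, U+0087).
misreadCodePoints :: [Int]
misreadCodePoints = map ord (BC.unpack utf8Bytes)
```

Loaded in GHCi, roundTripCodePoints "aća" shows the middle character reduced to code point 7, and misreadCodePoints shows four characters where the original text had three.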

Edit: I see you say you want to convert the result of encode to a String in order to write it to a file. Don't do that; use the ByteString IO functions such as Data.ByteString.Lazy.writeFile instead.
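
A short sketch of that advice, assuming the cassava package from the question is installed. The helper names writeCsv and printCsv are hypothetical, and the rows are assumed to be lists of String fields (which cassava encodes as UTF-8); other record types would work the same way as long as they have a ToRecord instance.

```haskell
import qualified Data.ByteString.Lazy as BL
import Data.Csv (encode)  -- from the cassava package

-- Encode the rows as CSV and write the resulting bytes to a file unchanged.
-- No String conversion happens, so the locale is never involved.
writeCsv :: FilePath -> [[String]] -> IO ()
writeCsv path rows = BL.writeFile path (encode rows)

-- Print the CSV bytes to stdout, again without any re-encoding.
printCsv :: [[String]] -> IO ()
printCsv = BL.putStr . encode
```

For example, writeCsv "out.csv" [["aća", "b"]] should produce a UTF-8 file, and printCsv displays correctly on a terminal that expects UTF-8.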

Reid Barton
  • That worked, but what if I want to convert the ByteString to a String or Text to process it further? – Trismegistos May 01 '15 at 19:41
  • @Trismegistos you should use the functions in the package `utf8-string`, the documentation is at http://hackage.haskell.org/package/utf8-string, and I think it's installed by default with the Haskell platform. – Jeremy List May 19 '15 at 07:44
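
For the follow-up question about further processing, one possible sketch uses the text package mentioned in the comments rather than utf8-string. It assumes the bytes are UTF-8, which is what cassava produces for String and Text fields. The names decodeCsv and csvToString are hypothetical.

```haskell
import qualified Data.ByteString.Lazy as BL
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.Encoding as TLE
import Data.Text.Encoding.Error (UnicodeException)

-- Decode UTF-8 bytes to lazy Text. Invalid UTF-8 gives a Left value
-- instead of throwing an exception.
decodeCsv :: BL.ByteString -> Either UnicodeException TL.Text
decodeCsv = TLE.decodeUtf8'

-- Go all the way to String if a String is really what is needed.
csvToString :: BL.ByteString -> Either UnicodeException String
csvToString = fmap TL.unpack . decodeCsv
```

Staying in Text is usually preferable to String for processing. Note that writing a String back out with Prelude's writeFile uses the locale encoding, so the ByteString IO functions remain the safer way to write the final result.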