Since the latin-1 (aka ISO-8859-1) character set is embedded in the Unicode character set as its lowest 256 code-points, I'd expect the conversion to be trivial, but I didn't see any latin-1 conversion functions in Data.Text.Encoding, which contains only conversion functions for the common UTF encodings.

What's the recommended and/or efficient way to convert between Data.ByteString values encoded in latin-1 representation and Data.Text values?

hvr
  • By the way, the assumption that "since the latin-1 character set is embedded in the Unicode character set as its lowest 256 code-points, I'd expect the conversion to be trivial" is unwarranted. There is no reason to expect that the bytestreams resulting from encoding a single codepoint stream in two different encodings should have a trivial relationship to each other. – Daniel Wagner Sep 25 '11 at 14:25
  • @DanielWagner: Yes, I'm aware that in the general case I shouldn't expect this (for instance if `Data.Text` used utf8 as its internal Unicode representation), but the current version of the `Data.Text` library uses UTF16 representation, for which the conversion from latin1 is in fact a trivial conversion consisting in inserting zero octets after or before (depending on whether UTF16LE or UTF16BE is required) each latin1 octet. – hvr Sep 25 '11 at 20:46
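
The widening described in the comment above can be sketched with the public API. This is only an illustration, not how Data.Text handles it internally: it builds a UTF-16LE byte string by appending a zero octet to each latin-1 octet and then hands the result to `decodeUtf16LE` from Data.Text.Encoding. The names `latin1ToUtf16LE` and `latin1ToTextViaUtf16` are made up for this example.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import Data.Text.Encoding (decodeUtf16LE)

-- | Widen latin-1 bytes to UTF-16LE: each octet is followed by a zero
-- high byte (hypothetical helper, for illustration only).
latin1ToUtf16LE :: B.ByteString -> B.ByteString
latin1ToUtf16LE = B.concatMap (\w -> B.pack [w, 0])

-- | Decode latin-1 by going through the UTF-16LE decoder.
latin1ToTextViaUtf16 :: B.ByteString -> T.Text
latin1ToTextViaUtf16 = decodeUtf16LE . latin1ToUtf16LE
```

This allocates an intermediate byte string twice the size of the input, so it demonstrates the relationship between the two encodings rather than an efficient conversion.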

1 Answer

The answer is right at the top of the page you linked:

To gain access to a much larger family of encodings, use the text-icu package: http://hackage.haskell.org/package/text-icu

A quick GHCi example:

λ> import Data.Text.ICU.Convert
λ> conv <- open "ISO-8859-1" Nothing
λ> Data.Text.IO.putStrLn $ toUnicode conv $ Data.ByteString.pack [198, 216, 197]
ÆØÅ
λ> Data.ByteString.unpack $ fromUnicode conv $ Data.Text.pack "ÆØÅ"
[198,216,197]

However, as you pointed out, in the specific case of latin-1 the byte values coincide with Unicode code points, so you can use pack/unpack from Data.ByteString.Char8 to perform the trivial mapping between latin-1 bytes and String, which you can then convert to or from Text using the corresponding pack/unpack from Data.Text.
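
A minimal sketch of that pure route, going through String as an intermediate. The function names `latin1ToText` and `textToLatin1` are invented for this example. Note that `pack` from Data.ByteString.Char8 truncates characters above U+00FF to 8 bits, so the encoding direction below checks the range first and returns `Nothing` rather than silently corrupting the text.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as BC
import qualified Data.Text as T

-- | Decode latin-1 bytes: each byte becomes the Char with the same
-- code point (hypothetical helper name).
latin1ToText :: B.ByteString -> T.Text
latin1ToText = T.pack . BC.unpack

-- | Encode to latin-1. BC.pack would truncate any Char above U+00FF,
-- so out-of-range input yields Nothing instead.
textToLatin1 :: T.Text -> Maybe B.ByteString
textToLatin1 t
  | T.all (<= '\xFF') t = Just (BC.pack (T.unpack t))
  | otherwise           = Nothing
```

Going through String is simple but not fast, since every byte is turned into a list cell on the way; for large inputs the text-icu converter shown above is likely the better choice.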

hammar
  • Not being satisfied with the current options for converting from `ByteString` to `Text`, I finally coded up a direct conversion which performs near-optimally and doesn't expose the `IO` monad in its API; see https://github.com/bos/text/pull/18 – hvr Mar 03 '12 at 08:54