21

So I'm trying to convert a binary to a string. This code:

t = [{<<71,0,69,0,84,0>>}]
String.from_char_list(t)

But I'm getting this when I try this conversion:

** (ArgumentError) argument error
    (stdlib) :unicode.characters_to_binary([{<<70, 0, 73, 0, 78, 0>>}])
    (elixir) lib/string.ex:1161: String.from_char_list/1

I'm assuming the <<70, 0, etc. is likely a list of graphemes (it's the return from an API call and the API is not quite documented) but do I need to specify the encoding somehow?

I know I'm likely missing something obvious (maybe that's not the right function to use?) but I can't seem to figure out what to do here.


EDIT:

For what it's worth, the binary above is the return value of an Erlang ODBC call. After a little more digging I found that the binary in question is actually a "Unicode binary encoded as UTF16 little endian" (see here: http://www.erlang.org/doc/apps/odbc/odbc.pdf pg. 9 re: SQL_WVARCHAR). That doesn't really change the issue, but it does add some context.

Onorio Catenacci

7 Answers

28

There's a couple of things here:

1.) You have a list with a tuple containing one element, a binary. You can probably just extract the binary and have your string. Passing the current data structure to to_string is not going to work.

2.) The binary you used in your example contains 0, an unprintable character. In the shell, this will not be printed properly as a string, due to the fact that Elixir can't tell the difference between just a binary, and a binary representing a string, when the binary representing a string contains unprintable characters.

3.) You can use pattern matching to convert a binary to a particular type. For instance:

iex> raw = <<71,32,69,32,84,32>>
"G E T "
iex> Enum.join(for <<c::utf8 <- raw>>, do: <<c::utf8>>)
"G E T "
iex> <<c::utf8, _::binary>> = raw
"G E T "
iex> <<c::utf8>>
"G"

Also, if you are getting binary data from a network connection, you probably want to use :erlang.iolist_to_binary, since the data will be an iolist, not a charlist. The difference is that iolists can contain binaries and nested lists as well as integers, while charlists are always a flat list of integers. If you call to_string on an iolist, it can fail, for example when one of its binaries is not valid UTF-8.
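To make the iolist point concrete, here is a minimal sketch. The `iolist` value is made up for illustration (it is not from the question), and the trailing byte 255 is included on purpose because it is not valid UTF-8.

```elixir
# Hypothetical chunks, roughly as they might arrive from a socket.
iolist = ["GE", [?T, " /"], <<32, 255>>]

# Flattens nested lists, binaries and byte-sized integers into one binary.
binary = :erlang.iolist_to_binary(iolist)
IO.inspect(binary)

# to_string/1 treats the list as chardata: binaries must be valid UTF-8,
# so the <<255>> byte makes the call raise.
result =
  try do
    to_string(iolist)
  rescue
    e -> {:error, e.__struct__}
  end

IO.inspect(result)

# Integers also differ: a byte for iolists, a code point for chardata.
IO.inspect(:erlang.iolist_to_binary([200]))
IO.inspect(to_string([200]))
```

The last two lines show why the choice matters even when nothing raises: the same list yields one byte in the first case and a two-byte UTF-8 sequence in the second.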

sebisnow
bitwalker
  • I thought that the list containing tuple was an issue but I wanted to give the code exactly as it comes back from the API. I'm guessing the result is in a DBCS. Need to dig into it a bit further. – Onorio Catenacci Mar 19 '14 at 23:49
  • Partially I was hoping there might be something already built into the library that I was missing. Good to know I didn't miss anything. – Onorio Catenacci Mar 20 '14 at 12:17
  • Yes it does seem like each character is being stored in two bytes, good call! The distinction between binaries and strings and charlists and iolists is confusing at times, but I think there are some changes coming down the pipe that should make it more obvious when to use each one. – bitwalker Mar 20 '14 at 16:50
  • Yep, it's actually a little endian UTF16 binary. See my edit to my question. – Onorio Catenacci Mar 20 '14 at 17:22
  • When I tried <<c::utf8>> = <<71,32,69,32,84,32>> in the Elixir console, I got ** (MatchError) no match of right hand side value: "G E T ". Can anyone explain why? – Kshitij Mittal Nov 02 '15 at 10:17
  • Yes, because `<<c::utf8>>` only matches a single UTF-8 character, but on the right-hand side you have a string with several such characters. To do a partial match you'd need to do `<<c::utf8, _::binary>>`. To build a string of UTF-8 characters from an arbitrary binary, you have to match character by character. Looks like I typo'd in my example, I'll fix it. – bitwalker Nov 02 '15 at 20:07
7

I made a function to convert binary to string

def raw_binary_to_string(raw) do
  # Keep valid UTF-8 code points as they are; treat any other byte as a
  # single-byte code point and re-encode it as UTF-8.
  raw
  |> String.codepoints()
  |> Enum.reduce("", fn w, result ->
    if String.valid?(w) do
      result <> w
    else
      <<parsed::8>> = w
      result <> <<parsed::utf8>>
    end
  end)
end

Executed on iex console

iex(6)> raw = <<65, 241, 111, 32, 100, 101, 32, 70, 97, 99, 116, 117, 114, 97, 99, 105, 111, 110, 32, 65, 99, 116, 117, 97, 108>>
iex(7)> raw_binary_to_string(raw)
"Año de Facturacion Actual"
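If the input is known to be Latin-1 (ISO-8859-1) throughout, which is an assumption about the data and not something the answer states, Erlang's `:unicode` module can do the same conversion in one call. Unlike the function above, this sketch reinterprets every byte as Latin-1, including bytes that happen to form valid UTF-8 sequences.

```elixir
# A shortened version of the binary from the example above.
raw = <<65, 241, 111, 32, 100, 101>>

# Read each byte as a Latin-1 code point and encode the result as UTF-8.
utf8 = :unicode.characters_to_binary(raw, :latin1)
IO.puts(utf8)
```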
Gus
Andres Garcia
5

Not sure if OP has since solved his problem, but in relation to his remark about his binary being utf16-le: for specifically that encoding, I found that the quickest (and to those more experienced with Elixir, probably-hacky) way was to use Enum.reduce:

# coercing it into utf8 gives us ["D", <<0>>, "e", <<0>>, "v", <<0>>, "a", <<0>>, "s", <<0>>, "t", <<0>>, "a", <<0>>, "t", <<0>>, "o", <<0>>, "r", <<0>>]
<<68, 0, 101, 0, 118, 0, 97, 0, 115, 0, 116, 0, 97, 0, 116, 0, 111, 0, 114, 0>>  
|> String.codepoints()
|> Enum.reduce("", fn(codepoint, result) ->
                     << parsed :: 8>> = codepoint
                     if parsed == 0, do: result, else: result <> <<parsed>>
                   end)

# "Devastator"
|> IO.puts()

Assumptions:

  • utf16-le encoding

  • the codepoints are backwards-compatible with utf8 i.e. they use only 1 byte

Since I'm still learning Elixir, it took me a while to get to this solution. I looked into other libraries people made, even using something like iconv at a bash level.

sebisnow
user701847
  • "backwards-compatible with utf8" is misleading. "Representable with ASCII" might be more accurate. [UTF-8](https://en.wikipedia.org/wiki/UTF-8) is backwards-compatible with ASCII (UTF-16 is not), but it also has 2-, 3-, or 4-byte characters. UTF-16 uses at least 2 bytes per character (4 for characters outside the Basic Multilingual Plane), which is often wasteful. Anyway, I'm an Elixir newb and yeah this is a hack. – Cheezmeister Sep 29 '16 at 04:30
4

Ecto.UUID.load/1 will convert a binary to string and return a tuple:

binary = Ecto.UUID.bingenerate()
<<99, 148, 189, 126, 144, 154, 71, 236, 160, 110, 149, 143, 67, 162, 177, 192>>

Ecto.UUID.load(binary)
{:ok, "6394bd7e-909a-47ec-a06e-958f43a2b1c0"}

credit: https://stackoverflow.com/a/43530427/2091331

Jbur43
  • `convert a binary to string and return a tuple`. No it won't. It will convert a hex encoded uuid into a UUID string, but it will not convert any given binary to a string. Even if it did, it also inserts hyphens. It's only to be used for UUIDs. – Peter R Jun 26 '22 at 05:39
3

The last point definitely does change the issue, and explains it. Elixir uses binaries as strings but assumes and demands that they are UTF8 encoded, not UTF16.
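A small sketch of what this means for the value in the question. It uses `:unicode.characters_to_binary/2` from Erlang's standard library, with the input encoding given explicitly; the variable names are only for illustration.

```elixir
# The value from the question: a list holding a one-element tuple
# that holds the binary.
t = [{<<71, 0, 69, 0, 84, 0>>}]
[{utf16}] = t

# Read as UTF-8, every NUL byte is a code point of its own.
IO.inspect(String.codepoints(utf16))

# Name the input encoding; the output encoding defaults to UTF-8.
string = :unicode.characters_to_binary(utf16, {:utf16, :little})
IO.inspect(string)
```

After the conversion the value is an ordinary Elixir string, so the `String` functions behave as expected on it.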

rvirding
2

In reference to http://erlang.org/pipermail/erlang-questions/2010-December/054885.html

You can use :unicode.characters_to_list(binary_string, {:utf16, :little}) to check the result, and to store it as well.

iex session:

iex(1)> y                                                
<<115, 0, 121, 0, 115, 0>>
iex(2)> :unicode.characters_to_list(y, {:utf16, :little})
'sys'

Note: the value is printed as 'sys' for <<115, 0, 121, 0, 115, 0>>.
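Since `:unicode.characters_to_list/2` returns a charlist, one more step is needed to get an Elixir string. The sketch below also handles the error tuples the function can return for bad input; raising `ArgumentError` is just one possible choice, not something the answer prescribes.

```elixir
y = <<115, 0, 121, 0, 115, 0>>

string =
  case :unicode.characters_to_list(y, {:utf16, :little}) do
    # Success: a list of code points, which List.to_string/1 encodes as UTF-8.
    chars when is_list(chars) -> List.to_string(chars)
    # The input contained bytes that are not valid UTF-16.
    {:error, _converted, _rest} -> raise ArgumentError, "invalid UTF-16 data"
    # The input ended in the middle of a character.
    {:incomplete, _converted, _rest} -> raise ArgumentError, "truncated UTF-16 data"
  end

IO.inspect(string)
```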

1

You can use Comprehensions

    defmodule TestModule do
      def convert(binary) do
        for c <- binary, into: "", do: <<c>>
      end
    end
    TestModule.convert([71,32,69,32,84,32]) |> IO.puts
  • If I pass the original argument `<<71,0,69,0,84,0>>` in this case, this doesn't work. I get an argument error. If I have to take the additional step of converting the binary to a list, you probably want to specify that in your answer as well. – Onorio Catenacci Jan 20 '20 at 14:27
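For the binary from the question, a comprehension needs a bitstring generator rather than a list generator. The following is a minimal sketch, assuming the input really is UTF-16 little endian; `Utf16Comprehension` is a made-up module name. Input that does not decode as UTF-16 is not reported as an error here, so this suits trusted data better than untrusted data.

```elixir
defmodule Utf16Comprehension do
  # Hypothetical helper, for illustration only.
  def convert(binary) when is_binary(binary) do
    # Decode one UTF-16LE code point at a time and re-encode it as UTF-8,
    # collecting the pieces into a single binary.
    for <<c::utf16-little <- binary>>, into: "", do: <<c::utf8>>
  end
end

Utf16Comprehension.convert(<<71, 0, 69, 0, 84, 0>>) |> IO.puts()
```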