The String.length/1 function returns the number of graphemes in a UTF-8 binary.

If I want to know how many Unicode codepoints are in the string, I know I can do:

string |> String.codepoints |> length

But this builds an unnecessary intermediate list of all the codepoints and traverses the data twice: once to build the list and once to measure its length. Is there a way to count the codepoints directly, without producing the intermediate list?

Adam Millerchip

1 Answer

You can use a comprehension with a bitstring generator and the reduce option to count the codepoints without building up the intermediate list.

for <<_::utf8 <- string>>, reduce: 0, do: (count -> count + 1)

Example:

iex> string = "🤦🏼‍♂️"
iex> for <<_::utf8 <- string>>, reduce: 0, do: (count -> count + 1)
5
iex> string |> String.codepoints |> length
5
iex> String.length(string)
1

This has the added bonus that it also works with UTF-16 and UTF-32 strings, if you replace utf8 with utf16 or utf32:

iex> utf8_string = "I'm going to be UTF-16!"
"I'm going to be UTF-16!"
iex> utf16_string = :unicode.characters_to_binary(utf8_string, :utf8, :utf16)
<<0, 73, 0, 39, 0, 109, 0, 32, 0, 103, 0, 111, 0, 105, 0, 110, 0, 103, 0, 32, 0,
  116, 0, 111, 0, 32, 0, 98, 0, 101, 0, 32, 0, 85, 0, 84, 0, 70, 0, 45, 0, 49,
  0, 54, 0, 33>>
iex> for <<_::utf8 <- utf8_string>>, reduce: 0, do: (count -> count + 1)
23
iex> for <<_::utf16 <- utf16_string>>, reduce: 0, do: (count -> count + 1)
23
Adam Millerchip
  • I ran a benchmark between the two, and using your comprehension is 85% faster than calling `Kernel.length/1` – vinibrsl Jun 20 '21 at 21:48
  • @vinibrsl good idea. But I couldn't reproduce your results. How did you benchmark? I just ran one, and in my results `to_charlist` was actually faster and used less memory than the comprehension, but there wasn't much in it. Perhaps I need to update this answer... https://gist.github.com/adamu/911b3754736f1c584e45222ba9a4c107 – Adam Millerchip Jun 21 '21 at 03:59
  • https://gist.github.com/vinibrsl/707a4a64737fab8b6a19f4620d8fd47a – vinibrsl Jun 21 '21 at 13:06
  • Ah, longer string. Results in my script for that string: `charlist 107.57 μs, comprehension 256.59 μs - 2.39x slower, codepoints 725.87 μs - 6.75x slower`. So it seems that while the comprehension is faster than `codepoints`, `to_charlist` is even faster (and uses less memory). Not sure why? I'll revisit this when I have time. – Adam Millerchip Jun 21 '21 at 13:19
  • @vinibrsl actually even for your script, I get `Codepoints took 0.6852230000000012 seconds Comprehension took 0.24642800000000145 seconds`. So something's up with your setup, maybe? – Adam Millerchip Jun 21 '21 at 13:21
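
For anyone wanting to reproduce the comparison discussed in these comments, a rough sketch along the following lines may help. It is not taken from either gist: the module name `CountBench`, its functions and the test string are all hypothetical. `:timer.tc/1` gives only a crude single-run timing, so a proper benchmarking library would be more reliable. No timings are shown here because they depend on the machine and the Elixir/OTP version.

```elixir
defmodule CountBench do
  # Hypothetical module for illustration; not the code from the gists.

  # Builds a list of single-codepoint strings, then measures its length.
  def via_codepoints(string), do: string |> String.codepoints() |> length()

  # Builds a list of integer codepoints, then measures its length.
  def via_charlist(string), do: string |> String.to_charlist() |> length()

  # Counts codepoints with the comprehension, without building a list.
  def via_comprehension(string) do
    for <<_::utf8 <- string>>, reduce: 0 do
      acc -> acc + 1
    end
  end

  # Returns {microseconds, result} for a single run of the zero-arity `fun`.
  def time(fun), do: :timer.tc(fun)
end

# An arbitrary test string with some non-ASCII codepoints.
string = String.duplicate("h\u00E9llo w\u00F6rld ", 10_000)

for {name, fun} <- [
      codepoints: &CountBench.via_codepoints/1,
      charlist: &CountBench.via_charlist/1,
      comprehension: &CountBench.via_comprehension/1
    ] do
  {micros, count} = CountBench.time(fn -> fun.(string) end)
  IO.puts("#{name}: #{count} codepoints in #{micros} microseconds")
end
```

Note that `String.to_charlist/1` still builds an intermediate list, so it does not avoid the allocation the question asks about, even where it turns out to be faster.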