How to truly count UTF-8 characters, including emojis and special characters with different character lengths?

Question

I have a confusing problem and would like a basic explanation of how it all works. When I count character lengths in JavaScript and PHP for symbols and emojis like ‍❤️‍‍, the result is 11 characters instead of what I think is 1 in its 'true length'.

I would like the code for PHP and JavaScript to simply count the 'true length' a human would see for EVERY character readable by a computer (if that makes sense), so all UTF-8 symbols/characters and emojis.

I've tried using `strlen`, but I think that only counts bytes, not characters. I've also tried `mb_strlen`, but that doesn't give the true length for emojis.

Thank you. I would also appreciate a simple explanation of how this encoding/Unicode system works for characters of different lengths, taking into account characters from other languages, e.g. French/Hebrew.

Cheers!

There are many tutorials on Unicode and how the UTF-8, UTF-16, and UTF-32 representations work. It's complicated in no small part by the subjective nature of the word "character". — Pointy, Nov 12 '18 at 21:47
I'm afraid there isn't a really basic answer to how it all works, you'll have to read up quite a bit. Besides the UTF representations, which define how many bytes you need for a certain codepoint (character), there are complicating factors like combining characters, where you can either use a single "á" character or base character plus accent, both showing as a single letter to the human reader. It's basically the same story for your emoji example, where multiple individual emojis, modifiers and combining marks are displayed as a single graphic. — lenz, Nov 12 '18 at 22:35
ASCII, UCS-2, UTF-8, UTF-16 and UTF-32 are examples of _encoding_ methods for characters. Of these, ASCII and UCS-2 cannot encode all Unicode characters, UTF-8 encodes characters using a variable number of 8-bit values, and UTF-16 encodes characters using one or two 16-bit values. JavaScript stores character strings using UTF-16 encoding in memory, but the string `length` property and the `charAt` method count 16-bit values as if UCS-2 encoding were used. Then there are character accents **and** emoji modifiers that are Unicode "characters" in themselves. Happy googling! — traktor, Nov 12 '18 at 23:39
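
The distinction drawn in the comment above can be made concrete with a small sketch. The helper name `measure` and the sample strings are illustrative assumptions, not part of the original thread. It reports three different "lengths" for the same string: UTF-16 code units (what `length` returns), Unicode code points, and UTF-8 bytes (roughly what PHP's `strlen` would report for UTF-8 input).

```javascript
// Hypothetical helper: three ways of measuring the same string.
function measure(text) {
  return {
    // 16-bit code units, which is what the `length` property counts
    utf16Units: text.length,
    // the string iterator walks whole code points, so surrogate pairs count once
    codePoints: Array.from(text).length,
    // number of bytes after encoding the string as UTF-8
    utf8Bytes: new TextEncoder().encode(text).length,
  };
}

// "e" followed by COMBINING ACUTE ACCENT: displayed as one letter
console.log(measure("e\u0301"));
// → { utf16Units: 2, codePoints: 2, utf8Bytes: 3 }

// Four Hebrew letters, each a single code point in the Basic Multilingual Plane
console.log(measure("\u05E9\u05DC\u05D5\u05DD"));
// → { utf16Units: 4, codePoints: 4, utf8Bytes: 8 }

// A single emoji outside the Basic Multilingual Plane (a surrogate pair in UTF-16)
console.log(measure("\u{1F600}"));
// → { utf16Units: 2, codePoints: 1, utf8Bytes: 4 }
```

None of these three numbers is the "true length" a reader sees: the accented letter is still two code points, because the accent is a code point of its own.
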
For JavaScript, see https://stackoverflow.com/a/51422499/46395. PHP's model is broken and AFAICT there is no library for papering over its deficiencies. — daxim, Nov 14 '18 at 16:51
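
What the question calls the "true length" corresponds most closely to counting grapheme clusters, the units a reader perceives as single characters. A minimal sketch for JavaScript follows, assuming a runtime that ships `Intl.Segmenter` (recent browsers and Node.js versions do). The function name `countGraphemes` and the sample emoji are assumptions for illustration: the emoji in the question did not survive intact, so an 11-code-unit zero-width-joiner sequence stands in for it.

```javascript
// Hypothetical helper: count user-perceived characters (grapheme clusters).
function countGraphemes(text, locale = "en") {
  const segmenter = new Intl.Segmenter(locale, { granularity: "grapheme" });
  let count = 0;
  // segment() returns an iterable with one entry per grapheme cluster
  for (const _segment of segmenter.segment(text)) {
    count += 1;
  }
  return count;
}

// Assumed stand-in for the question's emoji: woman + ZWJ + heart + VS16 + ZWJ + kiss mark + ZWJ + woman
const kiss = "\u{1F469}\u200D\u2764\uFE0F\u200D\u{1F48B}\u200D\u{1F469}";

console.log(kiss.length);                // 11 UTF-16 code units
console.log(countGraphemes(kiss));       // grapheme clusters as seen by the runtime's Unicode data
console.log(countGraphemes("e\u0301"));  // base letter plus combining accent
```

The result depends on the Unicode data bundled with the runtime, so very new emoji sequences may be split on older engines. On the PHP side, the intl extension provides `grapheme_strlen`, which aims at the same kind of count and carries the same caveat about the underlying ICU version.
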
