I have a confusing question and would like a really basic answer about how it all works. My problem is that when I count the length of symbols and emojis like ‍❤️‍‍ in JavaScript and PHP, I get 11 characters instead of what I think is its 'true length' of 1.

I would like PHP and JavaScript code that simply counts the 'true length' a human would see, for EVERY character readable by a computer (if that makes sense), so all UTF-8 symbols/characters and emojis.

I've tried using `strlen`, but I think that only counts bytes, not characters. I've also tried `mb_strlen`, but that doesn't give the true length for emojis.
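
To make the mismatch concrete, here is a minimal JavaScript sketch of the three counts involved. The helper names are made up for illustration, and the sample emoji (woman, heart, kiss mark, woman, joined by zero-width joiners) is an assumed stand-in written with escapes so it survives copy and paste.

```javascript
// Assumed sample: a "kiss" emoji built from 8 code points joined by ZWJ (U+200D).
const kiss = "\u{1F469}\u200D\u2764\uFE0F\u200D\u{1F48B}\u200D\u{1F469}";

// Hypothetical helpers, one per way of "counting characters".
function utf8Bytes(s) {
  // Number of bytes once the string is encoded as UTF-8.
  return new TextEncoder().encode(s).length;
}

function utf16Units(s) {
  // What JavaScript's length property reports: 16-bit code units.
  return s.length;
}

function codePoints(s) {
  // String iteration walks code points, so surrogate pairs count once.
  return [...s].length;
}

for (const [label, s] of [["e-acute", "\u00E9"], ["alef", "\u05D0"], ["kiss", kiss]]) {
  console.log(label, utf8Bytes(s), utf16Units(s), codePoints(s));
}
// e-acute 2 1 1
// alef 2 1 1
// kiss 27 11 8
```

As far as I can tell, PHP's `strlen` corresponds to the byte count and `mb_strlen` (with UTF-8) to the code point count, so none of the three gives the 1 that a human sees.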

Thank you. I would also appreciate a simple explanation of how this encoding/Unicode system works for characters of different lengths, taking into account characters from other languages, e.g. French/Hebrew.

Cheers!

Lol Boi
  • There are many tutorials on Unicode and how the UTF-8, UTF-16, and UTF-32 representations work. It's complicated in no small part by the subjective nature of the word "character". – Pointy Nov 12 '18 at 21:47
  • I'm afraid there isn't a really basic answer to how it all works, you'll have to read up quite a bit. Besides the UTF representations, which define how many bytes you need for a certain codepoint (character), there are complicating factors like combining characters, where you can either use a single "á" character or base character plus accent, both showing as a single letter to the human reader. It's basically the same story for your emoji example, where multiple individual emojis, modifiers and combining marks are displayed as a single graphic. – lenz Nov 12 '18 at 22:35
  • ASCII, UCS-2, UTF-8, UTF-16 and UTF-32 are examples of _encoding_ methods for characters. Of these, ASCII and UCS-2 cannot encode all Unicode characters, UTF-8 encodes characters using a variable number of 8-bit values, and UTF-16 encodes characters using one or two 16-bit values. JavaScript stores character strings in memory using UTF-16 encoding, but the string `length` property and the `charAt` method count 16-bit values as if UCS-2 encoding were used. Then there are character accents **and** emoji modifiers that are Unicode "characters" in themselves. Happy googling! – traktor Nov 12 '18 at 23:39
  • For Javascript, see https://stackoverflow.com/a/51422499/46395. PHP's model is broken and AFAICT there is no library for papering over its deficiencies. – daxim Nov 14 '18 at 16:51
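
The comments above describe combining marks and joined emoji that display as a single glyph. A hedged sketch of counting those user-perceived characters (grapheme clusters) in JavaScript follows. It assumes a runtime that ships `Intl.Segmenter`, and `graphemeCount` is a hypothetical helper name. The result depends on the Unicode data bundled with the runtime, so treat it as an approximation of what a human sees rather than a guarantee.

```javascript
// Hypothetical helper: count grapheme clusters with the built-in Intl.Segmenter.
function graphemeCount(s) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  let count = 0;
  for (const _segment of segmenter.segment(s)) {
    count += 1; // one iteration per user-perceived character
  }
  return count;
}

// Base letter plus a combining acute accent: two code points, one visible letter.
const accented = "e\u0301";

// Assumed sample emoji: woman, ZWJ, heart, variation selector, ZWJ, kiss mark, ZWJ, woman.
const joinedEmoji = "\u{1F469}\u200D\u2764\uFE0F\u200D\u{1F48B}\u200D\u{1F469}";

console.log(accented.length, graphemeCount(accented));
console.log(joinedEmoji.length, graphemeCount(joinedEmoji));
```

On the PHP side, the intl extension provides `grapheme_strlen`, which I understand performs a comparable count; how well it handles joined emoji likely depends on the ICU version PHP was built against.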

0 Answers