What is a safe length of JavaScript strings?

Question

Considering charAt(), charCodeAt(), and codePointAt(), I find a discrepancy in what the index parameter means. Before I really thought about it, I assumed it would always be safe to access the character at length-1. But I read that the difference between charCodeAt() and codePointAt() is that charCodeAt() refers to 16-bit units (byte pairs), so besides reading i you would also need i+1 if they formed a surrogate pair (as is the methodology with UTF-16), whereas codePointAt() needs a parameter that references the UTF-8 character position (zero based). So now I'm in a quandary as to whether length counts the number of characters or the number of byte pairs, UTF-16 style. I believe JavaScript holds strings as UTF-16, but using length-1 from that with codePointAt() on a string that had lots of 4-byte characters would be off the end of the string!!

Strings can be any length, so long as there is enough memory. — StackSlave, Mar 10 '17 at 02:37

Bergi · Accepted Answer · 2017-03-10T02:42:04.490

The length of a string is counted in 16-bit unsigned integer values (“elements”), also known as code units, which together form a valid or invalid UTF-16 code unit sequence. String indices are counted the same way. We might also call them "characters".

It doesn't matter whether you access them as properties or via charAt, charCodeAt and codePointAt: length - 1 will always be a valid index. A code point might, however, be encoded as a surrogate pair spanning two indices. There is no built-in method to measure the number of code points, but the default string iterator yields them, so you can count them using a for … of loop.
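
A minimal sketch of that counting approach, assuming an ES6 environment. The helper name `countCodePoints` and the sample string are made up for illustration and are not part of any built-in API:

```javascript
// Hypothetical helper: count code points using the default string iterator,
// which yields one whole code point (1 or 2 code units) per step.
function countCodePoints(str) {
  let count = 0;
  for (const _ of str) {
    count++;
  }
  return count;
}

// 'a', then U+1F600 (outside the BMP, stored as a surrogate pair), then 'b'
const sample = "a\u{1F600}b";

console.log(sample.length);           // 4 code units
console.log(countCodePoints(sample)); // 3 code points

// length - 1 is still a valid index for every accessor:
console.log(sample.charAt(sample.length - 1));      // "b"
console.log(sample.codePointAt(sample.length - 1)); // 98
```

The two counts differ only when the string contains characters outside the Basic Multilingual Plane.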

edited Mar 10 '17 at 02:42

answered Mar 10 '17 at 02:30

Bergi

Please propose a solution for the OP's question "what is the safe length". – Tatsuyuki Ishi Mar 10 '17 at 02:32
Thanks for your answer. Very disappointed though that JavaScript is so useless that it can provide an illegal return value if you happen to give it an index for the second of surrogate pairs. – Clive Mar 10 '17 at 02:52
@Clive What do you mean by "illegal"? It's just the code unit at that index, irregardless of what bytes might be in front of it. But yes, JavaScript strings are immutable `Uint16Array`s instead of Unicode character lists. – Bergi Mar 10 '17 at 03:04
@Bergi I called it illegal because of the substring 'char' in all 3 of these functions' names. By their names they purport to give the code of the 'character', not the code of the _16-bit unsigned integer values_. – Clive Mar 10 '17 at 03:11
@Bergi. "irregardless" - wow, is that a Bushism? (like "misunderestimated") hehe – Clive Mar 10 '17 at 03:15
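
To illustrate what the comments above describe, here is a rough sketch of what happens when an index lands on the second half of a surrogate pair, and one way to step back to the start of the code point. The helper `codePointStart` is hypothetical, not a built-in:

```javascript
// Hypothetical helper: return the index at which the code point covering
// `index` begins. If `index` points at a low (trailing) surrogate that is
// preceded by a high (leading) surrogate, step back by one code unit.
function codePointStart(str, index) {
  const unit = str.charCodeAt(index);
  const isLow = unit >= 0xdc00 && unit <= 0xdfff;
  if (isLow && index > 0) {
    const prev = str.charCodeAt(index - 1);
    if (prev >= 0xd800 && prev <= 0xdbff) {
      return index - 1;
    }
  }
  return index;
}

const s = "a\u{1F600}"; // indices: 0 = 'a', 1 = high surrogate, 2 = low surrogate

console.log(s.codePointAt(2).toString(16));                    // "de00", the lone low surrogate
console.log(s.codePointAt(codePointStart(s, 2)).toString(16)); // "1f600", the full code point
```

Nothing is read past the end of the string in either call; index 2 is simply the code unit stored there.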

score 2 · Answer 2 · answered Mar 10 '17 at 02:29

Use [...str].length to count the characters (code points).

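var mb = "\u{1F600}"; // example character outside the BMP (U+1F600); any astral character behaves the same
console.log(mb.length); // 2: length in UTF-16 code units
console.log([...mb].length); // 1: "real" length in code points (ES6)
console.log(mb.charAt(0)); // the first code unit (a lone high surrogate)
console.log(mb.codePointAt(0)); // the full code point, both code units combined (ES6)
console.log(mb.codePointAt(1)); // the second code unit, the low surrogate (ES6)
console.log(mb.charCodeAt(0)); // the first code unit
console.log(mb.charCodeAt(1)); // the second code unit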

Tatsuyuki Ishi

I'm presuming the character you assigned to mb is outside the Basic Multilingual Plane. Thank you for your answer and the included source. But I am disillusioned about JavaScript's `length` property; it seems not to state how many characters there are at all. I didn't know about the ellipsis. – Clive Mar 10 '17 at 02:48
I recommend using `Array.from(…)` for casting iterables to arrays; spread syntax should only be used as part of literals. – Bergi Mar 10 '17 at 03:09
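
A short sketch of the `Array.from` variant suggested in the comment above, assuming ES6. The variable names and sample string are illustrative only:

```javascript
// 'x', then U+1F600 (stored as a surrogate pair), then 'y'
const text = "x\u{1F600}y";

// Array.from consumes the string iterator, so each element is one code point.
const codePoints = Array.from(text);
console.log(text.length);       // 4 code units
console.log(codePoints.length); // 3 code points

// The optional map function converts each element as it is collected.
const values = Array.from(text, (ch) => ch.codePointAt(0));
console.log(values); // [ 120, 128512, 121 ]
```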
