
In JavaScript, I need to convert the length of a string to a 2-byte array (a 16-bit signed integer), equivalent to C# `BitConverter.GetBytes(short value)`.

Example: 295 -> [1,39].

  • What does `[1,39]` mean? It's not even UTF-8. I assume you mean it is UTF-16BE. And a JavaScript string, even of length 1, may not necessarily contain one complete Unicode character. If it contains an unpaired surrogate, it is not a character but just a 16-bit code unit, which is not convertible to UTF-16BE (it would first need to be paired with a second surrogate to become 4 bytes in UTF-16BE, and 4 different bytes in UTF-8). In JavaScript, strings are not restricted to valid UTF-16; they are arbitrary vectors of 16-bit code units, not always convertible to any UTF without exceptions or replacements. – verdy_p Sep 11 '19 at 08:40
  • Possible duplicate of [JavaScript simple BitConverter](https://stackoverflow.com/questions/49951290/javascript-simple-bitconverter) – aloisdg Sep 11 '19 at 08:46
  • `[ value >> 8 & 0xFF, value & 0xFF ];` returns the two low bytes of a number, high byte first (sketched in runnable form after these comments). – Thomas Sep 11 '19 at 09:04
  • @Thomas you are correct; however, as there are library methods to do this, and a `Buffer` strictly is a byte array (which is what he asked for), it is better to use them. – Euan Smith Sep 11 '19 at 09:07
  • @EuanSmith your answer doesn't give the result I was looking for: `const buf = Buffer.alloc(2); buf.writeUInt16BE(295, 0); console.log(buf);` doesn't print `[1,39]`, while `(295 >> 8 & 0xFF, 295 & 0xFF)` -> `[1,39]`. I think Thomas's answer is what I was looking for. – Ciprian Stanciu Sep 11 '19 at 09:13
  • @CiprianStanciu the `27` is in hex: `0x27` = 39 in decimal. – Euan Smith Sep 11 '19 at 09:16
  • @CiprianStanciu try `console.log(Array.from(buf))`, which will show it in decimal. – Euan Smith Sep 11 '19 at 09:17
  • @EuanSmith It's not that simple when switching between languages that are so different. Although a `Buffer` is technically a byte array, it is not an Array, and JS doesn't know anything about bytes, except for the wrapper types that under the hood *represent* bytes: `Buffer` in Node, `ArrayBuffer` in the browser, and `Blob`. I commented a simple approach that may help the OP achieve whatever he's trying to do, even if it doesn't answer his exact question. – Thomas Sep 11 '19 at 09:18
  • Yep, it's clear to me now; you were both right. :) Thank you very much, guys! – Ciprian Stanciu Sep 11 '19 at 09:19
  • @Thomas you may be right. It does depend on what is being done with the data afterwards. – Euan Smith Sep 11 '19 at 09:20
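To make the approaches in these comments concrete, here is a minimal sketch combining Thomas's shift-and-mask expression with a `DataView` alternative for browser code, where Node's `Buffer` is unavailable (the function name `toBytesBE` is just illustrative):

```javascript
// Shift-and-mask: extract the two low bytes of a number, high byte first.
function toBytesBE(value) {
  return [(value >> 8) & 0xFF, value & 0xFF];
}
console.log(toBytesBE(295)); // [ 1, 39 ]

// Browser alternative: write a signed 16-bit value through a DataView.
const view = new DataView(new ArrayBuffer(2));
view.setInt16(0, 295); // big-endian unless the littleEndian flag is passed
console.log([view.getUint8(0), view.getUint8(1)]); // [ 1, 39 ]
```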

2 Answers


As you are using Node, a `Buffer` is what you need; see the Node.js documentation for `Buffer`. For example:

```javascript
// Make a string of length 295 (5 characters * 59 repeats)
const st = "-foo-".repeat(59);
// Create a byte array of length 2
const buf = Buffer.alloc(2);
// Write the string length as a 16-bit big-endian number into the byte array
buf.writeUInt16BE(st.length, 0);
console.log(buf);
// <Buffer 01 27>, i.e. [1, 39] in decimal
```

Be aware that this will give you the string length in characters (strictly, UTF-16 code units), not the byte length of the string; the two may coincide but that is not guaranteed.
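If you need a plain array of numbers rather than a `Buffer` (see the comments below), the buffer converts directly, since a `Buffer` iterates over its bytes; a minimal sketch:

```javascript
const buf = Buffer.alloc(2);
buf.writeUInt16BE(295, 0);
// Array.from iterates the Buffer's bytes and yields plain numbers.
console.log(Array.from(buf)); // [ 1, 39 ]
```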

Euan Smith
  • Note that if the string contains a lone surrogate, the resulting 2 bytes would still be invalid UTF-16BE, unless the surrogate was first replaced by some replacement character in the BMP like U+FFFD or '?'. So don't assume that the result of `buf.writeUInt16BE()` is valid UTF-16BE. It's just a binary array, not necessarily text in any valid Unicode form, but usable to recreate a JavaScript string without conversion loss. Likewise, don't use the result as UTF-8: it will be garbled, and even where it happens to be valid UTF-8 it may be invalid HTML/XML because of null bytes. – verdy_p Sep 11 '19 at 08:50
  • @verdy_p His initial question was just to, literally, convert a string length to a 2-byte representation, not the string itself. So while you are correct, this is not what he asked. I added the part about UTF-8 etc. basically because this is something he might be doing given his question; however, strictly speaking the question was just about converting a UInt16 to an array of bytes, NOT characters, which is what a `Buffer` is. – Euan Smith Sep 11 '19 at 08:55
  • Note that `Buffer.from("\u0B95")` returns a 3-byte buffer, because it assumes a default encoding of UTF-8, but `writeUInt16BE()` does NOT generate UTF-8 output (the difference is demonstrated after these comments). The correct buffer length for `writeUInt16BE()` is the JavaScript string length multiplied by 2 (independently of what the string contains). But the result is never guaranteed to be a valid Unicode encoding (no Unicode validation is performed); it's just an alternate binary representation of the JavaScript string content, using ordered 8-bit bytes instead of 16-bit code units. – verdy_p Sep 11 '19 at 08:55
  • @verdy_p The questioner has not asked for UTF-8 output, just for a byte array. – Euan Smith Sep 11 '19 at 08:56
  • And so `Buffer.from(string)` may not allocate anything and could raise an error if the string is not valid UTF-16 (i.e. if it contains unpaired surrogates), which is not convertible to UTF-8 and thus has no defined buffer length. – verdy_p Sep 11 '19 at 08:58
  • Your suggestion of `Buffer.from(string)` is then irrelevant! – verdy_p Sep 11 '19 at 08:59
  • @verdy_p Fair point. I wanted to make the point that they need to understand exactly what they mean by string length. If they are coming from C (not sure whether that's the case for C#), they may be assuming a character and a byte are synonymous, which, as you quite rightly say, they are not. – Euan Smith Sep 11 '19 at 09:04
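A minimal sketch of the contrast verdy_p draws above between `Buffer.from` (which encodes as UTF-8 by default) and writing each 16-bit code unit as two big-endian bytes (the helper name `codeUnitsToBytesBE` is illustrative only):

```javascript
// Buffer.from defaults to UTF-8: U+0B95 encodes to 3 bytes.
console.log(Array.from(Buffer.from("\u0B95"))); // [ 224, 174, 149 ]

// Writing each UTF-16 code unit as 2 big-endian bytes gives length * 2 bytes,
// with no Unicode validation (lone surrogates pass through unchanged).
function codeUnitsToBytesBE(str) {
  const buf = Buffer.alloc(str.length * 2);
  for (let i = 0; i < str.length; i++) {
    buf.writeUInt16BE(str.charCodeAt(i), i * 2);
  }
  return buf;
}
console.log(Array.from(codeUnitsToBytesBE("\u0B95"))); // [ 11, 149 ]
```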

Finally, a JavaScript string may be 64 KB or larger, so its length after conversion to bytes may not fit in a 16-bit integer or 2 bytes. The minimum safe code should first check the string length (throw an error if `st.length >= 32768`), then use a 2-byte buffer (from `Buffer.alloc(2)`) before writing the string length into it with `buf.writeUInt16BE(st.length)`.
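A minimal sketch of that check, matching the signed 16-bit (`short`) range from the question (the function name `stringLengthToBytes` is illustrative):

```javascript
function stringLengthToBytes(st) {
  // A C# short is a signed 16-bit integer, so the length must stay below 32768.
  if (st.length >= 32768) {
    throw new RangeError("string length does not fit in a signed 16-bit integer");
  }
  const buf = Buffer.alloc(2);
  buf.writeUInt16BE(st.length, 0); // safe: the value is in [0, 32767]
  return Array.from(buf);
}
console.log(stringLengthToBytes("-foo-".repeat(59))); // [ 1, 39 ]
```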

It's generally a bad idea to write code that cannot handle correct string contents: a text with 32,768 characters or more is not at all exceptional. The assumption may be acceptable if the text comes from a database field with length limitations, but not if it comes from user input in an HTML form field, which would first require validation. Even if that form was submitting valid UTF-8, the validator should still check it: don't assume that the "user" is using a browser; it could be a malicious bot trying to break your webapp to harvest security holes, gain privileges, or steal private data.

On the web, input validation (valid encoding, length limits, text format) of submitted form data is required for ALL input fields (including combo selectors, radio buttons, and checkboxes) before ALL processing (and you may want to discard all unknown fields with unexpected names). Make sure that your processing can handle all text lengths that pass your webapp's validators (including self-contained applications using an embedded web component, such as mobile or desktop apps, not just web browsers).

IMHO, it's just bad to rely on such a 16-bit assumption, but your question is valid provided that validators are implemented and length constraints are checked first.

verdy_p