JavaScript Unicode Conversion and Search

Question

I'm wondering whether anyone has any insight on converting an array of character codes to Unicode characters, and searching them with a regex.

If you have

var a = [0,1,2,3]

you can use a loop to convert them into a string of the first four control characters in Unicode.

However, if you then want to create a regex

"(X)+"

where X == the character code 3 converted to its Unicode equivalent, the searches never seem to work. If I check the length of the string, it's correct, and .* returns all the characters in the string. But I'm having difficulty constructing a regex to search the string when all I have to begin with is the character codes. Any advice?

Edit:

var a = [0,1,2,3,0x111];
str = "";

for(var i = 0; i < a.length; i++) {
    str += String.fromCharCode(a[i]);
}

var r = [0x111]
var reg = ""

reg += "(";
for(var i = 0; i < r.length; i++) {
    var hex = r[i].toString(16);
    reg += "\\x" + hex;
}
reg += ")";

var res = str.match(RegExp(reg))[0];

Edit

//Working code:
var a = [0,1,2,3,0x111];
str = "";

for(var i = 0; i < a.length; i++) {
    str += String.fromCharCode(a[i]);
}

var r = [3,0x111]
var reg = ""

reg += "(";
for(var i = 0; i < r.length; i++) {
    var hex = r[i].toString(16);
    reg += ((hex.length > 2) ? "\\u" : "\\x") + ("0000" + hex).slice((hex.length > 2) ? -4 : -2);
}
reg += ")";

var res = str.match(RegExp(reg))[0];

Can you post just a few line code example of exactly what you are trying to do -- a minimal example that we can look at, rather than guess at? In its current form, it is very difficult to answer this question. — Jeremy J Starcher, May 11 '14 at 02:22
I hope the above edit is sufficient to get the basic idea across, though in the actual application it will be significantly more sophisticated. — AaronF, May 11 '14 at 05:18

Joel Allison · Accepted Answer · 2014-05-11T16:16:26.000

With changes to a few details, the example can be made to work.

Assuming that you are interested in printable Unicode characters in general, and not specifically the first four control characters, the test vector a for the string "hello" would be:

var a = [104, 101, 108, 108, 111]; // hello

If you want to match both 'l' characters:

var r = [108, 108]

When you construct your regular expression, the character code must be written in hexadecimal, zero-padded to exactly two digits after \x:

reg += "\\x" + ("0" + r[i].toString(16)).slice(-2);

After that, you should see the results you expect.
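Putting the pieces above together, a minimal sketch might look like the following. The helper name codesToHexPattern is made up for illustration, and it assumes every code is in the range 0 to 255, since \x takes exactly two hex digits.

```javascript
// Hypothetical helper (not from the original answer): builds a regex source
// string from character codes using \xhh escapes. Assumes codes are 0..255.
function codesToHexPattern(codes) {
  var pattern = "(";
  for (var i = 0; i < codes.length; i++) {
    // Pad with a leading zero so a code like 3 becomes "03", not "3".
    pattern += "\\x" + ("0" + codes[i].toString(16)).slice(-2);
  }
  return pattern + ")";
}

var a = [104, 101, 108, 108, 111];            // hello
var str = String.fromCharCode.apply(null, a); // build the string from the codes
var reg = codesToHexPattern([108, 108]);      // regex source: (\x6c\x6c)
var res = str.match(new RegExp(reg));         // matches "ll" at index 2
```

Without the padding, a code such as 3 would produce \x3, which is not a valid two-digit escape and does not match the control character.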

edited May 11 '14 at 16:16

answered May 11 '14 at 05:51

Right, so when I stringify the regex with either method, I get (\x2\x3). That tests negative against a string containing the first four Unicode characters, while (.)* tests positive for the whole string. – AaronF May 11 '14 at 15:35
Hardcoding (\x02\x03) tests positive, so I guess a single character hex doesn't cut it. Any thoughts on formatting hex strings like this? – AaronF May 11 '14 at 15:42
Edited to perform padding with leading zero. If using `\uxxxx` in your regular expression (where `xxxx` is four hexadecimal digits for the Unicode character), the leading pad string would have three zeroes. – Joel Allison May 11 '14 at 16:19
Thanks a lot for your help! Your padding solution is genius. – AaronF May 11 '14 at 16:47
So, I generate a regex like (\x000002\x000003) and compare it against the string containing the first four Unicode characters in order. It returns negative even though (\x02\x03) returns positive. Any idea why? – AaronF May 11 '14 at 16:53
Use `\xhh` when you have two hex digits and `\uhhhh` when you have four hexadecimal digits. See https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions. – Joel Allison May 11 '14 at 16:56
I posted the working code as an edit. Thanks for all your help! – AaronF May 11 '14 at 17:13
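For reference, the rule from the comments above (`\xhh` for two hex digits, `\uhhhh` for four) can be sketched as a small helper. The name codeToEscape is hypothetical, and the sketch assumes each code is a single UTF-16 code unit (0 to 0xFFFF), which is what String.fromCharCode works with.

```javascript
// Hypothetical helper: picks \xhh for codes up to 0xFF and \uhhhh otherwise.
// Assumes each code is a UTF-16 code unit in the range 0..0xFFFF.
function codeToEscape(code) {
  var hex = code.toString(16);
  return code <= 0xFF
    ? "\\x" + ("00" + hex).slice(-2)     // exactly two hex digits
    : "\\u" + ("0000" + hex).slice(-4);  // exactly four hex digits
}

var str = String.fromCharCode(0, 1, 2, 3, 0x111);
var source = "(" + [3, 0x111].map(codeToEscape).join("") + ")"; // (\x03\u0111)
var res = str.match(new RegExp(source));                        // match at index 3
```

Always emitting `\uhhhh` with four-digit padding would also work for this range and avoids the branch; the two-form version just mirrors the rule as stated.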
