
How can I take a raw string in JavaScript and convert all the escape sequences to their respective characters? In other words, the reverse of String.raw. For example:

unraw("\\x61\\x62\\x63 \\u{1F4A9} \\u0041");
// => "abc 💩 A";

I tried JSON.parse, but it only supports the last format (\\u0041). Neither unescape nor decodeURI is what I am looking for.
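To make the JSON.parse limitation concrete, here is a minimal sketch. The helper name `tryJsonParse` is hypothetical; it wraps the raw text in double quotes, parses it as a JSON string, and returns the error name when parsing fails.

```javascript
// Hypothetical helper: parse the raw text as the body of a JSON string literal.
function tryJsonParse(raw) {
    try {
        return JSON.parse('"' + raw + '"');
    } catch (err) {
        // JSON only allows \uHHHH plus a few single-character escapes
        return err.name;
    }
}

console.log(tryJsonParse("\\u0041"));    // A
console.log(tryJsonParse("\\x61"));      // SyntaxError
console.log(tryJsonParse("\\u{1F4A9}")); // SyntaxError
```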

Ian

2 Answers


I think you basically have three choices:

  1. Write your own function to do it, handling the various types of escapes that JavaScript allows in strings; or
  2. Leverage the JavaScript parser built into the engine where this code is running. That requires new Function (or even eval), so you have to trust the content of the string; otherwise you open yourself up to arbitrary code execution; or
  3. Use a parser like Esprima or similar

#1 is a bit of a pain but really not that bad; there aren't that many escapes to handle. #2 has all the usual issues around trusting the string contents not to be nefarious code, since both eval and the function that new Function creates can run arbitrary code. #3 is a fairly heavy solution.
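To make the risk in #2 concrete, here is a minimal sketch. The names `unrawUnsafe` and `markSideEffect` are hypothetical and exist only for this illustration; the point is that a raw string containing an unescaped double quote can break out of the generated literal and run as code.

```javascript
// Hypothetical helper illustrating option 2. Do NOT use it on untrusted input.
function unrawUnsafe(raw) {
    // Builds the source text `return "<raw>"` and lets the engine parse it.
    return new Function('return "' + raw + '"')();
}

console.log(unrawUnsafe("\\x61\\x62\\x63 \\u0041")); // abc A

// Stand-in for whatever an attacker might want to run.
let sideEffect = false;
globalThis.markSideEffect = () => { sideEffect = true; return ""; };

// The unescaped quotes end the literal early, so the call executes.
unrawUnsafe('" + markSideEffect() + "');
console.log(sideEffect); // true
```

A trusted, developer-written string is fine here; anything that comes from users or other packages is not.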

Looking at #1 a bit more closely, EscapeSequence breaks down into:

  • Single character escapes, \ followed by one of '"\bfnrtv.
  • Hex escapes, \xHH where H is a hex digit
  • Unicode escapes, \uHHHH or \u{H+} where, again, H is a hex digit

That's not actually all that bad. Here's a quick-and-dirty:

// Note: This does not implement LegacyOctalEscapeSequence (https://tc39.es/ecma262/#prod-annexB-LegacyOctalEscapeSequence)
function unraw(str) {
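    // Inside a character class \b means backspace, so the backslash and "b" are listed explicitly as \\b
    return str.replace(/\\[0-9]|\\['"\\bfnrtv]|\\x[0-9a-f]{2}|\\u[0-9a-f]{4}|\\u\{[0-9a-f]+\}|\\./ig, match => {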
        switch (match[1]) {
            case "'":
            case "\"":
            case "\\":
                return match[1];
            case "b":
                return "\b";
            case "f":
                return "\f";
            case "n":
                return "\n";
            case "r":
                return "\r";
            case "t":
                return "\t";
            case "v":
                return "\v";
            case "u":
                if (match[2] === "{") {
                    return String.fromCodePoint(parseInt(match.substring(3), 16));
                }
                return String.fromCharCode(parseInt(match.substring(2), 16));
            case "x":
                return String.fromCharCode(parseInt(match.substring(2), 16));
            case "0":
                return "\0";
            default: // E.g., "\q" === "q"
                return match.substring(1);
        }
    });
}
console.log(String.raw`${unraw("\\x61\\x62\\x63 \\u{1F4A9} \\u0041")}`);
// Double-check result
const str =           "\x61\x62\x63 \u{1F4A9} \u0041";
const raw = String.raw`\x61\x62\x63 \u{1F4A9} \u0041`;
console.log(str === unraw(raw));

I'm sure that can be cleaned up a bit.

T.J. Crowder
  • I think #1 is probably the way to go. I'm surprised this doesn't exist already; I was hoping to find a well-tested library for this. #2 is definitely not an option as I have no idea what the string content will be. This is for an NPM module involving template literal tags so that would be a huge vulnerability. – Ian Aug 02 '19 at 16:31
  • @Ian - Yeah, I was surprised not to find one. I realized that #1 isn't really that much work at all, string literals aren't that complicated. I added a quick-and-dirty version. – T.J. Crowder Aug 02 '19 at 16:42
  • Oops, I missed a couple of things (as you do). Fixed now I hope. – T.J. Crowder Aug 02 '19 at 17:05
  • Thank you! I expanded on that idea a bit and came up with something that hopefully behaves almost exactly like a JS parser (notably I wanted it to throw errors for invalid sequences). I've posted a link in my answer https://stackoverflow.com/a/57332315/1243041 – Ian Aug 02 '19 at 19:19

As it appears there's nothing out there, I've written my own, which is a bit more robust than @T.J. Crowder's excellent answer. Notably, I wanted a function that behaves pretty much exactly as JS parsers do when they process strings, which means it needs to throw on invalid escape sequences. This function also properly handles double-escaped sequences like "\\\x61" which should yield "\\x61" and Unicode surrogates, as well as optionally handling octal literals properly or at least throwing errors when encountering them. Finally, it supports escaping characters that don't need to be escaped, like "\R".

I haven't had the chance to test it thoroughly yet, but I've uploaded it here https://github.com/iansan5653/unraw and will eventually write extensive unit tests and publish it as an NPM module.
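For readers who want to see the idea without following the link, here is a minimal sketch of a strict variant. It is not the code from the repository above; `unrawStrict` and the constant names are hypothetical, and it assumes octal-style escapes should always be rejected rather than optionally supported.

```javascript
// Hypothetical strict variant for illustration; NOT the code from the linked repository.
const SINGLE_ESCAPES = new Map([
    ["b", "\b"], ["f", "\f"], ["n", "\n"], ["r", "\r"],
    ["t", "\t"], ["v", "\v"], ["0", "\0"],
]);
const LINE_TERMINATORS = new Set(["\n", "\r", "\r\n", "\u2028", "\u2029"]);

// Groups: 1 = \xHH, 2 = \uHHHH, 3 = \u{H+}, 4 = any other escaped character.
// The final `$` alternative matches a lone backslash at the end of the input.
const ESCAPE = /\\(?:x([0-9a-fA-F]{2})|u([0-9a-fA-F]{4})|u\{([0-9a-fA-F]+)\}|(\r\n|[^])|$)/g;

function unrawStrict(raw) {
    return raw.replace(ESCAPE, (match, hex, unit, point, other, offset) => {
        if (hex !== undefined) return String.fromCharCode(parseInt(hex, 16));
        if (unit !== undefined) return String.fromCharCode(parseInt(unit, 16));
        if (point !== undefined) {
            const value = parseInt(point, 16);
            if (value > 0x10FFFF) {
                throw new SyntaxError(`Code point out of range at index ${offset}`);
            }
            return String.fromCodePoint(value);
        }
        if (other === undefined) {
            throw new SyntaxError(`Trailing backslash at index ${offset}`);
        }
        if (other === "x" || other === "u") {
            // The well-formed \x and \u alternatives above did not match
            throw new SyntaxError(`Malformed \\${other} escape at index ${offset}`);
        }
        if (LINE_TERMINATORS.has(other)) return ""; // line continuation
        if (/[0-9]/.test(other)) {
            // Only a lone \0 is accepted; octal-style escapes are rejected
            const next = raw[offset + 2];
            if (other !== "0" || (next !== undefined && /[0-9]/.test(next))) {
                throw new SyntaxError(`Octal-style escape at index ${offset}`);
            }
        }
        // Known single-character escapes map to their value; anything else
        // (quotes, backslash, "\R", ...) stands for itself
        return SINGLE_ESCAPES.has(other) ? SINGLE_ESCAPES.get(other) : other;
    });
}

console.log(unrawStrict("\\x61\\x62\\x63 \\u{1F4A9} \\u0041") === "abc \u{1F4A9} A"); // true
console.log(unrawStrict("\\\\x61") === "\\x61"); // true
```

Matching every backslash in a single pass is what keeps double-escaped input such as `\\x61` intact: the first backslash consumes the second, so the `x61` that follows is never treated as an escape.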

Ian