I have a string, and I need to make sure that it contains only a regular expression and no JavaScript. I'm creating a new script from the string, so an embedded JavaScript snippet would be a security risk.

Exact scenario:

  1. JS in a Mozilla add-on loads its configuration as JSON through an HTTP request (the JSON contains {"something": "^(?:http|https)://(?:.*)"})
  2. JS creates a PAC file (proxy auto-configuration script) that uses the "something" regex from the configuration
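
To make the risk concrete, here is a minimal sketch of what a naive version of step 2 could look like. The function name buildPacNaive, the proxy address and the hostile value are made up for illustration and are not taken from the actual add-on.

```javascript
// Hypothetical names throughout: buildPacNaive and the proxy address are
// illustrative only; "something" mirrors the key from the scenario above.
function buildPacNaive(config) {
  // UNSAFE: the configuration value is pasted into the script as code.
  return "function FindProxyForURL(url, host) {\n"
       + "  if (/" + config.something + "/.test(url)) { return 'PROXY proxy.example:8080'; }\n"
       + "  return 'DIRECT';\n"
       + "}";
}

// A hostile value closes the regex literal early and smuggles in a call.
var hostile = { something: "x/.test(url) || evil(), /x" };
console.log(buildPacNaive(hostile));
```

The generated condition reads `/x/.test(url) || evil(), /x/.test(url)`, so evil() would run inside the PAC script whenever the first test fails.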

Any ideas how to escape the string without destroying the regex in it?

Malte Goetz
  • Regexes aren't regular. I doubt it will be possible to match a regex with a regex accurately. – Amal Murali Aug 28 '14 at 15:35
  • So, the string is read/converted from a file? That means you can't separate regex constructs from anything else. –  Aug 28 '14 at 15:43
  • As stated above the string is from a json file loaded through a httprequest. But because of security concerns from Mozilla (makes also sense to me) I need to make sure that the string really contains a regex and nothing else. If it would contain javascript instead of the regex, the js would be executed in the pac file. – Malte Goetz Aug 28 '14 at 15:50
  • So what's first, the string or the json file? When and how could JS be executed? –  Aug 28 '14 at 16:41
  • The json file is the source of the string! The JS could be executed because I generate a new pac-script (proxy config) inside my script with the string. – Malte Goetz Aug 28 '14 at 16:49
  • Your best bet is to cull the string to the regex part. Whether that is a key/value pair I don't know. –  Aug 28 '14 at 16:56

2 Answers

You can use a regular expression to pull apart a JavaScript regular expression.

Then you should convert the regex to a lexically simpler subset of JavaScript that avoids all the non-context-free weirdness about what / means, and any irregularities in the input regex.

var REGEXP_PARTS = "(?:"
    // A regular character
    + "[^/\r\n\u2028\u2029\\[\\\\]"
    // An escaped character, charset reference or backreference
    + "|\\\\[^\r\n\u2028\u2029]"
    // A character set
    + "|\\[(?!\\])(?:[^\\]\\\\]|\\\\[^\r\n\u2028\u2029])+\\]"
    + ")";

var REGEXP_REGEXP = new RegExp(
    // A regex starts with a slash
    "^[/]"
    // It cannot be lexically ambiguous with a line or block comment
    + "(?![*/])"
    // Capture the body in group 1
    + "(" + REGEXP_PARTS + "+)"
    // The body is terminated by a slash
    + "[/]"
    // Capture the flags in group 2
    + "([gmi]{0,3})$");

 var match = myString.match(REGEXP_REGEXP);

 if (match) {
   var ctorExpression =
       "(new RegExp("
         // JSON.stringify escapes special chars in the body, so will
         // preserve token boundaries.
         + JSON.stringify(match[1])
         + "," + JSON.stringify(match[2])
       + "))";
   alert(ctorExpression);
 }

This results in an expression that lies in a well-understood subset of JavaScript.

The complex regex above is not in the TCB (trusted computing base). The only part that needs to function correctly for security to hold is the construction of ctorExpression, including the use of JSON.stringify.
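
One caveat about relying on JSON.stringify here: it leaves U+2028 and U+2029 unescaped, and JavaScript engines that predate ES2019 treat those characters as line terminators inside string literals, so the generated script could fail to parse. The following is a minimal sketch of a wrapper that escapes them as well. The helper name toJsStringLiteral and the sample input are hypothetical, not part of the code above.

```javascript
// Hypothetical helper: produce a JavaScript string literal for any value.
// JSON.stringify handles quotes, backslashes and control characters;
// the two replace calls additionally escape U+2028 and U+2029.
function toJsStringLiteral(value) {
  return JSON.stringify(String(value))
    .replace(/\u2028/g, "\\u2028")
    .replace(/\u2029/g, "\\u2029");
}

// A body that tries to break out of the string literal.
var body = "a\u2028\"); evil(); (\"";
var literal = toJsStringLiteral(body);

// Evaluating the literal yields the original text and runs nothing else.
var roundTripped = (new Function("return " + literal + ";"))();
console.log(roundTripped === body);
```

With this in place, ctorExpression could be built from toJsStringLiteral(match[1]) and toJsStringLiteral(match[2]) instead of calling JSON.stringify directly.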

Mike Samuel

It seems that most of the standard JavaScript functionality is available in PAC files (source), so you can just do:

try {
    RegExp(json.something+'');
    pacFile += 'RegExp(' + JSON.stringify(json.something+'') + ')';
} catch(e) {/*handle invalid regexp*/}

There is nothing to worry about, because RegExp("console.log('test')") only produces the valid regexp /console.log('test')/ and executes nothing.
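
Put together, the whole generation step might look like the sketch below. The function name buildPac, the proxy address and the hostile sample pattern are assumptions made for illustration. FindProxyForURL is the entry point that PAC files define.

```javascript
// Hypothetical helper: validate the pattern, then embed it as a string
// literal so the generated script treats it purely as data.
function buildPac(pattern, proxy) {
  pattern = String(pattern);
  new RegExp(pattern); // throws SyntaxError if the pattern is not a valid regex
  return "function FindProxyForURL(url, host) {\n"
       + "  var re = new RegExp(" + JSON.stringify(pattern) + ");\n"
       + "  return re.test(url) ? " + JSON.stringify("PROXY " + proxy) + " : \"DIRECT\";\n"
       + "}";
}

// The pattern from the question.
var pac = buildPac("^(?:http|https)://(?:.*)", "proxy.example:8080");

// A pattern that is a valid regex but looks like an injection attempt.
var hostilePac = buildPac("\"+evil()+\"", "proxy.example:8080");
console.log(hostilePac);
```

In the second script the quotes arrive escaped inside the string literal, so evil() is only ever text that the regex tries to match.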

Volune