2

I am trying to put together a proof of concept of an XSS-safe string interpolation scheme.

Given a string with substitutions,

"Hello <b>$planetoid</b>!"

I want break it into literal portions and substitutions ("Hello<b>" planetoid "</b>!") and then run a state machine left to right over the literal portions. When I reach an interpolated value (planetoid in the above), I need to be able to get from the state to an appropriate escaping function.

Does anyone know of any examples of how to use lex/yacc/bison to derive a state machine and be able to associate labels in the grammar with output states? I want to derive a state machine that I can use both in javascript, and to try and replace PHP's underlying string implementation.

My reasons for doing this are described here.

cheers, mike

Mike Samuel
  • 118,113
  • 30
  • 216
  • 245
  • bison actually generates a state machine, and you can add actions for every reduction that takes place during parsing, so actually what you need is to replace $vars inside a string? Or what? – Jack Nov 05 '09 at 23:59

3 Answers3

8

In general, it is not possible to create a state machine for grammars that can be represented in BNF. State machines can only recognize regular languages and BNF can specify context-free languages. Yacc can create parsers. Would that be sufficient?

PeterAllenWebb
  • 10,319
  • 3
  • 37
  • 44
  • +1, more precise than me about chomsky hierarchy, I should study it again. – Jack Nov 06 '09 at 00:09
  • I want to do this purely at a lexical level. HTML and CSS have regular grammars. JS does not have a regular lexical grammar, but I can come up with a lexically regular subset if I can just transition to an error state when I see something too widgy. – Mike Samuel Nov 06 '09 at 00:10
  • I think if you're willing to cut corners just searching for $ followed by a valid variable name and then replacing would suffice in most cases. Also, crazy as it might seem, HTML is not a regular language, since its nesting level is potentially unlimited. – PeterAllenWebb Nov 06 '09 at 00:22
  • I did not mean to say is regular -- the lexical grammar is regular. I have no problem finding interpolations. What I want is a state machine that I can use to determine the appropriate escaping convention for a substitution at a point in the language. I want to create a combined HTML/CSS/JS grammar, assign names to certain states, derive a state machine, and use the assigned names to map states to escaping functions. When I run the state machine over "<" foo " href=" bar ">" I want to know that when I see foo, I'm in a EXPECT_TAGNAME state, and bar is in a EXPECT_ATTR_VALUE. – Mike Samuel Nov 06 '09 at 01:03
  • Ah. I see your point now. But I think that you'll have trouble doing this with a state machine alone. You will probably have to parse the HTML, CSS, and JS separately. Fortunately there are already tools for that. – PeterAllenWebb Nov 06 '09 at 13:40
0

It looks like I can put markers in the grammar, so if I use two different production types, one with no side effect, and that consumes characters, and one that consumes no characters, but updates a state variable

ST_EXPECT_TAG_NAME : { state = TAG_NAME };
TAG_BODY
    : '<' ST_EXPECT_TAG_NAME TAG_NAME ATTRS SPACES '>' ST_OUT_OF_TAG
    ;

the compiled output associates state names in a switch statement

YY_REDUCE_PRINT (yyn);
switch (yyn)
  {
      case 118:
#line 74 "tmp/html-combo.y"
    { state = TAG_NAME ;}
    break;

There may be a way to extract the tables without parsing C, but I'm so ignorant of yacc/bison.

Mike Samuel
  • 118,113
  • 30
  • 216
  • 245
0

You can use yacc/bison for this. Initially looking at bison its hard to tell where you can implement a state machine. Rules in bison are resolved left-right-up. Ie; If you have a rule (called rule0) which derives rule1 rule2 rule3, the actions are called in this order: rule1,rule2,rule3,rule0. You can combine that using a global state machine, with dynamic return values for rules (I use a union with different types like a string, an int or even a container for return values).

deek0146
  • 962
  • 6
  • 20