14

I read Regular Expression Matching: the Virtual Machine Approach and am now trying to parse a regular expression and build a virtual machine program from it. The tokenizer works and produces its tokens. After that step, I convert the token stream to reverse Polish notation, so at the end I get

a b c | |

from the regular expression a|(b|c). Now for the step where I am stuck: I want to get the array

0: split 1, 3
1: match 'a'
2: jump 7
3: split 4, 6
4: match 'b'
5: jump 7
6: match 'c'
7: noop

from the stream above, and I cannot get it right. I use an output array and a stack that holds the start position of each token. First, the three values are added to the output (and their start positions to the stack).

output              stack
------------------- ------
0: match 'a'        0: 0
1: match 'b'        1: 1
2: match 'c'        2: 2

With |, I pop the last two positions from the stack and insert split and jump at the corresponding positions. The target values are calculated from the current stack length and the number of elements I add. At the end, I push the new start position of the combined element onto the stack (it stays the same in this case).

output              stack
------------------- ------
0: match 'a'        0: 0
1: split 2, 4       1: 1
2: match 'b'
3: jump 5
4: match 'c'

That seems ok. Now the next | is processed...

output              stack
------------------- ------
0: split 1, 3       0: 0
1: match 'a'
2: jump 7
3: split 2, 4
4: match 'b'
5: jump 5
6: match 'c'

And here's the problem. I have to update all the addresses that I calculated before (lines 3 and 5). That's not what I want. I guess relative addresses have the same problem (at least if the values are negative).

So my question is: how do I create a VM program from a regex? Am I on the right track (with the RPN form), or is there another (and/or easier) way?

The output array is stored as an integer array. The split command in fact needs 3 entries, jump needs 2, ...

Seki
mal-raten
  • a regex-virtual-machine tag would be more precise – 1010 May 22 '15 at 13:20
  • I have a very similar project and I don't think you have a chance without recalculating. If you think of a tree structure, you can recursively start a `|`-node with outputting a `split`, process the first child, output the `jump`, process the second child and after returning, update the addresses on `split` and `jump`. It's easy within a tree - but it's still recalculation. – Wolfgang Kluge May 22 '15 at 19:18

4 Answers

1

It would be easier to use relative jumps and splits instead.

  • a — Push a match to the stack

    0: match 'a'
    
  • b — Push a match to the stack

    0: match 'a'
    --
    0: match 'b'
    
  • c — Push a match to the stack

    0: match 'a'
    --
    0: match 'b'
    --
    0: match 'c'
    
  • | — Pop two frames from the stack, and instead push split <frame1> jump <frame2>

    0: match 'a'
    --
    0: split +1, +3
    1: match 'b'
    2: jump +2
    3: match 'c'
    
  • | — Pop two frames from the stack, and instead push split <frame1> jump <frame2>

    0: split +1, +3
    1: match 'a'
    2: jump +5
    3: split +1, +3
    4: match 'b'
    5: jump +2
    6: match 'c'
    

If you really need absolute jumps instead, you can make one final pass over the finished program and add each instruction's own index to its offsets.
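The steps above can be sketched in a few lines of Python. This is only an illustration under some assumptions: the function names `compile_rpn` and `to_absolute` are made up here, tokens are single characters, `|` is the only operator, and instructions are tuples rather than the integer array from the question. Each stack entry is a whole frame (a list of instructions), so combining two frames never touches the offsets already inside them.

```python
def compile_rpn(tokens):
    """Compile RPN tokens into a program whose targets are relative offsets."""
    stack = []  # each entry is a frame: a list of instruction tuples
    for tok in tokens:
        if tok == "|":
            right = stack.pop()
            left = stack.pop()
            # split: +1 is the left branch, +len(left)+2 skips left and the jump
            frame = [("split", 1, len(left) + 2)]
            frame.extend(left)
            # jump: skip over the right branch to the first slot after it
            frame.append(("jump", len(right) + 1))
            frame.extend(right)
            stack.append(frame)
        else:
            stack.append([("match", tok)])
    (program,) = stack  # a well-formed expression leaves exactly one frame
    return program


def to_absolute(program):
    """One final pass: add each instruction's own index to its offsets."""
    out = []
    for pc, ins in enumerate(program):
        if ins[0] == "split":
            out.append(("split", pc + ins[1], pc + ins[2]))
        elif ins[0] == "jump":
            out.append(("jump", pc + ins[1]))
        else:
            out.append(ins)
    return out


relative = compile_rpn(list("abc||"))
absolute = to_absolute(relative)
```

For `a b c | |` the relative program is the seven-instruction listing shown above, and the absolute version has the same targets as the array in the question. Address 7 is one past the last instruction, which is where the question places its `noop`.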

Markus Jarderot
0

I think that instead of setting addresses during processing, you can store a reference to the command you want to jump to, and store those references (or pointers) in the output array as well. After all processing is complete, you walk the generated output and assign the indices based on the actual position of each referenced command in the resulting array.
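A rough Python sketch of this idea follows. It deviates slightly from the description: because the command after an alternation may not exist yet when the jump is emitted, the references point at small `Label` marker objects placed in the stream instead of at commands directly. The names `Label`, `alternation`, `resolve` and `compile_rpn_labels` are hypothetical, and tokens are assumed to be single characters with `|` as the only operator.

```python
class Label:
    """A position marker; it occupies no slot in the final program."""


def alternation(left, right):
    """Build left|right, referring to positions only through labels."""
    l1, l2, end = Label(), Label(), Label()
    return [("split", l1, l2), l1, *left, ("jump", end), l2, *right, end]


def resolve(items):
    """Replace label references with the final instruction indices."""
    # Pass 1: each label gets the index of the instruction that follows it.
    address = {}
    pc = 0
    for item in items:
        if isinstance(item, Label):
            address[item] = pc
        else:
            pc += 1
    # Pass 2: drop the labels and substitute indices for references.
    program = []
    for item in items:
        if isinstance(item, Label):
            continue
        op, *args = item
        args = [address[a] if isinstance(a, Label) else a for a in args]
        program.append((op, *args))
    return program


def compile_rpn_labels(tokens):
    stack = []  # each entry is a list of instructions and labels
    for tok in tokens:
        if tok == "|":
            right = stack.pop()
            left = stack.pop()
            stack.append(alternation(left, right))
        else:
            stack.append([("match", tok)])
    (items,) = stack
    return resolve(items)


program = compile_rpn_labels(list("abc||"))
```

Nothing is recalculated while the fragments are being combined; the only address computation is the single `resolve` pass at the end.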

Petr
  • This might work - but I really don't like the idea of it... ;) Please don't get me wrong. Thanks for answering - I just hope there's another, direct way – mal-raten May 22 '15 at 16:27
0

RPN is not the best way to build up the output you need. If you built an AST instead, then it would be easy to generate the codes with a recursive traversal.

Let's say you had an OR node, for example, with two children "left" and "right". Each node implements a method generate(OutputBuffer), and the implementation for an OR node would look like this:

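void generate(OutputBuffer out)
{
    int splitpos = out.length();
    out.append(SPLIT);
    out.append(splitpos + 3);  // first target: L1, right after the 3-entry split
    out.append(0);             // reservation 1: second target, patched below
    // L1: start of the left branch
    left.generate(out);
    int jumppos = out.length();
    out.append(JUMP);          // opcode constant, like SPLIT
    out.append(0);             // reservation 2: jump target, patched below
    // L2: start of the right branch
    out.set(splitpos + 2, out.length());  // reservation 1 = L2
    right.generate(out);
    // L3: first position after the alternation
    out.set(jumppos + 1, out.length());   // reservation 2 = L3
}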
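For a version that can be run directly, here is the same reserve-and-patch pattern as a small Python sketch. Everything in it is an assumption made for illustration: the opcode numbers, the `Char` and `Or` node classes, and a plain Python list standing in for `OutputBuffer`. It uses the flat integer layout from the question (3 entries for split, 2 for jump) and also assumes 2 entries for match.

```python
SPLIT, JUMP, MATCH = 0, 1, 2  # hypothetical opcode numbers


class Char:
    def __init__(self, ch):
        self.ch = ch

    def generate(self, out):
        out.extend([MATCH, ord(self.ch)])


class Or:
    def __init__(self, left, right):
        self.left = left
        self.right = right

    def generate(self, out):
        split_pos = len(out)
        # First target is known now; the second slot is reserved.
        out.extend([SPLIT, split_pos + 3, 0])
        self.left.generate(out)
        jump_pos = len(out)
        out.extend([JUMP, 0])          # jump target is reserved
        out[split_pos + 2] = len(out)  # patch: right branch starts here
        self.right.generate(out)
        out[jump_pos + 1] = len(out)   # patch: first slot after the alternation


# AST for a|(b|c)
ast = Or(Char("a"), Or(Char("b"), Char("c")))
code = []
ast.generate(code)
```

Note that the targets here are offsets into the integer array, not instruction numbers, so they differ from the listing in the question. Each slot is patched exactly once, after the child that precedes the target has been emitted, so no address computed earlier ever needs to be revisited.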
Matt Timmermans
  • Oops... I just noticed this question is 6 months old! I hope you figured it out somehow before now. I have an open source project that implements the DFA method. I personally think that way is much cooler, but all of these regex implementations are lots of fun – Matt Timmermans Oct 27 '15 at 03:02
0

FWIW, Cox posted an implementation of his approach here. You might find it helpful as a reference.

James Davis