I read "Regular Expression Matching: the Virtual Machine Approach" and am now trying to parse a regular expression and build a virtual machine program from it. The tokenizer works and produces its tokens. After that step, I convert the token stream to reverse Polish notation, so in the end I get
a b c | |
from the regular expression a|(b|c).
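For reference, the conversion step can be sketched with the shunting-yard algorithm. This is a minimal sketch, not the actual tokenizer or converter: the name `to_rpn` is hypothetical, and it assumes only single-character literals, `|`, and parentheses (no concatenation or repetition operators).

```python
def to_rpn(tokens):
    """Convert infix regex tokens to reverse Polish notation (shunting-yard).

    Assumption: only single-character literals, '|' and parentheses occur.
    """
    output, operators = [], []
    for tok in tokens:
        if tok == "(":
            operators.append(tok)
        elif tok == ")":
            # Pop pending operators back to the matching "(".
            while operators and operators[-1] != "(":
                output.append(operators.pop())
            if not operators:
                raise ValueError("unbalanced ')'")
            operators.pop()  # discard the "(" itself
        elif tok == "|":
            # '|' is treated as left-associative: flush a pending '|' first.
            while operators and operators[-1] == "|":
                output.append(operators.pop())
            operators.append(tok)
        else:
            output.append(tok)  # literal character
    while operators:
        op = operators.pop()
        if op == "(":
            raise ValueError("unbalanced '('")
        output.append(op)
    return output


print(" ".join(to_rpn(list("a|(b|c)"))))  # a b c | |
```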
Now the step where I am stuck: I want to get the array
0: split 1, 3
1: match 'a'
2: jump 7
3: split 4, 6
4: match 'b'
5: jump 7
6: match 'c'
7: noop
from the stream above, and I cannot get it right. I use an output array and a stack for the start position of each token. First, the three values are added to the output (and their start positions to the stack).
output              stack
------------------- ------
0: match 'a'        0: 0
1: match 'b'        1: 1
2: match 'c'        2: 2
With |, I pop the last two positions from the stack and insert split and jump at the corresponding positions. The values are calculated based on the current stack length and the number of elements I add.
At the end, I push the new start position of the combined element onto the stack (it remains the same in this case).
output              stack
------------------- ------
0: match 'a'        0: 0
1: split 2, 4       1: 1
2: match 'b'
3: jump 5
4: match 'c'
That seems OK. Now the next | is processed...
output              stack
------------------- ------
0: split 1, 3       0: 0
1: match 'a'
2: jump 7
3: split 2, 4
4: match 'b'
5: jump 5
6: match 'c'
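The three steps can be reproduced with a minimal Python sketch of the procedure described so far. The function name `compile_in_place`, the tuple encoding of the instructions, and the exact insert arithmetic are assumptions made for illustration, not the original code.

```python
def compile_in_place(rpn):
    """Hypothetical reconstruction: insert split/jump into one output array."""
    output = []  # instructions as tuples, addressed by absolute index
    starts = []  # start position of each pending operand
    for tok in rpn:
        if tok != "|":
            starts.append(len(output))
            output.append(("match", tok))
            continue
        right = starts.pop()
        left = starts.pop()
        # After both inserts the program is two instructions longer.
        end = len(output) + 2
        # Insert the jump first (higher index), then the split.
        output.insert(right, ("jump", end))
        output.insert(left, ("split", left + 1, right + 2))
        starts.append(left)  # the combined element starts where `left` did
    output.append(("noop",))
    return output


for addr, ins in enumerate(compile_in_place("a b c | |".split())):
    print(addr, *ins)
# Entries 3 and 5 still hold the addresses computed before the second insert.
```

Each insert shifts every instruction behind it, but the absolute targets stored in instructions emitted earlier are not adjusted, which is why entries 3 and 5 end up stale.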
And here's the problem: I have to update all the addresses that I calculated before (lines 3 and 5). That's not what I want. I guess relative addresses have the same problem (at least if the values are negative).
So my question is: how do I create a VM program from a regex? Am I on the right track (with the RPN form), or is there another (and/or easier) way?
The output array is stored as an integer array. The split command in fact needs 3 entries, jump needs two, ...
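One direction that might avoid the patching, sketched under assumptions rather than offered as a confirmed answer: keep whole code fragments on the stack instead of start positions, and give split and jump offsets relative to their own position. Nothing is ever inserted into an already-built fragment, so the offsets inside it should stay valid when it is wrapped by an outer |. The name `compile_fragments` and the tuple encoding are hypothetical.

```python
def compile_fragments(rpn):
    """Build the program from self-contained fragments with relative offsets."""
    stack = []  # each entry is a complete fragment (a list of instructions)
    for tok in rpn:
        if tok != "|":
            stack.append([("match", tok)])
            continue
        right = stack.pop()
        left = stack.pop()
        # Offsets are relative to the instruction that holds them.
        frag = [("split", 1, len(left) + 2)]
        frag += left
        frag.append(("jump", len(right) + 1))  # lands just past `right`
        frag += right
        stack.append(frag)
    (program,) = stack  # well-formed RPN leaves exactly one fragment
    program = program + [("noop",)]
    # Resolve relative offsets to absolute addresses in one final pass.
    resolved = []
    for addr, ins in enumerate(program):
        op, args = ins[0], ins[1:]
        if op in ("split", "jump"):
            args = tuple(addr + off for off in args)
        resolved.append((op, *args))
    return resolved


for addr, ins in enumerate(compile_fragments("a b c | |".split())):
    print(addr, *ins)
```

For a b c | | this yields the target listing (split 1, 3 at address 0 and split 4, 6 at address 3). If instructions occupy a varying number of integers, as in the integer array mentioned above, the lengths would presumably have to be counted in array slots rather than in instructions.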