I have installed nette/tokenizer https://packagist.org/packages/nette/tokenizer in a sandbox project to play with tokenization.
"basic" rules definition
The most basic test is to play with the example given here: https://doc.nette.org/en/3.1/tokenizer#toc-string-tokenization, which consists of these 3 rules:
$languagePatterns[ 'basic' ] = [
'number' => '\d+',
'whitespace' => '\s+',
'string' => '\w+',
];
I call this language "basic" and this is the input of my test:
There were 101 dalmatians
I paint each token with a different color in the output, and it works well.
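For context, this is roughly how I feed the rules to the tokenizer and print each token (the loop below is just illustrative; the constructor and tokenize() API are the ones from the linked docs):
use Nette\Tokenizer\Tokenizer;

$tokenizer = new Tokenizer($languagePatterns['basic']);
$tokens = $tokenizer->tokenize('There were 101 dalmatians');

foreach ($tokens as $token) {
    // each Nette\Tokenizer\Token exposes type, value and offset
    echo $token->type . ': "' . $token->value . '"' . PHP_EOL;
}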
"parentheses" rules definition
I now want to play with identifying "blocks in parentheses" while leaving the rest as before. For example, for this input:
There were (more than) 101 dalmatians
the block (more than) should be one token, and the rest should be tokenized as in the basic language (distinguishing words, whitespace and numbers). So the output should be:
"string": "There"
"whitespace": " "
"string": "were"
"whitespace": " "
"group": "(more than)"
"whitespace": " "
"number": "101"
"whitespace": " "
"string": "dalmatians"
So I keep the rules I had and add one new rule, group, like this:
$languagePatterns[ 'parentheses' ] = [
'number' => '\d+',
'whitespace' => '\s+',
'string' => '\w+',
'group' => '\((.*)\)',
];
And it works.
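This is the same harness as before with the 'parentheses' rule set swapped in (only the rule set and the input change):
$tokenizer = new Tokenizer($languagePatterns['parentheses']);
$tokens = $tokenizer->tokenize('There were (more than) 101 dalmatians');
// printing the tokens as before yields the list shown above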
Problem
I now want to tokenize like this:
- "Parentheses blocks" is one token
- "anything else" is another token.
For example the input
There were (more than) 101 dalmatians
Should be tokenized as:
"anything": "There were "
"group": "(more than)"
"anything": " 101 dalmatians"
And I try rules like these:
$languagePatterns[ 'parentheses2' ] = [
'whatever' => '.*',
'group' => '\((.*)\)',
];
and I get a full match.
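That is, the whole input seems to come back as a single whatever token, something like:
"whatever": "There were (more than) 101 dalmatians"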
Or, just in case the order of the rules has an impact, I try these (first match the parentheses block, then anything else):
$languagePatterns[ 'parentheses2' ] = [
'group' => '\((.*)\)',
'whatever' => '.*',
];
and then I don't match anything.
Question
What rules should I define in PHP for nette/tokenizer to behave as desired?