I have installed nette/tokenizer (https://packagist.org/packages/nette/tokenizer) in a sandbox project to play with tokenization.

"basic" rules definition

The most basic test is to play with the example given in the documentation (https://doc.nette.org/en/3.1/tokenizer#toc-string-tokenization), made of these 3 rules:

    $languagePatterns[ 'basic' ] = [
        'number' => '\d+',
        'whitespace' => '\s+',
        'string' => '\w+',
    ];

I call this language "basic" and this is the input of my test:

There were 101 dalmatians
There were   101      dalmatians

I color each token in the output, and it works well:

[Screenshot: tokenization with language "basic"]
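
For readers without the screenshot, the result can be approximated in plain PHP. This is a minimal sketch and not the nette/tokenizer API: sketchTokenize() is a hypothetical helper that assumes the rules are tried in order at each position and that the first rule to match wins.

```php
<?php
/**
 * Hypothetical helper (not part of nette/tokenizer): at each offset, try the
 * rules in order and keep the first one that matches there.
 */
function sketchTokenize(array $patterns, string $input): array
{
    $tokens = [];
    $offset = 0;
    $length = strlen($input);
    while ($offset < $length) {
        foreach ($patterns as $type => $pattern) {
            // The "A" modifier anchors the match at $offset.
            if (preg_match('~' . $pattern . '~A', $input, $m, 0, $offset) === 1 && $m[0] !== '') {
                $tokens[] = [$type, $m[0]];
                $offset += strlen($m[0]);
                continue 2; // go on with the next token
            }
        }
        throw new RuntimeException("No rule matches at offset $offset");
    }
    return $tokens;
}

$basic = [
    'number' => '\d+',
    'whitespace' => '\s+',
    'string' => '\w+',
];

foreach (sketchTokenize($basic, 'There were 101 dalmatians') as [$type, $value]) {
    printf("%-10s \"%s\"\n", $type, $value);
}
```

In this sketch, number has to come before string, because \w+ also matches digits.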

"parentheses" rules definition

I now want to play with identifying "blocks in parentheses" and leave the rest unchanged. For example, for this input:

There were (more than) 101 dalmatians

the block (more than) should be one token and the rest should be tokenized as in the basic language (distinguishing words, whitespace and numbers). So the output should be:

"string":     "There"
"whitespace": " "
"string":     "were"
"whitespace": " "
"group":      "(more than)"
"whitespace": " "
"number":     "101"
"whitespace": " "
"string":     "dalmatians"

So I keep the rules I had and add one new rule, group, like this:

    $languagePatterns[ 'parentheses' ] = [
        'number' => '\d+',
        'whitespace' => '\s+',
        'string' => '\w+',
        'group' => '\((.*)\)',
    ];

And it works:

[Screenshot: tokenization with language "parentheses"]

Problem

I now want to tokenize like this:

  • "Parentheses blocks" is one token
  • "anything else" is another token.

For example the input

There were (more than) 101 dalmatians

Should be tokenized as:

"anything":   "There were "
"group":      "(more than)"
"anything":   " 101 dalmatians"

I tried rules like these:

    $languagePatterns[ 'parentheses2' ] = [
        'whatever' => '.*',
        'group' => '\((.*)\)',
    ];

and I get a full match:

[Screenshot: Parentheses 2 - full match]

Or, just in case the order of the rules has an impact, I tried these (first match the parentheses block, then anything else):

    $languagePatterns[ 'parentheses2' ] = [
        'group' => '\((.*)\)',
        'whatever' => '.*',
    ];

and then I don't match anything.

[Screenshot: Parentheses 2 - no match]

Question

I wonder what rules I should define in PHP for nette/tokenizer to behave as desired.

Xavi Montero
  • `'whatever' => '[^()]+'` – hakre Sep 05 '21 at 07:46
  • Tested and it works! Could you please set it as an answer in order to select it? In addition for all readers, if you could be so kind to express what it does, it would be fantastic. – Xavi Montero Sep 05 '21 at 09:47
  • XRef: I raised this question trying to solve this other question https://stackoverflow.com/questions/68988193/complex-text-substitution-algorithm-or-design-pattern – Xavi Montero Sep 05 '21 at 09:48
  • XRef: Adding a reference to an old question by @Lone Learner that is also seeking how to match "everything else" https://stackoverflow.com/questions/27217075/how-to-tokenize-using-regular-expression-such-that-regex-for-everything-else-d – Xavi Montero Sep 05 '21 at 10:10

2 Answers

This tokenization code attempts to return all matches found with an expression.

Using

    $languagePatterns[ 'basic' ] = [
        'anythingbutparens' => '[^()]+',
    ];

you will get what you need: any run of characters that contains no parentheses.

EXPLANATION

  [^()]+                   any character except: '(', ')' (1 or more
                           times (matching the most amount possible))
Ryszard Czech

When you define a token, take care that the pattern cannot match nothing (an empty string). Treat that as a precaution, because it is easy to overlook when tinkering with token patterns. An example is .*, where the short quantifier * stands for zero or more ({0,}).
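
The empty match is easy to reproduce with plain preg_match(), independent of the tokenizer. This small sketch only assumes that a tokenizer tries each pattern anchored at its current position, which the "A" modifier imitates here:

```php
<?php
// Try three patterns against an empty input, anchored at the start ("A").
$results = [];
foreach (['.*', '.+', '[^()]+'] as $pattern) {
    $matched = preg_match('~' . $pattern . '~A', '', $m);
    // Store the matched text, or null when the pattern did not match at all.
    $results[$pattern] = $matched === 1 ? $m[0] : null;
}
var_dump($results);

// Without the "s" modifier, "." does not match a line break, so ".*" also
// "succeeds" with an empty match in front of a newline.
preg_match('~.*~A', "\nabc", $lineBreak);
var_dump($lineBreak[0]);
```

Only .* reports a successful match here, and in both cases the matched text is the empty string.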

The group token from the question is immune to that:

        'group' => '\((.*)\)',

It contains both parentheses, so it is at least two characters long.

However, the whatever token is not:

        'whatever' => '.*',

It can match either too little or too much.

As the Nette Tokenizer appears to use a first-wins strategy, putting it above group matches nothing and putting it below group matches everything (as tokenizing starts at the first character).

Instead of letting it match nothing, you could "undefine" the parenthesis characters by excluding them from this token:

        'whatever' => '[^()]+',

That matches any character except ( and ), at least once (+ is the short quantifier for one or more, {1,}).
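
To see the effect without the library, here is a minimal sketch in plain PHP. sketchTokenizeCombined() is a hypothetical helper, not the nette/tokenizer API; it assumes the rules can be joined into a single anchored alternation and uses named groups to tell the token types apart.

```php
<?php
/**
 * Hypothetical helper, not part of nette/tokenizer: joins the rules into one
 * alternation with named groups and collects the contiguous matches.
 */
function sketchTokenizeCombined(array $rules, string $input): array
{
    $parts = [];
    foreach ($rules as $type => $pattern) {
        $parts[] = "(?<$type>$pattern)";
    }
    // "A" anchors each match at the end of the previous one, so no gap is skipped.
    $regex = '~' . implode('|', $parts) . '~A';
    preg_match_all($regex, $input, $matches, PREG_SET_ORDER);

    $tokens = [];
    $consumed = 0;
    foreach ($matches as $match) {
        foreach (array_keys($rules) as $type) {
            // The first named group with a non-empty value gives the token type.
            if (($match[$type] ?? '') !== '') {
                $tokens[] = [$type, $match[$type]];
                break;
            }
        }
        $consumed += strlen($match[0]);
    }
    if ($consumed !== strlen($input)) {
        throw new RuntimeException("Input not fully tokenized, stopped at offset $consumed");
    }
    return $tokens;
}

$rules = [
    'group' => '\((.*)\)',
    'whatever' => '[^()]+',
];

$tokens = sketchTokenizeCombined($rules, 'There were (more than) 101 dalmatians');
foreach ($tokens as [$type, $value]) {
    printf("%-9s \"%s\"\n", $type, $value);
}
```

For the input from the question this yields three tokens: whatever "There were ", group "(more than)" and whatever " 101 dalmatians".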

As far as tokenizing goes, this may already answer your immediate question. However, the question itself highlights an additional issue: some strings will fail to be fully tokenized. As the Nette Tokenizer example in the question shows, there can be no match at all, which should be a warning sign.

E.g. what if there is an unclosed group? Or what if the string starts in the middle of what could be a group?

You may want to consider matching each of the parenthesis characters separately and construing the hierarchy (e.g. of a group) from the token stream afterwards. But even if not, the tokens should cover the whole character stream:

        'group' => '\((.*)\)',
        'text' => '[^()]+',
        'parenthesis_open' => '\(',
        'parenthesis_close' => '\)',
        'any' => '.+',

This has the highest-order token first (group), then text, then spare open and close parentheses, and a terminating any/catch-all token to match everything else.

If the token stream contains the any token, you can easily identify parts that were perhaps not intended. If you leave the parenthesis_* tokens out, stray parentheses would show up there.
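
A quick way to try this rule set on unbalanced input is a small first-wins loop in plain PHP. tokenizeFirstWins() is a hypothetical helper written for this illustration, under the assumption that the first matching rule wins at each position; it is not the nette/tokenizer API.

```php
<?php
/**
 * Hypothetical helper, not part of nette/tokenizer: at each offset the rules
 * are tried in order and the first non-empty match becomes the next token.
 */
function tokenizeFirstWins(array $rules, string $input): array
{
    $tokens = [];
    $offset = 0;
    while ($offset < strlen($input)) {
        foreach ($rules as $type => $pattern) {
            // "A" anchors the match at $offset.
            if (preg_match('~' . $pattern . '~A', $input, $m, 0, $offset) === 1 && $m[0] !== '') {
                $tokens[] = [$type, $m[0]];
                $offset += strlen($m[0]);
                continue 2; // go on with the next token
            }
        }
        throw new RuntimeException("No rule matches at offset $offset");
    }
    return $tokens;
}

$rules = [
    'group' => '\((.*)\)',
    'text' => '[^()]+',
    'parenthesis_open' => '\(',
    'parenthesis_close' => '\)',
    'any' => '.+',
];

$unclosed = tokenizeFirstWins($rules, 'There were (more than 101 dalmatians');
$unopened = tokenizeFirstWins($rules, 'more than) 101 dalmatians');

print_r(array_column($unclosed, 0)); // token types only
print_r(array_column($unopened, 0));
```

Both inputs are consumed completely: the stray parenthesis becomes its own parenthesis_open or parenthesis_close token instead of stopping the tokenization.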

The last example should also make visible that

        'group' => '\((.*)\)',

can catch too much, e.g. by pairing parentheses that do not belong together. It is perhaps possible to express proper pairing with PCRE-based regex patterns (PCRE supports recursion), but technically that does a lot of work in the tokenization. It is not a strict rule, but tokenization is often a divide-and-conquer strategy, so break the input down with simple tokens and do the parsing (e.g. what constitutes a group) on the token stream later.

If you stay with tokenization alone, it is perhaps better to limit it:

        'group' => '\(([^()]*)\)',

A group contains nothing or text. You can see the pattern.
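
The difference between the two group patterns shows up as soon as a line contains more than one group. A short check with plain preg_match() (the input string is made up for this illustration):

```php
<?php
$input = '(more than) 101 (or fewer) dalmatians';

// Greedy ".*" runs to the last ")" on the line.
preg_match('~\((.*)\)~A', $input, $greedy);

// "[^()]*" cannot cross a parenthesis, so the match ends at the first ")".
preg_match('~\(([^()]*)\)~A', $input, $limited);

echo $greedy[0], "\n";  // (more than) 101 (or fewer)
echo $limited[0], "\n"; // (more than)
```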

hakre