How to exclude characters from a RegEx pattern with category property codes?

Question

There are a number of category property codes (see the section "Unicode character properties") that can be used in a Perl-compatible regular expression (PCRE).

I defined a regex pattern (a named subpattern) that should match letters (\p{L}), numbers (\p{N}), the space separator (\p{Zs}), and also punctuation (\p{P}).

(?<sport>[\p{L}\p{N}\p{Zs}\p{P}]*)

Since I'm using that for URL routing, the slashes should be excluded. How can I do that?

EDIT:

Additional information about the context: The pattern is used for a route definition in a Zend Framework 2 module.

/Catalog/config/module.config.php

<?php
return array(
    ...
    'router' => array(
        'routes' => array(
            ...
            'sport' => array(
                'type'  => 'MyNamespace\Mvc\Router\Http\UnicodeRegex',
                'options' => array(
                    'regex' => '/catalog/(?<city>[\p{L}\p{Zs}]*)/(?<sport>[\p{L}\p{N}\p{Zs}\p{P}]*)',
                    'defaults' => array(
                        'controller' => 'Catalog\Controller\Catalog',
                        'action'     => 'list-courses',
                    ),
                    'spec'  => '/catalog/%city%/%sport%',
                ),
                'may_terminate' => true,
                'child_routes' => array(
                    'courses' => array(
                        'type'  => 'segment',
                        'options' => array(
                            'route' => '[/page/:page]',
                            'defaults' => array(
                                'controller' => 'Catalog\Controller\Catalog',
                                'action'     => 'list-courses',
                            ),
                        ),
                        'may_terminate' => true,
                    ),
                )
            ),
        ),
    ),
    ...
);

Could you add some examples of strings you'd like to apply the regexp to, and the results you want to get, please? – Aleksei Zyrianov, Apr 26 '13 at 15:15
Sure: `Aikido`, `Aerobic, Sportaerobic`. The URI _can_ continue after the sport title (e.g. `Aikido/page/2` or `Aerobic, Sportaerobic/page/2`), so the RegEx parser should stop at the slash. – automatix, Apr 26 '13 at 15:23

score 3 · Accepted Answer · answered Apr 26 '13 at 15:32

You can use a negative look-ahead to exclude a character from your character class. For your example:

(?<sport>(?:(?!/)[\p{L}\p{N}\p{Zs}\p{P}])*)

Basically, the negative look-ahead (?!/) first checks that the next character is not /, and only then is that character tested against the character class [\p{L}\p{N}\p{Zs}\p{P}].

PCRE has no set intersection or set difference for character classes, so this is the workaround.
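
A minimal PHP sketch of the look-ahead approach, assuming the pattern ends up in preg_match with the u modifier. The # delimiters, the ^ anchor, and the sample paths are assumptions made for illustration; how the custom UnicodeRegex route compiles its regex is not shown in the question.

```php
<?php
// '#' is used as the delimiter so that '/' needs no escaping.
// The 'u' modifier makes PCRE treat pattern and subject as UTF-8,
// which the \p{..} property codes need for non-ASCII input.
$pattern = '#^/catalog/(?<city>[\p{L}\p{Zs}]*)/'
         . '(?<sport>(?:(?!/)[\p{L}\p{N}\p{Zs}\p{P}])*)#u';

// Hypothetical request paths, based on the examples from the comments.
$paths = array(
    '/catalog/Berlin/Aikido',
    '/catalog/Berlin/Aerobic, Sportaerobic/page/2',
);

$sports = array();
foreach ($paths as $path) {
    if (preg_match($pattern, $path, $matches)) {
        // The named group stops before the next '/', if there is one.
        $sports[] = $matches['sport'];
    }
}

print_r($sports);
```

Without the look-ahead, the class would also consume /page/2, because / belongs to \p{P} and the digits belong to \p{N}.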

score 0 · Answer 2 · Tom Regner · 2013-04-26T15:45:34.570

Since you use it for URL parsing:

According to RFC 1738, apart from alphanumerics only the special characters $-_.+!*'(), are allowed unencoded in a URL¹, so instead of using \pP (yes, that is allowed instead of \p{P}), I suggest you use these characters directly in your regex.

Edit: But if that's not an option, this should be a starting point:
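
A rough PHP sketch of that suggestion, assuming the letters, numbers, and space separator from the question are kept and only \p{P} is replaced by the listed characters. The helper name extractSport and the test strings are hypothetical.

```php
<?php
// The RFC 1738 punctuation $-_.+!*'(), is listed explicitly; '/' is simply
// not part of the class. The '-' is placed last so it is a literal hyphen.
// \' is PHP string escaping for the apostrophe, not regex syntax.
$pattern = '#^(?<sport>[\p{L}\p{N}\p{Zs}$_.+!*\'(),-]*)#u';

// Hypothetical helper: returns the captured sport title, or null on no match.
function extractSport($pattern, $subject)
{
    return preg_match($pattern, $subject, $m) ? $m['sport'] : null;
}

echo extractSport($pattern, 'Aerobic, Sportaerobic/page/2'), "\n";
echo extractSport($pattern, 'Aikido/page/2'), "\n";
```

The trade-off is that any punctuation outside the list (for example & or :) also ends the match, which may or may not be wanted for sport titles.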

(?:([\p{L}\p{N}\p{Zs}\p{P}]+?)(?=/|\?|#|$))

kind regards, Tom

¹: Not entirely true, but /@#;?&= are only allowed unencoded if they should have their special meaning.
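
For comparison, a small PHP sketch of the lazy pattern with the terminating look-ahead from this answer. The ^ anchor, the sport group name, the ~ delimiter, and the helper firstSegment are assumptions added for the example.

```php
<?php
// '~' is the delimiter here because the pattern itself contains '/' and '#'.
// The lazy +? grows one character at a time until the look-ahead sees
// '/', '?', '#', or the end of the subject.
$pattern = '~^(?<sport>[\p{L}\p{N}\p{Zs}\p{P}]+?)(?=/|\?|#|$)~u';

// Hypothetical helper: returns the captured segment, or null on no match.
function firstSegment($pattern, $subject)
{
    return preg_match($pattern, $subject, $m) ? $m['sport'] : null;
}

echo firstSegment($pattern, 'Aerobic, Sportaerobic/page/2'), "\n";
echo firstSegment($pattern, 'Aikido?page=2'), "\n";
echo firstSegment($pattern, 'Aikido'), "\n";
```

One caveat of this variant: because +? must consume at least one character and / itself is in \p{P}, a subject that starts with / is not rejected; the leading slash becomes part of the match.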

edited Apr 26 '13 at 15:45

answered Apr 26 '13 at 15:26

Tom Regner

There are later RFCs that allow i18n of ~~URLs~~ domain names, where Unicode characters are allowed in the URL. I don't know whether the OP has to process them or not. – nhahtdh Apr 26 '13 at 15:29
Thank you for your answer! But is there a way to leave the pattern as it is and only exclude the slash? Something like `(?[\p{L}\p{N}\p{Zs}\p{P}^/]*)`. – automatix Apr 26 '13 at 15:30