
What is the best way, inside a regex, to negate multiple words together with every permutation of the characters that make up those words?

For instance: I do not want

"zero dollar"
"roze dollar"
"eroz dollar"
"one dollar"
"noe dollar"
"oen dollar"

but I do want

"thousand dollar"
"million dollar"
"trillion dollar"

If I write

not m/ [one | zero] \s dollar /

it does not cover permutations of the characters, and the outer "not" makes the expression true for everything else, including strings such as "big bang" that do not contain "dollar" at all.

m/ <- [one] | [zero] > \s dollar/ # this is a syntax error.
halfer
lisprogtor
  • FWIW, the way to avoid that syntax error is by writing that as `/ <!after one | zero> \s dollar/` –  Mar 01 '17 at 18:32
  • There should be a new _tag_ category `perl6 regexen` so people can distinguish from PCRE 5 types. –  Mar 01 '17 at 21:41
  • @sln The `regex` tag's description says "all questions with this tag should also include a tag specifying the applicable programming language or tool." Aiui this solves the problem you raise (not just for PCRE regex vs non-PCRE regex but also one PCRE flavor vs another), provided questioners follow the admonition to add a lang/tool tag (or editors do it for them). I believe this is true for all Perl 6 regex questions. Perhaps [tag intersection searching](http://meta.stackexchange.com/questions/231693/better-support-for-search-by-both-intersection-and-union-of-multiple-tags) needs improvement? – raiph Mar 02 '17 at 16:49
  • @raiph - I think there is some compatibility mode for Perl6 regex that enables Perl5. But Perl5-style regex constructs permeate 90% of other engines' syntax. That's why the regex tag mostly defaults to Perl5 style. It's too big a leap to have regex qualified with perl6, since it's mostly standalone in regex land. –  Mar 02 '17 at 17:24
  • Thank you sln and raiph !! I prefer perl6 regexes. – lisprogtor Mar 03 '17 at 05:55
  • and thank you Zoffix !! – lisprogtor Mar 03 '17 at 06:04
  • @sln Fwiw I think the current approach and advice in the tag works reasonably well and I'd be surprised if you get consensus on supporting introduction of a Perl 6 specific regex tag. But maybe I'm missing something. Presumably meta is the right forum if you wish to push this issue further. Please point folk to my comment above as a counterpoint to your own view if you decide to push for this new tag elsewhere and then reply here again if there's agreement we should change to a separate tag. TIA. – raiph Mar 03 '17 at 06:47

2 Answers


Using a code assertion:

You could match any word, and then use a <!{ }> assertion to reject words that are permutations of "one" or "zero":

say "two dollar" ~~ / :s ^ (\w+) <!{ $0.comb.sort.join eq "eno" | "eorz" }> dollar $ /;

Using before/after:

Alternatively, you could pre-generate all permutations of the disallowed words, and then reject them using a <!before > or <!after > assertion in the regex:

my @disallowed = <one zero>.map(|*.comb.permutations)».join.unique;

say "two dollar" ~~ / :s ^ <!before @disallowed>\w+ dollar $ /;
say "two dollar" ~~ / :s ^ \w+<!after @disallowed> dollar $ /;
smls

Here's a solution that works well. It uses a helper sub, is-bad-word, that compares the sorted characters of the $needle (i.e. what it found in the target string) against the sorted characters of each of the @badwords and returns True if any of them matches.

Inside the regex itself, I've used a negative code-assertion that passes the (\w+) that was matched into the helper sub.

One important thing to point out: if you don't properly anchor the (\w+) to the beginning of a word (I chose the beginning of the string this time), the regex will simply skip ahead one character after it finds a bad word and accept the rest anyway (unless the bad word was only one character long to begin with, as in "a dollar"). After all, zero is in your @badwords, but ero isn't.
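To make the anchoring point concrete, here is a minimal sketch that runs the same regex with and without the `^` anchor. It assumes a trimmed-down `@badwords` list and a variant of `is-bad-word` that joins the sorted characters before comparing; the variables `$unanchored` and `$anchored` are hypothetical names used only for illustration.

```raku
# Illustrative sketch: a shortened bad-word list, not the full one used below.
my @badwords = <one zero>;

# True if the sorted letters of $needle equal the sorted letters of any bad word.
sub is-bad-word($needle) {
    so $needle.comb.sort.join eq any(@badwords.map({ .comb.sort.join }))
}

# Unanchored: after "zero" is rejected, the engine retries from the next
# character, where "ero" is not a bad word, so the overall match can succeed.
my $unanchored = so "zero dollar" ~~ / (\w+) \s <!{ is-bad-word($0.Str) }> 'dollar' /;

# Anchored with ^: (\w+) has to start at the beginning of the string,
# so "zero" cannot be skipped.
my $anchored = so "zero dollar" ~~ / ^ (\w+) \s <!{ is-bad-word($0.Str) }> 'dollar' /;

say "unanchored: $unanchored";
say "anchored:   $anchored";
```

If the reasoning above holds, the unanchored version should report a match (on "ero dollar") while the anchored version should not.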

Hope that helps!

my @badwords = <one zero yellow>;

my @parsefails = q:to/EOF/.lines;
    zero dollar
    roze dollar
    erzo dollar
    one dollar
    noe dollar
    oen dollar
    yellow dollar
    wolley dollar
    EOF

my @parsepasses = q:to/EOF/.lines;
    thousand dollar
    million dollar
    dog dollar
    top dollar
    meme dollar
    EOF

sub is-bad-word($needle) {
    return $needle.comb.sort eq any(@badwords).comb.sort
}

use Test;
plan @parsefails + @parsepasses;

for flat (@parsefails X False), (@parsepasses X True) -> $line, $should-pass {
    my $succ = so $line ~~ / ^ (\w+) \s <!{ is-bad-word($0.Str) }> 'dollar' /;
    ok $succ eqv $should-pass, "$line -> $should-pass";
}

done-testing;
timotimo
  • of course, you may want to fold-case (`.fc`) both sides of the `eq` in is-bad-word if you're interested in also disallowing One dollar. – timotimo Mar 01 '17 at 18:23
  • Thank you timotimo !! You have packed many concepts that I need to do more learning. And learn Perl6 I will; be with you the force may !! – lisprogtor Mar 03 '17 at 06:03
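
The fold-case suggestion from the comments could look roughly like the sketch below. The sub name `is-bad-word-fc` and the variables `$capitalised` and `$allowed` are hypothetical, and the word list is shortened; the only change from the answer's helper is that `.fc` is applied to both sides before the characters are sorted.

```raku
# Illustrative sketch: a shortened bad-word list.
my @badwords = <one zero>;

# Case-insensitive variant (hypothetical name): fold case on both sides
# before comparing the sorted letters.
sub is-bad-word-fc($needle) {
    so $needle.fc.comb.sort.join eq any(@badwords.map({ .fc.comb.sort.join }))
}

# "One" folds to "one", which is a bad word, so this should be rejected.
my $capitalised = so "One dollar" ~~ / ^ (\w+) \s <!{ is-bad-word-fc($0.Str) }> 'dollar' /;

# "Ten" is not a permutation of any bad word, so this should still be accepted.
my $allowed = so "Ten dollar" ~~ / ^ (\w+) \s <!{ is-bad-word-fc($0.Str) }> 'dollar' /;

say "One dollar accepted: $capitalised";
say "Ten dollar accepted: $allowed";
```

Without the `.fc` calls, "One" would sort to "Oen" rather than "eno" and slip past the check.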