Regex matching a pattern that doesn't include another pattern

Question

I understand how to write a regex that matches only numbers, but how would I add another condition so that the input cannot contain a certain substring? For example, a regex that matches input containing only numbers, but not the substring 456.

Given this input (where <empty> is the empty string ""):

0
1456
<empty>
12345689
1010101
abc

These and only these should match:

0
<empty>
1010101

Could somebody explain the regex for this?

This can be done, with difficulty. But why do you want to? It's much easier to write "normal" code to make this check. — slim, Dec 13 '16 at 22:07
I'm prepping for a test, this was a question on a previous years test. Explicitly stated must be written in regex — unity1989, Dec 13 '16 at 22:10
I do not believe that there is a single regular expression that can express this. — bmargulies, Dec 13 '16 at 22:10
@bmargulies Plain regex (i.e. even dialects without "fancy" features like non-greedy matchers) compiles to a fully fledged finite state machine. Since you only ever need to look three chars ahead to find a mismatch, we know we can solve this with finite states, so it's definitely possible, but it might be long-winded. With non-greedy matchers and negative lookaheads, it can be tersely expressed (although not necessarily easy to understand). — slim, Dec 14 '16 at 15:15

anubhava · Answer 1 · 2016-12-13T22:23:27.337

2

You can use this regex using a negative lookahead:

^(?![0-9]*456)[0-9]*$

RegEx Demo

(?![0-9]*456) is a negative lookahead that disallows 456 anywhere in the input.
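As a minimal sketch, the pattern above can be run against the six sample inputs from the question, with <empty> written as "". The class name LookaheadCheck and the method isAllowed are made up for illustration and are not part of the answer.

```java
import java.util.regex.Pattern;

final class LookaheadCheck {
    // The lookahead fails if some run of digits from the start is followed by 456;
    // [0-9]* then requires the whole input to consist of digits.
    private static final Pattern NO_456 = Pattern.compile("^(?![0-9]*456)[0-9]*$");

    // Hypothetical helper: true if the input is all digits and has no "456".
    static boolean isAllowed(String input) {
        return NO_456.matcher(input).matches();
    }

    public static void main(String[] args) {
        String[] inputs = {"0", "1456", "", "12345689", "1010101", "abc"};
        for (String input : inputs) {
            System.out.println("\"" + input + "\" -> " + isAllowed(input));
        }
    }
}
```

Only "0", the empty string and "1010101" should be reported as allowed.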

edited Dec 13 '16 at 22:23

answered Dec 13 '16 at 22:12

anubhava


Shouldn't match letters nor symbols – unity1989 Dec 13 '16 at 22:15
As with @Bohemian's answer, this is inefficient as it traverses the whole input twice. – slim Dec 14 '16 at 15:59
Good luck with your efficiency theory on `^(?:(?!456)\d)*$` vs `^(?![0-9]*456)[0-9]*$` – anubhava Dec 14 '16 at 16:37
@anubhava Matthew's answer is efficient as it makes a single pass, looking ahead a maximum of 3 chars. And will be even cheaper when the current char isn't '4'. – slim Dec 14 '16 at 17:57
Nope. That is performing lookahead **after every character** but this regex performs lookahead only once at the start. You can check # of steps taken using both regex on regex101 site and you will note higher # of steps using `^(?:(?!456)\d)*$` – anubhava Dec 14 '16 at 18:25
I don't know how that site is counting steps, but it is adding 4 for each additional `[^4]` character in the input, which suggests a poor implementation. The lookahead should cost 1 if the current char is not '4'. However I'm going to backpedal a little. Your implementation costs 2n comparisons (for a match) in all cases, but must sweep the input twice. `^(?:(?!456)\d)*$` costs 2n comparisons if there are no '4's, and an extra 1 for every '4' plus 1 for every '45', but it only has to sweep the input once. However, since we're talking about a Java string in memory, a second sweep is cheap ... – slim Dec 14 '16 at 19:18
... I'm afraid I'm conditioned to think about streams, where a second sweep is expensive or impossible. – slim Dec 14 '16 at 19:19

score 2 · Answer 2 · edited Dec 14 '16 at 15:56

2

I think this is what you are looking for:

public class Main {
    public static void main(String[] args) {
        // Before each digit, a negative lookahead checks that "456" does not start here.
        String regex = "^((?!456)\\d)*$";
        String test = "123";
        String test2 = "456";
        String test3 = "asdf123";
        String test4 = "test456asdf";

        System.out.println(test.matches(regex));  // true
        System.out.println(test2.matches(regex)); // false
        System.out.println(test3.matches(regex)); // false
        System.out.println(test4.matches(regex)); // false
    }
}

That is:

start of string
zero or more times
- look at the three chars starting here, don't match if it's "456"
- match one digit
end of string

Here's a link to fiddle where you can test the epsilon character as well.

edited Dec 14 '16 at 15:56

slim


answered Dec 13 '16 at 22:23

Matthew Brzezinski


but test3 should *not* match: *only contains numbers*. Your code doesn't even pass OP's examples. – Bohemian Dec 13 '16 at 22:35
The character for epsilon is not a number but OP was asking for that to be returned? If that's the case then just replace the '.' with '\\d' – Matthew Brzezinski Dec 13 '16 at 22:44
I changed the `.` to `\d` so it matches only numbers. This makes it the right answer, if we allow ourselves negative lookaheads (it's the solution favoured by Perl Monks -- http://www.perlmonks.org/?node_id=518444 ). I couldn't edit the fiddle. – slim Dec 14 '16 at 15:53
I see the OP is updated now to remove the epsilon character. I just updated the fiddle link as well to use `\d`. Thanks @slim ! – Matthew Brzezinski Dec 14 '16 at 15:55

Bohemian · Answer 3 · 2016-12-13T22:30:36.850

1

Use a negative look ahead anchored to start, and match "numbers":

^(?!.*456)\p{N}*$
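One thing worth checking with this pattern: in Java, \p{N} is the Unicode "Number" category, so it accepts more than the ASCII digits 0-9. The sketch below compares it with a [0-9] variant. The class and method names are hypothetical, and whether non-ASCII numerals should count as "numbers" for the question is an assumption left to the reader.

```java
import java.util.regex.Pattern;

final class UnicodeNumberCheck {
    // The pattern from the answer: any Unicode number character is accepted.
    static final Pattern UNICODE_NUMBERS = Pattern.compile("^(?!.*456)\\p{N}*$");
    // Hypothetical stricter variant: ASCII digits only.
    static final Pattern ASCII_DIGITS = Pattern.compile("^(?!.*456)[0-9]*$");

    static boolean matchesUnicode(String input) {
        return UNICODE_NUMBERS.matcher(input).matches();
    }

    static boolean matchesAscii(String input) {
        return ASCII_DIGITS.matcher(input).matches();
    }

    public static void main(String[] args) {
        // "\u0663\u0664" is two Arabic-Indic digits (three and four).
        String[] inputs = {"1010101", "1456", "\u0663\u0664"};
        for (String input : inputs) {
            System.out.println(input + ": unicode=" + matchesUnicode(input)
                    + ", ascii=" + matchesAscii(input));
        }
    }
}
```

The two patterns agree on ASCII input and differ only on numerals outside 0-9.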

edited Dec 13 '16 at 22:30

answered Dec 13 '16 at 22:21

Bohemian


Inefficient, as it will traverse the whole input twice (although that's fine if you know the input will always be small) – slim Dec 14 '16 at 15:49

slim · Answer 4 · 2016-12-14T16:22:39.770

I think this works without any "fancy" regex features such as negative lookahead.

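^([0-35-9]|4+(54+)*([0-36-9]|5[0-357-9]))*(4+(54+)*5?)?$

That is:

start
- any number of:
  - a single digit other than 4
  - or a run of 4s, optionally followed by repeats of "5" plus more 4s, that is closed by either
    - a digit other than 4 or 5
    - or a 5 followed by a digit other than 4 or 6
- optionally, one such run left unclosed at the end, finishing in "4" or "45"
end

The character after a "4" or a "45" must not be swallowed if it is itself a 4, because that 4 may start a 456 of its own (as in 4456 or 45456).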

This is in keeping with regex's property of being a finite state machine. We have explicitly dealt with three states - ("Not seen a 4", "Seen a 4", "Seen a 45"). If we wanted our 'not matching' string to be "4567" we'd have to explicitly add another state, making the pattern longer and the state machine bigger.
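The three states can also be written out by hand. The following is a sketch of such a state machine in plain Java, not code from the answer; the class name No456Automaton and the integer state encoding are assumptions made for illustration.

```java
final class No456Automaton {
    // States: 0 = "not seen a 4", 1 = "seen a 4", 2 = "seen a 45".
    static boolean accepts(String input) {
        int state = 0;
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c < '0' || c > '9') {
                return false;          // only ASCII digits are allowed
            }
            if (c == '4') {
                state = 1;             // a 4 always (re)starts a possible 456
            } else if (c == '5' && state == 1) {
                state = 2;             // "4" extended to "45"
            } else if (c == '6' && state == 2) {
                return false;          // "45" extended to "456": reject
            } else {
                state = 0;             // any other digit resets the machine
            }
        }
        return true;                   // all three states are accepting
    }

    public static void main(String[] args) {
        String[] inputs = {"0", "1456", "", "12345689", "1010101", "abc", "4456"};
        for (String input : inputs) {
            System.out.println("\"" + input + "\" -> " + accepts(input));
        }
    }
}
```

Excluding a longer substring such as "4567" would need one more state ("seen a 456"), which mirrors how the pattern itself would grow.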

Whether this meets your needs depends on what the test is looking for -- familiarity with advanced features of Java's regex dialect, or ability to apply regular expressions universally (e.g. basic grep, bash).

Negative lookaheads allow you to express this more tersely.

^((?!456)\d)*$

That is (with start and end anchors around it), zero or more repetitions of a one-char pattern: (?!456)\d which means "Not the start of 456 (looking ahead without actually consuming) and matches a numeric character."

To process this, the regex engine only ever needs to look 3 chars ahead of the current character, making this an efficient one-pass way of meeting the requirement.
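One way to gain confidence in the lookahead form, written here as ^(?:(?!456)\d)*$, is to compare it against a plain substring check on every short digit string. This is only a sketch: the class name, the method names and the length limit of 5 are arbitrary choices for illustration.

```java
import java.util.regex.Pattern;

final class TemperedLookaheadCheck {
    // Before consuming each digit, assert that "456" does not start at that position.
    static final Pattern TEMPERED = Pattern.compile("^(?:(?!456)\\d)*$");

    static boolean isAllowed(String input) {
        return TEMPERED.matcher(input).matches();
    }

    // Counts digit strings of length 0..maxLen on which the regex disagrees
    // with the plain check "does not contain 456".
    static int countDisagreements(int maxLen) {
        int disagreements = 0;
        int total = 1; // number of digit strings of the current length
        for (int len = 0; len <= maxLen; len++) {
            for (int n = 0; n < total; n++) {
                // Zero-pad so that strings with leading zeros are covered too.
                String s = len == 0 ? "" : String.format("%0" + len + "d", n);
                if (isAllowed(s) == s.contains("456")) {
                    disagreements++;
                }
            }
            total *= 10;
        }
        return disagreements;
    }

    public static void main(String[] args) {
        System.out.println("disagreements: " + countDisagreements(5));
    }
}
```

If the pattern is right, the count should be zero. The same harness can be pointed at any of the other patterns in this thread by swapping the string passed to Pattern.compile.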
