4

I am at the point where I am banging my head against my desk, to the amusement of my colleagues. I currently have the following regex:

(^[\w](( \w+)|(\w*))*[\w]$)|(^\w$)

What I want it to do is match any string which contains only alphanumeric characters, no leading or trailing whitespace and no more than one space between words.

A word in this case is defined as one or more alphanumeric characters.

This matches most of what I want. However, from testing, it also seems to require the second word onwards to be 2 or more characters long.

Tests:

ABC - Pass
Type 1 - Fail
Type A - Fail
Hello A - Fail
Hello Wo - Pass
H A B - Fail
H AB - Pass
AB H - Fail

Any ideas where I'm going wrong?

Jon Taylor
  • @Bergi: This should be an answer. It doesn't get any simpler and better than this (well, OK, you could use a non-capturing group). – Tim Pietzcker Mar 04 '13 at 15:28
  • I was not sure whether all these capturing groups in the OPs complicated version might have been intended… – Bergi Mar 04 '13 at 15:38
  • @Bergi tbh I was just trying anything at this point and the regex just kept getting bigger and bigger haha. Your answer is exactly what I was intending. Thanks – Jon Taylor Mar 04 '13 at 15:40

3 Answers

9

Your regex is close. The cause of your two-character problem is here:

(^[\w](( \w+)|(\w*))*[\w]$)|(^\w$)
       right here ---^

The group ( \w+) matches a space followed by one or more \w, and every word after the first must go through it because of the space. After that group you then have another mandatory [\w], which requires the final word in the string to have two or more characters. Take that one out and it should be fine:

(^[\w](( \w+)|(\w*))*$)|(^\w$)

A simpler version would be:

^\w+( \w+)*$
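
To check the diagnosis, here is a minimal sketch that runs the question's test strings through the original and the simplified pattern. It assumes Python's `re` module, which the thread does not name; the helper `matches` and the extra rejection cases are made up for illustration.

```python
import re

# Patterns quoted from the question and from the answer above.
ORIGINAL = r"(^[\w](( \w+)|(\w*))*[\w]$)|(^\w$)"
SIMPLIFIED = r"^\w+( \w+)*$"

# Test strings from the question; all of them are meant to be accepted.
SHOULD_MATCH = ["ABC", "Type 1", "Type A", "Hello A",
                "Hello Wo", "H A B", "H AB", "AB H"]
# Extra cases (not from the question) that the requirements rule out:
# leading space, trailing space, double space, empty string.
SHOULD_FAIL = [" ABC", "ABC ", "A  B", ""]

def matches(pattern, text):
    """Return True if the (already anchored) pattern matches text."""
    return re.search(pattern, text) is not None

for text in SHOULD_MATCH + SHOULD_FAIL:
    print(f"{text!r:12} original={matches(ORIGINAL, text)} "
          f"simplified={matches(SIMPLIFIED, text)}")
```

With the original pattern only ABC, Hello Wo and H AB are accepted, which matches the results listed in the question; the simplified pattern accepts all eight and still rejects the four extra cases.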
  • Naw, these things can be tricky and messy. It always helps to have a fresh pair of eyes look at code you've gone over and over. – Reinstate Monica -- notmaynard Mar 04 '13 at 15:23
  • That regex with its nested quantifiers and overlapping scopes of alternation looks like a high-risk candidate for catastrophic backtracking. @JonTaylor's requirements can be specified much more concisely and precisely. See Bergi's comment. – Tim Pietzcker Mar 04 '13 at 15:27
  • @TimPietzcker Right, hence the simpler version I gave. – Reinstate Monica -- notmaynard Mar 04 '13 at 15:28
  • Ah, I overlooked that last line (but Bergi beat you to it :)) – Tim Pietzcker Mar 04 '13 at 15:30
  • @TimPietzcker actually he beat Bergi to it :) – Jon Taylor Mar 04 '13 at 15:34
  • `\w` is not a standardized character class, and may include symbols other than alphanumerics. Specifically, in Perl it's defined as `[A-Za-z0-9_]`. Caveat emptor. – Todd A. Jacobs Mar 04 '13 at 15:40
  • @CodeGnome ok thanks, so as someone else suggested in a comment I should use [A-Za-z0-9], also if I only wanted lowercase on a UTF8 based system how would I do this since [a-z] would also cover uppercase on a utf8 system. – Jon Taylor Mar 04 '13 at 15:45
  • I believe `[a-z]` only matches lowercase regardless of character encoding. – Reinstate Monica -- notmaynard Mar 04 '13 at 16:20
  • Doesn't it count as a character range? Since UTF encodings group letters AaBbCc etc., a-z will match some uppercase too. A-Z would also have the same problem. In ASCII this is not a problem since A-Z and a-z are separate groups of letters within the encoding. – Jon Taylor Mar 04 '13 at 16:24
  • UTF encodes the Latin alphabet exactly the same as ASCII. If there is a problem with the locale, you can use the POSIX standard class `[:lower:]` (assuming your language is POSIX-compliant). I don't know when this would really be an issue, but if it is needed, there you go. – Reinstate Monica -- notmaynard Mar 04 '13 at 16:42
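
The points raised in these comments can be checked with a short sketch. It assumes Python 3's `re` module, where `\w` on text patterns also matches the underscore and non-ASCII letters; other engines may define `\w` differently. The names `WORD_CLASS`, `EXPLICIT_CLASS`, `LOWER_ONLY` and `accepts` are hypothetical.

```python
import re

# Hypothetical names for illustration; none of them appear in the thread.
WORD_CLASS = r"^\w+( \w+)*$"                        # simplified pattern from the answer
EXPLICIT_CLASS = r"^[A-Za-z0-9]+( [A-Za-z0-9]+)*$"  # same shape, explicit ASCII class
LOWER_ONLY = r"^[a-z]+$"                            # range over code points U+0061..U+007A

def accepts(pattern, text):
    """Return True if the anchored pattern matches text."""
    return re.search(pattern, text) is not None

for text in ["Type 1", "snake_case", "caf\u00e9", "hello", "Hello"]:
    print(f"{text!r:14} word={accepts(WORD_CLASS, text)} "
          f"explicit={accepts(EXPLICIT_CLASS, text)} "
          f"lower={accepts(LOWER_ONLY, text)}")
```

In this sketch the `\w` version accepts snake_case and café while the explicit class rejects both, and `[a-z]` accepts hello but not Hello, since a range is evaluated over code points and the ASCII letters keep their ASCII positions in Unicode.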
2

Use PCRE with POSIX Class

First, we need to clean up your corpus, since it contains dashes and Pass/Fail labels. Next, we add a line that will definitely fail so we have a sad path for testing. This yields the following corpus:

# /tmp/corpus
ABC
Type 1
Type A
Hello A
Hello Wo
H A B
H AB
AB H
ab $ cd

Next, we use an anchored Perl-compatible regular expression with a POSIX class that only includes alphanumeric values. We use negative lookahead to prevent trailing spaces, but allow a single space between words.

$ pcregrep '^([[:alnum:]]+(?! $) ?)+$' /tmp/corpus
ABC
Type 1
Type A
Hello A
Hello Wo
H A B
H AB
AB H

As expected, this yields the 8 valid lines and rejects the line ab $ cd. Success!

Todd A. Jacobs
  • This doesn't reject strings that end in spaces. – Tim Pietzcker Mar 04 '13 at 15:28
  • @TimPietzcker Thanks for pointing that out; I fixed it with a negative lookahead. Of course, not all greps or regular expression engines support that feature, but I think this result is a lot easier to read if your tool supports it. – Todd A. Jacobs Mar 04 '13 at 15:37
  • `(?!=$)` doesn't do what you think it does. Instead, you need a lookbehind `(?<! )` at the end of the string. A lookahead won't work, unless you use `(?=\w$)`. – Tim Pietzcker Mar 04 '13 at 16:44
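
The difference discussed in these comments can be seen in a small sketch. It assumes Python's `re` module, which has no POSIX bracket classes, so `[A-Za-z0-9]` stands in for `[[:alnum:]]`; the three pattern names and the helper `accepts` are hypothetical.

```python
import re

# Hypothetical names; [A-Za-z0-9] stands in for [[:alnum:]].
# "(?!= $)" is a negative lookahead for the literal text "= " at the end,
# so it never blocks a plain trailing space.
WITH_EQUALS = r"^([A-Za-z0-9]+(?!= $) ?)+$"
# "(?! $)" is a negative lookahead for a space followed by the end.
NEG_LOOKAHEAD = r"^([A-Za-z0-9]+(?! $) ?)+$"
# Lookbehind at the end of the string, as suggested in the comment above.
NEG_LOOKBEHIND = r"^([A-Za-z0-9]+ ?)+(?<! )$"

def accepts(pattern, text):
    """Return True if the anchored pattern matches text."""
    return re.search(pattern, text) is not None

for text in ["AB H", "AB ", "A  B", "ab $ cd"]:
    print(f"{text!r:10} equals={accepts(WITH_EQUALS, text)} "
          f"lookahead={accepts(NEG_LOOKAHEAD, text)} "
          f"lookbehind={accepts(NEG_LOOKBEHIND, text)}")
```

Only the `(?!= $)` form accepts the string with a trailing space. All three patterns nest one quantifier inside another, so on long non-matching input the anchored `^\w+( \w+)*$` shape from the first answer is likely the safer choice.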
0

\w matches _ as well as alphanumerics. So if you don't want to match underscores, you'd have to use [a-zA-Z\d] instead.

The following expression should cover your needs:

^[a-zA-Z\d]+(?: [a-zA-Z\d]{2,})*$

Alternatively you could use the following if {min,max} repetition is not supported.

^[A-Za-z\d]+(?: [A-Za-z\d][A-Za-z\d]+)*$

We need the {min,max} or double character group because of your requirement of minimum 2 characters from the second word onwards.

If underscores are allowed then the following expressions would be better:

^\w+(?: \w{2,})*$

or without {min,max}:

^\w+(?: \w\w+)*$
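
To see what the {2,} quantifier changes, here is a minimal sketch comparing it with a plain +. It assumes Python's `re` module and passes `re.ASCII` so that `\d` means `[0-9]` only; the `+` variant and all names are added for comparison and are not part of this answer.

```python
import re

# Pattern from this answer: later words need at least two characters.
AT_LEAST_TWO = r"^[a-zA-Z\d]+(?: [a-zA-Z\d]{2,})*$"
# Comparison variant (an assumption, not from this answer): one or more.
AT_LEAST_ONE = r"^[a-zA-Z\d]+(?: [a-zA-Z\d]+)*$"

def accepts(pattern, text):
    """Return True if the anchored pattern matches text (ASCII-only \\d)."""
    return re.search(pattern, text, flags=re.ASCII) is not None

for text in ["Hello Wo", "Hello A", "H A B", "a_b"]:
    print(f"{text!r:12} two_or_more={accepts(AT_LEAST_TWO, text)} "
          f"one_or_more={accepts(AT_LEAST_ONE, text)}")
```

The {2,} form rejects Hello A and H A B, while the + form accepts them; both reject a_b because the underscore is outside the explicit class.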

rvalvik