Regex for multiple words split by spaces

Question

I am at the point where I am banging my head against my desk, to the amusement of my colleagues. I currently have the following regex:

(^[\w](( \w+)|(\w*))*[\w]$)|(^\w$)

What I want it to do is match any string which contains only alphanumeric characters, no leading or trailing whitespace and no more than one space between words.

A word in this case is defined as one or more alphanumeric characters.

This matches most of what I want, however from testing it also thinks the second word onwards must be of 2 characters or more in length.

Tests:

ABC - Pass
Type 1 - Fail
Type A - Fail
Hello A - Fail
Hello Wo - Pass
H A B - Fail
H AB - Pass
AB H - Fail

Any ideas where I'm going wrong?

@Bergi: This should be an answer. It doesn't get any simpler and better than this (well, OK, you could use a non-capturing group). — Tim Pietzcker, Mar 04 '13 at 15:28
I was not sure whether all these capturing groups in the OPs complicated version might have been intended… — Bergi, Mar 04 '13 at 15:38
@Bergi tbh I was just trying anything at this point and the regex just kept getting bigger and bigger haha. Your answer is exactly what I was intending. Thanks — Jon Taylor, Mar 04 '13 at 15:40

Reinstate Monica -- notmaynard · Accepted Answer · score 9 · 2013-03-04T15:16:10.607

Your regex is close. The cause of your two-character problem is here:

(^[\w](( \w+)|(\w*))*[\w]$)|(^\w$)
       right here ---^

Every word after the first must be matched by the group ( \w+), i.e. a space followed by one or more \w, because that is the only part of the pattern that can consume a space. The mandatory [\w] that follows then demands one more character, so the final word in the string must have two or more characters. Take that one out and it should be fine:

(^[\w](( \w+)|(\w*))*$)|(^\w$)
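
To see the effect of removing the trailing [\w], the two patterns can be run side by side against the test strings from the question. This is a minimal sketch using Python's `re` module (an assumption, since the question does not name a regex engine); the names `ORIGINAL`, `FIXED`, `CASES` and `matches` are illustrative only.

```python
import re

# Both patterns are copied verbatim from the answer above.
ORIGINAL = re.compile(r"(^[\w](( \w+)|(\w*))*[\w]$)|(^\w$)")
FIXED = re.compile(r"(^[\w](( \w+)|(\w*))*$)|(^\w$)")

# Test strings from the question, without the " - Pass/Fail" suffixes.
CASES = ["ABC", "Type 1", "Type A", "Hello A", "Hello Wo", "H A B", "H AB", "AB H"]


def matches(pattern, text):
    """Return True if the compiled pattern matches the text."""
    return pattern.search(text) is not None


for text in CASES:
    print(f"{text!r:12} original={matches(ORIGINAL, text)} fixed={matches(FIXED, text)}")
```

With the original pattern, only the strings whose last word has two or more characters are accepted, which lines up with the Pass/Fail list in the question; the fixed pattern accepts all eight.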

A simpler version would be:

^\w+( \w+)*$
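
As a rough check of the simpler pattern, here is a sketch in Python (an assumed engine; the helper name `is_valid` is made up for illustration). It uses `re.fullmatch` because in Python `$` also matches just before a trailing newline, which would otherwise let `"ABC\n"` through. Note that in Python 3 `\w` also matches underscores and non-ASCII letters and digits.

```python
import re

# Pattern copied from the answer above.
SIMPLE = re.compile(r"^\w+( \w+)*$")


def is_valid(text):
    """Return True if text is words separated by single spaces.

    fullmatch requires the whole string to be consumed, so a trailing
    newline is rejected even though `$` alone would tolerate it.
    """
    return SIMPLE.fullmatch(text) is not None


for text in ["Type 1", "H A B", " ABC", "ABC ", "A  B", "", "ABC\n"]:
    print(f"{text!r:10} -> {is_valid(text)}")
```

Single-character words are accepted, while leading spaces, trailing spaces, doubled spaces and the empty string are rejected.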

edited Mar 04 '13 at 15:16

answered Mar 04 '13 at 15:09

Naw, these things can be tricky and messy. It always helps to have a fresh pair of eyes look at code you've gone over and over. – Reinstate Monica -- notmaynard Mar 04 '13 at 15:23
That regex with its nested quantifiers and overlapping scopes of alternation looks like a high-risk candidate for catastrophic backtracking. @JonTaylor's requirements can be specified much more concisely and precisely. See Bergi's comment. – Tim Pietzcker Mar 04 '13 at 15:27
@TimPietzcker Right, hence the simpler version I gave. – Reinstate Monica -- notmaynard Mar 04 '13 at 15:28
Ah, I overlooked that last line (but Bergi beat you to it :)) – Tim Pietzcker Mar 04 '13 at 15:30
@TimPietzcker actually he beat Bergi to it :) – Jon Taylor Mar 04 '13 at 15:34
`\w` is not a standardized character class, and may include symbols other than alphanumerics. Specifically, in Perl it's defined as `[A-Za-z0-9_]`. Caveat emptor. – Todd A. Jacobs Mar 04 '13 at 15:40
@CodeGnome ok thanks, so as someone else suggested in a comment I should use [A-Za-z0-9], also if I only wanted lowercase on a UTF8 based system how would I do this since [a-z] would also cover uppercase on a utf8 system. – Jon Taylor Mar 04 '13 at 15:45
I believe `[a-z]` only matches lowercase regardless of character encoding. – Reinstate Monica -- notmaynard Mar 04 '13 at 16:20
Doesn't it count as a character range? Since UTF encodings group letters AaBbCc etc., a-z will match some uppercase too. A-Z would also have the same problem. In ASCII this is not a problem since A-Z and a-z are separate groups of letters within the encoding. – Jon Taylor Mar 04 '13 at 16:24
UTF-8 encodes the basic Latin alphabet exactly the same as ASCII, so A-Z and a-z remain separate ranges. If there is a problem with the locale, you can use the POSIX standard class `[:lower:]`, written `[[:lower:]]` inside a bracket expression (assuming your language is POSIX-compliant). I don't know when this would really be an issue, but if it is needed, there you go. – Reinstate Monica -- notmaynard Mar 04 '13 at 16:42
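
The points raised in these comments about `\w` and `[a-z]` can be checked with a short sketch. It assumes Python 3's `re` module, which has no POSIX bracket classes such as `[[:lower:]]`, so explicit ranges stand in for them; the helper name `full` is illustrative only.

```python
import re


def full(pattern, text, flags=0):
    """Return True if the pattern matches the entire text."""
    return re.fullmatch(pattern, text, flags) is not None


# \w includes the underscore, not just letters and digits.
print(full(r"\w+", "snake_case"))            # True

# For str patterns in Python 3, \w is Unicode-aware unless re.ASCII is set.
print(full(r"\w+", "caf\u00e9"))             # True
print(full(r"\w+", "caf\u00e9", re.ASCII))   # False

# [a-z] is a code point range and covers only the 26 lowercase ASCII letters.
print(full(r"[a-z]+", "abc"))                # True
print(full(r"[a-z]+", "aBc"))                # False

# Explicit alphanumeric ranges exclude the underscore.
print(full(r"[A-Za-z0-9]+( [A-Za-z0-9]+)*", "snake_case"))  # False
```

Other engines define `\w` differently, so the Unicode results above are specific to Python.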

Todd A. Jacobs · Answer 2 · score 2 · 2013-03-04T15:44:40.030

Use PCRE with POSIX Class

First, we need to clean up your corpus, since the test lines contain dashes and Pass/Fail labels. Next, we add a line that will definitely fail so we have a sad path for testing. This yields the following corpus:

# /tmp/corpus
ABC
Type 1
Type A
Hello A
Hello Wo
H A B
H AB
AB H
ab $ cd

Next, we use an anchored Perl-compatible regular expression with a POSIX class that only includes alphanumeric values. We use negative lookahead to prevent trailing spaces, but allow a single space between words.

$ pcregrep '^([[:alnum:]]+(?!= $) ?)+$' /tmp/corpus
ABC
Type 1
Type A
Hello A
Hello Wo
H A B
H AB
AB H

As expected, this yields the 8 valid lines you were expecting. Success!

edited Mar 04 '13 at 15:44

answered Mar 04 '13 at 15:27

This doesn't reject strings that end in spaces. – Tim Pietzcker Mar 04 '13 at 15:28
@TimPietzcker Thanks for pointing that out; I fixed it with a negative lookahead. Of course, not all greps or regular expression engines support that feature, but I think this result is a lot easier to read if your tool supports it. – Todd A. Jacobs Mar 04 '13 at 15:37
`(?!=$)` doesn't do what you think it does. Instead, you need a lookbehind `(?<! )` at the end of the string. A lookahead won't work, unless you use `(?=\w$)`. – Tim Pietzcker Mar 04 '13 at 16:44
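
The difference described in the last comment can be reproduced with a small sketch. It assumes Python's `re` module, where `[A-Za-z0-9]` stands in for `[[:alnum:]]` because `re` has no POSIX classes; the names `AS_POSTED` and `LOOKBEHIND` are illustrative. The construct `(?!= $)` is a negative lookahead for the literal text `= ` at the end of the line, so it never blocks a plain trailing space.

```python
import re

ALNUM = "[A-Za-z0-9]"  # stand-in for the POSIX class [[:alnum:]]

# The pattern as posted in the answer: (?!= $) looks for "= " at the end.
AS_POSTED = re.compile(rf"^({ALNUM}+(?!= $) ?)+$")

# Variant using the lookbehind suggested in the comment: the character
# just before the end of the string must not be a space.
LOOKBEHIND = re.compile(rf"^({ALNUM}+ ?)+(?<! )$")

for text in ["AB H", "AB H ", "A  B", "ab $ cd"]:
    posted = AS_POSTED.search(text) is not None
    fixed = LOOKBEHIND.search(text) is not None
    print(f"{text!r:10} as_posted={posted} lookbehind={fixed}")
```

Only the string with a trailing space separates the two: the posted pattern accepts it and the lookbehind variant rejects it. The corpus in the answer contains no such line, which is why the `pcregrep` output looked correct.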

score 0 · Answer 3 · answered Mar 04 '13 at 15:34

\w matches _ as well as alphanumerics. So if you don't want to match underscores you'd have to use [a-zA-Z\d] instead.

The following expression should cover your needs:

^[a-zA-Z\d]+(?: [A-Za-z\d]{2,})*$

Alternatively you could use the following if {min,max} repetition is not supported.

^[A-Za-z\d]+(?: [A-Za-z\d][A-Za-z\d]+)*$

We need the {min,max} or double character group because of your requirement of minimum 2 characters from the second word onwards.

If underscores are allowed then the following expressions would be better:

^\w+(?: \w{2,})*$

or without {min,max}:

^\w+(?: \w\w+)*$
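
The question describes the two-character minimum as unwanted behaviour, so it may be worth checking which reading you need. The sketch below assumes Python's `re` module and compares the `{2,}` pattern from this answer with a hypothetical variant, `ONE_PLUS`, that uses `+` instead; that variant is not part of the answer.

```python
import re

# Pattern copied from this answer: later words need two or more characters.
TWO_PLUS = re.compile(r"^[a-zA-Z\d]+(?: [A-Za-z\d]{2,})*$")

# Hypothetical variant (not from the answer): later words need one or more.
ONE_PLUS = re.compile(r"^[a-zA-Z\d]+(?: [A-Za-z\d]+)*$")

# Test strings from the question.
CASES = ["ABC", "Type 1", "Type A", "Hello A", "Hello Wo", "H A B", "H AB", "AB H"]

for text in CASES:
    two = TWO_PLUS.search(text) is not None
    one = ONE_PLUS.search(text) is not None
    print(f"{text!r:12} two_plus={two} one_plus={one}")
```

The `{2,}` pattern rejects strings such as "Type 1" and "AB H", while the `+` variant accepts all eight test strings.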
