
I am new to regular expressions. I have been reading about regex for the last couple of hours to understand how to use regex* to achieve the following, but without much luck. My brain has started hurting, hence this call for help. The following are the restrictions I want to apply to a data input field. What regular expression should I use?

  1. The first and last character should be either alphanumeric, "." (i.e. dot) or "_" (i.e. underscore)
  2. The characters between the first and last characters can be alphanumeric, "." (i.e. dot), "_" (i.e. underscore) or "-" (i.e. hyphen)
  3. Characters "." (i.e. dot) and "-" (i.e. hyphen) cannot appear consecutively.
  4. There should be at least one alphanumeric character in the input.

Some valid input data:

.abc_
__abc.d-e.
.__a.
.a__b.
_a-b.
abc
a___.

Thanks and regards,

~Plug

  • I am using a third-party library that internally uses boost-regex to parse the expression.

3 Answers


You should really show what you've tried so far.

That said, a regex to cover your restrictions should look a little like this:

^[a-zA-Z0-9\._](?:[a-zA-Z0-9_]*(?:\.(?!\.))*(?:-(?!-))*[a-zA-Z0-9_]*)*[a-zA-Z0-9\._]$

Someone might well come along with a more nicely formatted one, but it seems to work in http://www.regex101.com/ for everything I've tested it on.

ydaetskcoR
  • I've made a variation of your regex, it handles rule 3 for the first and last char, and also rule 4 (by way of a positive lookahead.) `^(?=.*[[:alnum:]])[[:alnum:]_.](?:[[:alnum:]_]|(?<!\.)\.|-(?!-))*(?:[[:alnum:]_]|(?<!\.)\.)$` – Hasturkun Jun 04 '13 at 16:17
  • The above regex works OK when there is a match, but can easily go into [catastrophic backtracking](http://www.regular-expressions.info/catastrophic.html) when it doesn't match. That is, it has the classic form `^(a*a*)*$`, which, when applied to the string `"aaaaaaaaaab"`, requires many, _many_ iterations to declare match failure. – ridgerunner Jun 05 '13 at 15:37

This is very messy to do with a single regex. It's not actually impossible, but you'd be jumping through crazy hoops to do it, to the point that you'd be better off writing a state machine. However, it's easy to do this with a series of regex tests.

For your conditions 1 and 2 the text should match the following (allowing that the text may be only one character long):

 ^([a-z0-9._]|[a-z0-9._][a-z0-9_.-]*[a-z0-9._])$  

For your condition 3, the text should not match one of these regexes (choose whichever is appropriate; your spec is not quite clear).

 .*[.-][.-].*
 .*(\.-|-\.).*
 .*(\.\.|--).*

For your condition 4, the text should match the following:

 .*[a-z0-9].*  

I haven't allowed for upper case characters here. Add those to the character patterns if required.
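
The series of tests above can be sketched in Java roughly as follows. This is only an illustration under some assumptions: the class name `FieldChecks` and the method `isValid` are made up for the example, condition 3 is read in its strictest sense (no two adjacent characters may both be a dot or a hyphen, i.e. the first of the three patterns), and upper case is allowed by compiling with `Pattern.CASE_INSENSITIVE`.

```java
import java.util.regex.Pattern;

// Hypothetical helper; the class and method names are illustrative only.
final class FieldChecks {
    // Conditions 1 and 2: only allowed characters, and no hyphen at either end.
    private static final Pattern SHAPE = Pattern.compile(
        "^([a-z0-9._]|[a-z0-9._][a-z0-9_.-]*[a-z0-9._])$",
        Pattern.CASE_INSENSITIVE);

    // Condition 3 (assumed reading): two adjacent characters that are both '.' or '-'.
    private static final Pattern ADJACENT_PUNCT = Pattern.compile("[.-][.-]");

    // Condition 4: at least one alphanumeric character somewhere in the text.
    private static final Pattern HAS_ALNUM =
        Pattern.compile("[a-z0-9]", Pattern.CASE_INSENSITIVE);

    static boolean isValid(String input) {
        return SHAPE.matcher(input).matches()          // must match
            && !ADJACENT_PUNCT.matcher(input).find()   // must NOT match
            && HAS_ALNUM.matcher(input).find();        // must match
    }
}
```

Each test is a simple pattern with no nested quantifiers, so a failing input is rejected quickly rather than triggering heavy backtracking.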

mc0e

Interesting problem. Can be solved with a non-trivial regex. Here it is in Java syntax (which requires the regex to be enclosed in a string.)

Pattern re_valid = Pattern.compile(
    "    # Regex to validate special word requirements.                                   \n" +
    "    ^                             # Anchor to start of string. And...                \n" +
    "    (?=[A-Za-z0-9._])             # First char is alphanum, dot or underscore. And...\n" +
    "    (?=.*[A-Za-z0-9._]$)          # Last char is alphanum, dot or underscore. And... \n" +
    "    (?=[^A-Za-z0-9]*[A-Za-z0-9])  # Contains at least one alphanum.                  \n" +
    "    (?:                           # Group two possible content formats.              \n" +
    "      [A-Za-z0-9_]+               # Case 1: Begins with one or more non-[-.].        \n" +
    "      (?:                         # Zero or more [-.] separated parts.               \n" +
    "        [-.]                      # Each part separated by one [-.],                 \n" +
    "        [A-Za-z0-9_]+             # followed by one or more non-[-.].                \n" +
    "      )*                          # Zero or more [-.] separated parts.               \n" +
    "      [.]?                        # May end with one dot.                            \n" +
    "    | [.]                         # Or Case 2: Begins with a dot.                    \n" +
    "      (?:                         # Zero or more [-.] separated parts.               \n" +
    "        [A-Za-z0-9_]+             # One or more non-[-.],                            \n" +
    "        [-.]                      # followed by one [-.].                            \n" +
    "      )*                          # Zero or more [-.] separated parts.               \n" +
    "      [A-Za-z0-9_]*               # May end with zero or more non-[-.].              \n" +
    "    )                             # End group of two content alternatives.           \n" +
    "    $                             # Anchor to end of string.                         ", 
    Pattern.COMMENTS);
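
A pattern like this is applied with `matcher(...).matches()`. The sketch below is a self-contained illustration: it uses a compacted, comment-free copy of the same expression (so `Pattern.COMMENTS` is not needed), and the class name `SpecialWord` and method `isValid` are invented for the example.

```java
import java.util.regex.Pattern;

// Hypothetical wrapper; names are illustrative only.
final class SpecialWord {
    // Same expression as above with whitespace and comments stripped.
    private static final Pattern RE_VALID = Pattern.compile(
        "^(?=[A-Za-z0-9._])(?=.*[A-Za-z0-9._]$)(?=[^A-Za-z0-9]*[A-Za-z0-9])"
        + "(?:[A-Za-z0-9_]+(?:[-.][A-Za-z0-9_]+)*[.]?"
        + "|[.](?:[A-Za-z0-9_]+[-.])*[A-Za-z0-9_]*)$");

    static boolean isValid(String input) {
        // matches() requires the whole input to match the pattern.
        return RE_VALID.matcher(input).matches();
    }
}
```

Note that this expression treats any adjacent pair of dot/hyphen characters (`..`, `--`, `.-`, `-.`) as invalid, which is one possible reading of rule 3.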
ridgerunner
  • For extra kudos, write it so it processes the input from start to finish with no backtracking. – mc0e Jun 05 '13 at 14:36
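
One way to process the input from start to finish with no backtracking is to drop the regex and scan the characters once, which is essentially the state machine mentioned in an earlier answer. The following is a minimal sketch, not a drop-in replacement: the class name `SinglePassValidator` is hypothetical, "alphanumeric" is assumed to mean ASCII letters and digits, and rule 3 is assumed to forbid any two adjacent characters that are both a dot or a hyphen.

```java
// Hypothetical single-pass validator; names and rule-3 reading are assumptions.
final class SinglePassValidator {
    static boolean isValid(String s) {
        if (s.isEmpty()) {
            return false;
        }
        boolean sawAlnum = false;   // rule 4: at least one alphanumeric seen
        boolean prevPunct = false;  // previous character was '.' or '-'
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            boolean alnum = (c >= 'a' && c <= 'z')
                         || (c >= 'A' && c <= 'Z')
                         || (c >= '0' && c <= '9');
            boolean punct = (c == '.' || c == '-');
            if (!alnum && !punct && c != '_') {
                return false;       // character outside the allowed set
            }
            if (c == '-' && (i == 0 || i == s.length() - 1)) {
                return false;       // rule 1: no hyphen in first or last position
            }
            if (punct && prevPunct) {
                return false;       // rule 3 (assumed reading): adjacent '.'/'-'
            }
            sawAlnum |= alnum;
            prevPunct = punct;
        }
        return sawAlnum;
    }
}
```

Each character is examined exactly once, so the running time is linear in the input length for both valid and invalid inputs.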