2

I need a Regex that matches an email address, local@domain, with the following requirements:

Local-part can contain A-z, 0-9, dot, underscore and dash.

Domain can contain A-z, 0-9, dot and dash. The domain needs to contain at least one dot.

How do I make sure that the domain can't start or end with a dot or dash, and that the domain contains at least one dot?

These are the two things that really cause me problems in trying to solve it.

Have tried the following:

Regex.IsMatch(email, @"(?:[^.-])([\w.-])@([\w.-])(?:[.-]$)");
John Lag
  • Well the question is (as stupid as it is): Why doesn't this regular expression work? What am I not getting? I am pretty new to regular expressions, but have tried my best to construct one. However, it doesn't pass my unit-tests and I can't seem to get more data in regards to why it doesn't work. @hatchet – John Lag Apr 28 '17 at 23:11
  • If you want an email regex, the two questions linked above answer that. If you literally just want to know why the regex you wrote does not deliver the requirements you listed, then you might want to edit your question and title so your question isn't closed as a duplicate. – hatchet - done with SOverflow Apr 28 '17 at 23:24
  • Did read both the linked questions before posting. Might be bad wording, but my main problem is in regards to the two (now specified) problems given the before-mentioned specs. – John Lag Apr 28 '17 at 23:37

4 Answers

9

The correct answer is:

ONLY validate that @ is present, the domain portion matches a few simple rules, the local part is <= 64 characters, and the whole thing is <= 254 characters.

Yeah, exclude completely illegal characters, ok. And make sure you use the last @ symbol, not the first. See RFC 1035 for all the goods on valid domain names. Maybe RFC 819 could help.
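The checks listed above can be sketched in a few lines of C#. This is only an illustration under stated assumptions: the class and method names are made up for this example, and the per-label domain rule (non-empty labels that neither start nor end with a dash) is one hypothetical reading of "a few simple rules", not something the RFCs are quoted on here.

```csharp
using System;

public static class EmailSanity
{
    // Length limits quoted in the answer above.
    private const int MaxLocalLength = 64;
    private const int MaxTotalLength = 254;

    public static bool LooksPlausible(string address)
    {
        if (string.IsNullOrEmpty(address) || address.Length > MaxTotalLength)
        {
            return false;
        }

        // Split on the LAST '@', as the answer advises.
        int at = address.LastIndexOf('@');
        if (at < 1 || at == address.Length - 1)
        {
            return false; // no '@', empty local part, or empty domain
        }

        string local = address.Substring(0, at);
        string domain = address.Substring(at + 1);
        if (local.Length > MaxLocalLength)
        {
            return false;
        }

        // Assumed domain rule: dot-separated, non-empty labels that
        // neither start nor end with a dash.
        foreach (string label in domain.Split('.'))
        {
            if (label.Length == 0 || label[0] == '-' || label[label.Length - 1] == '-')
            {
                return false;
            }
        }
        return true;
    }
}
```

Note that this deliberately leaves the local part alone apart from its length, so an address such as `user+tag@example.com` passes.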

If you are using HTML, then just use an email input, <input type="email" autocomplete="off" autocorrect="off">, and you'll get most of the right stuff out-of-the-box enforced by the browser itself, with no work on your part. See email input validation at MDN. Though be aware that even this can be too restrictive depending on the browser. See this bug where the proper behavior for browsers is to accept unicode in email addresses (accept IDN labels), then perform translation from a U-label to an A-label, and only then perform the validation.

You can, if you want to, also check if the domain can be found in DNS, but this is an unnecessary and effort-heavy step.

Why am I yelling this in giant print? Because so. Stinking. Many sites get this completely and horribly wrong.

First, read this: I Knew How To Validate An Email Address Until I Read The RFC.

Here, reason with me for a moment.

Given, as the article says, there are basic rules about:

  • the @ symbol,
  • maximum lengths of the various parts,
  • completely forbidden characters,
  • and the obvious fact that the ONLY determiner of whether an email address is valid is the issuer of that email address—the domain owner,

Then, the only determiner of whether an email address will actually reach anyone is to send an email to the address, and see if the person gets it, because this is the only way to ask the issuer of that email address if there's an actual user account associated with it.

Think about postal mail. Let's suppose someone gives you a funny address, say, this one:

AAB!129 Thor Circle 1/2 atomized Pile$
Armelioborrigenduliamo, GRICKL, θ-niner *
18957382:90347342;21017900~19127734.6
THE MOON

Since you've likely never sent mail to the moon before, are you SURE you want to judge lunar mail addresses by the standards of the region you're familiar with? How do YOU know that's not a valid address? What if those folks just do it weird? If you were a company planning to do business with your customers—and make piles of money—why do you care if their address is weird just so long as the address works?

In fact, this reality that you can't validate another authority's address is proven by a standard business practice in the U.S.: postal address scrubbing. What that means is, when someone submits a mailing address to you, you send off an API call to the US Postal Service asking if this is a valid address, and furthermore asking for the canonical form of it. That's because only the post office can tell you if the address is valid. And even then, you don't know if your letter will get to anyone until you try sending one!

Why, then, would you be so presumptuous as to stop someone from using a perfectly valid email address, known by their email provider to be valid (similar to sending mail to another country or even another planet), just because it has some format you're not used to or that you ignorantly assume is wrong?

If you are just trying to avoid bad email addresses due to typos, you can still do that. Display to the user "Hey, something about your address doesn't look quite right. Are you sure that it contains these characters you've selected? !#$%^&*()"{}[]`~ Remember, if we can't email you, you can't create an account." Then people get the warning, but if they really want to, they can still submit it. (Okay, yeah, exclude completely forbidden characters. The ones I listed are not necessarily valid. Look it up. You should look this up. Really. You shouldn't take the word of some random internet person. Get understanding.)

Go ahead and even make it slightly painful—make them submit twice. Or check a box and submit a second time. Just don't stop them from using whatever they want.

I have personally at times decided NOT to use web sites or services that could not accept email addresses with a plus sign in the local part (before the @). And if I simply must have the account, I grit my teeth, get a little bit angry, and then submit a different address than the one I really want to use.

Unless you really want to reduce the set of customers you can do business with. Then go ahead and be too restrictive...

Okay, at this point, you hate me

You think I'm overreacting. You just want to validate your email addresses! Can it be so hard? In fact, you're just going to ignore me and go ahead and write one that will do the job. It's good enough.

Well fine. If you won't listen to sense or reason, then here's some regex for you that does it right. (Only, I don't actually know if it does it right, but I'm willing to bet it does it a darn sight closer to right than anything anyone here is going to come up with on their own in less than days and days of work.)

The magic email-validating regex

(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t]
)+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:
\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(
?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ 
\t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\0
31]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\
](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+
(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:
(?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z
|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)
?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\
r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[
 \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)
?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t]
)*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[
 \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*
)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t]
)+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*)
*:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+
|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r
\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:
\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t
]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031
]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](
?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?
:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?
:\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*)|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?
:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?
[ \t]))*"(?:(?:\r\n)?[ \t])*)*:(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] 
\000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|
\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>
@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"
(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t]
)*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\
".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?
:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[
\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-
\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(
?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;
:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([
^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\"
.\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\
]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\
[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\
r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] 
\000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]
|\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \0
00-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\
.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,
;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?
:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*
(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".
\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[
^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]
]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*)(?:,\s*(
?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\
".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(
?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[
\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t
])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t
])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?
:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|
\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?:
[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\
]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n)
?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["
()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)
?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>
@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[
 \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,
;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t]
)*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\
".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)?
(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".
\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:
\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\[
"()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])
*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])
+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\
.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z
|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:(
?:\r\n)?[ \t])*))*)?;\s*)

Of course, it's been line-broken. Remove the newlines.

ErikE
  • I appreciate the feedback and the effort. I really do. And in reality you are probably right. However, my question is not: How do I make an optimal solution (which in regards to email and regex probably is not to do it and just send a validation mail). I have a specific task with a clear specification and a problem that I need to solve. I hope someone can help with that. – John Lag Apr 29 '17 at 00:14
  • @JohnLag I appreciate your attitude. I hope you detected that I'm not really quite as mad as my answer might make it seem. I hoped that some humor would make people sit up and pay attention. And while I have sympathy for your clear specifications, I honestly, completely, sincerely believe the right action for you is to push back and say *this is not the right thing to do, and it could lose us business*. Show them my answer! For what it's worth, I'm pretty good with regex and could probably solve your problem in no time, but I am finding it difficult to make myself serve evil... grin. – ErikE Apr 29 '17 at 00:19
  • @JohnLag Show them the RFC. Erik is right. – 15ee8f99-57ff-4f92-890c-b56153 Apr 29 '17 at 00:34
  • @ErikE - after reading your two answers, I wish now that I hadn't voted to close the question. – hatchet - done with SOverflow Apr 29 '17 at 17:55
1

EDIT

I think this would work:

Regex regex = new Regex(@"^([\w\.\-]+)@((?!\.|\-)[\w\-]+)((\.(\w){2,3})+)$");

NOTE: domain names cannot have dots and spaces

However, instead of using a regex you can try the MailAddress class (System.Net.Mail). This way you don't have to break your head over understanding someone else's regex.

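using System;
using System.Net.Mail;

public bool IsEmailValid(string address)
{
    // The MailAddress constructor throws ArgumentNullException or
    // ArgumentException (not FormatException) for null or empty input.
    if (string.IsNullOrEmpty(address))
    {
        return false;
    }

    try
    {
        // Parsing succeeds only if the string is in a form MailAddress accepts.
        MailAddress m = new MailAddress(address);
        return true;
    }
    catch (FormatException)
    {
        return false;
    }
}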
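One reason a successful MailAddress parse may not match a narrow spec like the one in the question: the constructor also accepts a display name followed by the address in angle brackets. The helper below is a hypothetical sketch (the class and method names are not from the answer) that returns the bare address MailAddress extracted, or null when parsing fails.

```csharp
using System;
using System.Net.Mail;

public static class MailAddressDemo
{
    // Returns the address part that MailAddress parsed out, or null on failure.
    public static string ExtractAddress(string input)
    {
        try
        {
            return new MailAddress(input).Address;
        }
        catch (FormatException)
        {
            return null; // not in a form MailAddress accepts
        }
        catch (ArgumentException)
        {
            return null; // null or empty input
        }
    }
}
```

For example, passing `John Lag <john@example.com>` yields `john@example.com`, so an input that is clearly not of the form local@domain still counts as "valid" to a plain try/catch check.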
FortyTwo
  • Makes sense. However, I can't see how this regex is making sure that the domain doesn't start with or end with dot or dash. And I did read about the MailAddress class before asking the question. The implementation didn't match the specs I was given. – John Lag Apr 28 '17 at 23:46
  • No, that's wrong. [I Knew How To Validate An Email Address Until I Read The RFC](http://haacked.com/archive/2007/08/21/i-knew-how-to-validate-an-email-address-until-i.aspx/) – ErikE Apr 29 '17 at 00:01
  • @JohnLag I have edited the regex to ensure the domain cannot start with a hyphen or dot – FortyTwo Apr 29 '17 at 00:14
  • @ErikE the specifications for the regex differ from a valid email address. foo/@foo.com would be a valid email address but the regex required should not accept it. So it is not wrong! – FortyTwo Apr 29 '17 at 00:22
  • I know you're attempting to correctly answer the question as given, and I applaud that, but think the RIGHT answer is to say "don't do that". I have to go with my convictions. In a bit I'm going to put in an answer that competes with yours, feel free to downvote it. (But please wait a few minutes, don't downvote my answer saying the *right* way to do it.) – ErikE Apr 29 '17 at 00:24
  • The right way would be to use the MailAddress class which I have mentioned in my answer that is if you want to validate any email. The question here states a different set of requirements. So the right way is quite difficult to comment on since one doesn't know the use case. Maybe they want to validate an email for their own domain and prevent spooky addresses. – FortyTwo Apr 29 '17 at 00:37
  • Inside of a character class `[]` you do NOT need to escape `.` or `-` (as long as the latter is first or last). Second, plenty of top-level domains are longer than 3 characters, e.g., `.info`. Your RegEx goes beyond the asker's requirements and also is broken. – ErikE Apr 29 '17 at 18:24
  • The use of the word "think" implies the solution is not 100% correct but something in those lines. Rather than voting down and commenting the same on every proposed answer to make your point, why not try and help the asker and be more constructive. Yes, the domain can be 4 characters long or 5 or 6 or up to 253 if I am not wrong. If you are aware of the domain length and email length why did you overlook it in your reply? "Your exact requirements" solution is also broken since it will validate an email address of any length. – FortyTwo May 02 '17 at 13:05
1

In a fit of dark and self-destructive insanity, I have decided to answer your question.

Your exact requirements:

  • Local-part can contain A-z, 0-9, dot, underscore and dash.
  • Domain can contain A-z, 0-9, dot and dash.
  • The domain needs to contain at least one dot.
  • The domain can't start with or end with dot or dash.

Your RegEx that meets these exact requirements and no more (case-insensitive match):

^[\w.-]+@(?=[a-z\d][^.]*\.)[a-z\d.-]*[^.]$

Try it out at regex101.com

I took some care to make sure this does almost no backtracking. At regex101.com you can see how many steps it took. Good addresses validate in 13 steps, which is pretty good. Addresses with a dot at the end are the worst performance because this causes backtracking one character at a time over the whole domain portion, but they are probably rare.

Please vote down this post as much as you vote down the other posts attempting to answer the question as given.

Then, see my other answer on this page and vote it up.

Breakdown:

Regex       Explanation
##########  #######################################################
         ^  start of string anchor, zero-width match

##########  "local" part
   [\w.-]+  character class, one or more of: word character, dot, or dash

##########  "@"
         @  literal @ character

##########  "domain" part
       (?=  begin positive lookahead group, zero-width match
   [a-z\d]  must begin with a letter or digit
     [^.]*  match zero or more characters that aren't literal .
        \.  match one literal . character
         )  end positive lookahead group
[a-z\d.-]*  match zero or more: letters, digits, dot, or dash
      [^.]  match exactly one character that isn't a literal . (a dash also qualifies)
         $  end of string anchor, zero-width match
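
A quick way to see what the pattern above accepts is to run it against a few inputs in C#. The sketch below is only illustrative: the class name `SpecRegexDemo` and the `Tightened` variant are assumptions of this example, not part of the answer. `Tightened` swaps the final `[^.]` for `[a-z\d]`, on the reading that the last character of the domain should exclude a dash as well as a dot.

```csharp
using System.Text.RegularExpressions;

public static class SpecRegexDemo
{
    // The pattern from the answer, unchanged, matched case-insensitively.
    public static readonly Regex Original = new Regex(
        @"^[\w.-]+@(?=[a-z\d][^.]*\.)[a-z\d.-]*[^.]$",
        RegexOptions.IgnoreCase);

    // Hypothetical variant: the final character must be a letter or digit.
    public static readonly Regex Tightened = new Regex(
        @"^[\w.-]+@(?=[a-z\d][^.]*\.)[a-z\d.-]*[a-z\d]$",
        RegexOptions.IgnoreCase);
}
```

With these definitions, `Original` accepts `john.doe@example.com` and rejects `a@b`, `a@.b.c` and `a@b.c.`, but it also accepts `a@b.c-`, because the final `[^.]` only rules out a dot. `Tightened` rejects that last input.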

Notes:

The positive lookahead is what asserts that the domain portion begins with a letter or digit and contains at least one dot after that. Lookaheads like this (and zero-width matches in general) can help avoid excess complexity in other parts of a regex. Multiple lookaheads can be used one after the other to assert different things about the remainder of the input. This lets a regex address multiple concerns that, if matched together in normal character-consuming patterns, would produce extremely complex regexes.
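
As an illustration of stacking lookaheads, the sketch below prepends two more assertions to the pattern from this answer: one for total length and one for local-part length. The limits (254 and 64) are the ones mentioned in the other answer on this page; the class and field names are assumptions made for this example.

```csharp
using System.Text.RegularExpressions;

public static class LengthLookaheadDemo
{
    // The answer's pattern without the leading ^ anchor, so it can be reused.
    private const string Body = @"[\w.-]+@(?=[a-z\d][^.]*\.)[a-z\d.-]*[^.]$";

    public static readonly Regex WithoutLimits = new Regex(
        "^" + Body, RegexOptions.IgnoreCase);

    // (?=.{1,254}$)     whole input is at most 254 characters
    // (?=[^@]{1,64}@)   at most 64 characters before the first @
    public static readonly Regex WithLimits = new Regex(
        @"^(?=.{1,254}$)(?=[^@]{1,64}@)" + Body, RegexOptions.IgnoreCase);
}
```

Each lookahead checks one concern from the same starting position and consumes nothing, so the character-consuming part of the pattern stays exactly as it was.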

It is best to avoid lookbehinds if possible, in part because some RegEx flavors do not support them, and the ones that do mostly don't support variable-width lookbehinds. (A possible workaround is to use a lookahead followed by a more liberal character-consuming match that ends at the same position as the lookahead does, and if doing replacement, replace the literally matched characters with themselves.)

In C#, \w includes a wide range of Unicode characters. This may or may not be what you're looking for. If not, you can leave the Regex as is and use ECMAScript-compliant mode (RegexOptions.ECMAScript). Or, you can just change it to a-z0-9_ (inside square brackets). But \w is shorter.

\d also includes some additional numeric characters:

\d matches any decimal digit. It is equivalent to the \p{Nd} regular expression pattern, which includes the standard decimal digits 0-9 as well as the decimal digits of a number of other character sets.

You can again use ECMAScript-compliant mode, or just change it to 0-9. But \d is shorter.
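
The difference is easy to observe directly. This is a minimal sketch with made-up helper names; it matches a single character against `\w` or `\d` with and without `RegexOptions.ECMAScript`.

```csharp
using System.Text.RegularExpressions;

public static class CharClassDemo
{
    // True if the one-character string s matches \w under the chosen mode.
    public static bool IsWordChar(string s, bool ecmaScript)
    {
        RegexOptions options = ecmaScript ? RegexOptions.ECMAScript : RegexOptions.None;
        return Regex.IsMatch(s, @"^\w$", options);
    }

    // True if the one-character string s matches \d under the chosen mode.
    public static bool IsDigit(string s, bool ecmaScript)
    {
        RegexOptions options = ecmaScript ? RegexOptions.ECMAScript : RegexOptions.None;
        return Regex.IsMatch(s, @"^\d$", options);
    }
}
```

In the default mode `"\u00E9"` (é) is a word character and `"\u0663"` (Arabic-Indic digit three) is a digit; in ECMAScript mode neither matches, while `a` and `7` match in both. Keep in mind that RegexOptions.ECMAScript can only be combined with IgnoreCase, Multiline and Compiled.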

Be aware that there are plenty of ways that this regex is seriously not good. It allows IP addresses in the domain portion (incorrect), and it doesn't limit the total length of the address or the length of the domain portion. It improperly restricts characters from the local portion that it should not restrict. It's not a good specification at all.

ErikE
0

You should learn regular expressions from a source such as http://www.regular-expressions.info - the attempt so far displays much missing knowledge, and the problem stated is significantly complex (even ignoring that a custom regex is almost certainly the wrong approach, though it might be a useful pre-filter).

Why doesn't this regex work? @"(?:[^.-])([\w.-])@([\w.-])(?:[.-]$)"

I'll explain by breaking the regex down into English (which in general is a great technique for regexes):

First off, all the parentheses (groups) here serve no functional purpose so I'm ignoring them (see tutorial for what they mean)

local

  • [^.-] - 1 character that's not a dot or dash
  • [\w.-] - 1 character that's alphanumeric or a dot or a dash

So the definition of local is any string that ends with the 2 character string with the above constraints.

@ - Literal, the character '@'

domain

  • [\w.-] - 1 character that's alphanumeric or a dot or a dash
  • [.-] - 1 character that's a dot or a dash
  • $ - End of the string.

So the definition of domain is a 2 character string with the constraints above.

This is clearly very far from the given problem.
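
To make the breakdown concrete, here is a small sketch that runs the pattern from the question against a few strings. The class name is invented for this example; the pattern itself is copied from the question unchanged.

```csharp
using System.Text.RegularExpressions;

public static class AttemptDemo
{
    // The asker's original pattern: two characters, @, one character,
    // then a dot or dash at the very end of the string.
    public static readonly Regex Attempt = new Regex(
        @"(?:[^.-])([\w.-])@([\w.-])(?:[.-]$)");
}
```

As the English reading predicts, `jo@e.` and `###jo@e-` match (nothing anchors the start, and the domain has to end in a dot or dash), while `john@example.com` and `ab@c.d` do not.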

What is a regex that satisfies the given constraints?

Regexes are essentially evaluated left to right in sequence. Express your constraints in a sequential set of descriptions, then translate those into regex constructs. I'll do this for precisely the given constraints (which I don't think are complete). Mentally insert a 'followed by' between every line.

beginning of string - ^ - Regex way to express beginning of string

local - [\w._-]* - Any number of (alphanumeric, dot, dash).

@ - @ - Literal character

domain

The key requirement is at least 1 dot. This dot will be explicitly present in the regex, so think of domain as {preDot}{dot}{postDot}. For simplicity, define {dot} as the first occurrence of '.'.

  • \w - Single alphanumeric character - this is the doesn't start with dot or dash requirement
  • [\w-]* - Any number of characters that are alphanumeric or dash
  • \. - Single (first) dot character - this is the special must exist dot
  • (\w*[\.-])* - Any number of (any number of alphanumeric characters followed by a dot or dash)
  • [\w-]+ - 1 or more alphanumeric or dash characters - this is the must not end in dot requirement

end of string - $ - Regex way to express end of string

And here is the corresponding code:

// Local part: any number of word characters, dots, or dashes (as described above).
var local = @"[\w._-]*";

// Domain: {preDot}{dot}{postDot}, where {dot} is the first dot.
var preDot = @"\w[\w-]*";
var dot = @"\.";
var postDot = @"(\w*[\.-])*[\w-]+";
var domain = $"{preDot}{dot}{postDot}";

var email = $"^{local}@{domain}$";

FYI - the regex ends up being ^[\w._-]*@\w[\w-]*\.(\w*[\.-])*[\w-]+$ but that's largely irrelevant, it would be horrible to try to understand/maintain/change it as a single string, while the breakdown is followable.
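
The domain portion of this breakdown can be checked on its own. The sketch below is illustrative only: the class name and the `Stricter` variant are assumptions of this example, not part of the answer. `AsWritten` uses the preDot/dot/postDot pieces exactly as given; `Stricter` replaces `\w` with `[A-Za-z0-9]` and requires the last character to be a letter or digit, which is one way to read "can't end with dot or dash".

```csharp
using System.Text.RegularExpressions;

public static class DomainRegexDemo
{
    // Pieces exactly as in the answer.
    private const string PreDot = @"\w[\w-]*";
    private const string Dot = @"\.";
    private const string PostDot = @"(\w*[\.-])*[\w-]+";

    public static readonly Regex AsWritten = new Regex(
        $"^{PreDot}{Dot}{PostDot}$");

    // Hypothetical variant: no underscore, and no dash as the last character.
    private const string Alnum = "[A-Za-z0-9]";
    public static readonly Regex Stricter = new Regex(
        $"^{Alnum}[A-Za-z0-9-]*{Dot}({Alnum}*[.-])*{Alnum}+$");
}
```

`AsWritten` accepts `example.com` and rejects `example`, `.example.com`, `-a.b` and `a.b.`, but it also accepts `a.b-` (the final `[\w-]+` allows a dash) and `a_b.c` (`\w` includes the underscore). `Stricter` rejects those two while still accepting `my-site.example.org`.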

Nabeel
  • So `@a.b` is a valid email address? (Looking only at your final regex.) – ErikE Apr 29 '17 at 18:05
  • `@a.b` is a string that matches the given constraints in the original question exactly - they allow `@a.b`. Which is exactly what the answer sets out to do - read it fully. – Nabeel Apr 30 '17 at 20:50
  • You read between the lines. So did I. I think your interpretation didn't quite get everything. That's all. It's just an opinion. – ErikE Apr 30 '17 at 22:27