
I'm trying to come up with a regex pattern that will match any domain in this format:

example.com

but not this:

subdomain.example.com

Currently it needs to only cover the main TLDs (com, net, org), but I'd like it to be able to handle others (like co.uk, com.br, etc.) for flexibility.

So far I've got this, but it definitely needs some work:

^[^w].*\.[a-z]{3}.*$

Could a regex ninja help me out?

EDIT: The regex will be used in PHP, and there is never a protocol at the beginning of the string to match, due to the setup of the script. I'd have to dig more into the script to get more details on why this is true, but I believe it is just grabbing the host name from the PHP $_SERVER variable.

EDIT 2: Perhaps this would work to cover anything but a period, up to something matching .xyz or .xyz.ab or .xyz.abc:

^[^.]+(\.[^.]{3}|\.[^.]{2,3}\.[^.]{2,3}).*$

EDIT 3: I've got a nearly complete pattern, updated below (PHP requires / delimiters at the beginning and end). Can anyone poke holes in the implementation? It appears to be working as expected.

EDIT 4: This is where I'm currently at (updated below). It matches nearly what I want, though it requires the / at the beginning of the file path, so example.com does not match while example.com/test does. I can't get it to match example.com without also matching the ".exa" in "www.example.com".
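
The ".exa" problem can be reproduced with the pattern from EDIT 2. The following is only a rough sketch: the sample hosts are picked for illustration, and the pattern is simply wrapped in the / delimiters that preg_match() needs.

```php
<?php
// Pattern from EDIT 2, wrapped in / delimiters for preg_match().
$pattern = '/^[^.]+(\.[^.]{3}|\.[^.]{2,3}\.[^.]{2,3}).*$/';

// Sample hosts chosen only to illustrate the behaviour.
$hosts = ['example.com', 'example.co.uk', 'www.example.com'];

// Record what capture group 1 holds for each host (null when nothing matched).
$results = [];
foreach ($hosts as $host) {
    $results[$host] = preg_match($pattern, $host, $m) === 1 ? $m[1] : null;
}

foreach ($results as $host => $suffix) {
    echo $host, ' => ', $suffix === null ? 'no match' : "group 1 is $suffix", PHP_EOL;
}
```

For www.example.com, group 1 comes back as .exa: the first alternative is satisfied by the first three characters of "example", and the trailing .* swallows the rest. Nothing in the pattern forces the 2-3 character chunk to end where the label ends.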

EDIT 5: Ok, we've got a winner: /^[^.]+((\.[^.\/]{1,3}\b){1,2}).*$/

Matches:
example.com
example.co.uk
example.com/test.php?a=b
example.co.uk/test.php?a=b
123.com
1234.com
www.123.com (the pattern matches any URL whose domain is shorter than 4 characters, even with a subdomain in front)

Doesn't match:
www.example.com
www.example.co.uk
www.example.com/test.php?a=b
www.example.co.uk/test.php?a=b
test.example.com/test.php?a=b
test.example.co.uk/test.php?a=b
www.1234.com
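
For anyone who wants to re-check the two lists, here is a minimal sketch that feeds them to preg_match() with the EDIT 5 pattern. The helper name checkPattern() is made up for this example; the inputs are copied from the lists above.

```php
<?php
// Pattern from EDIT 5, unchanged.
$pattern = '/^[^.]+((\.[^.\/]{1,3}\b){1,2}).*$/';

// Inputs from the "Matches" (true) and "Doesn't match" (false) lists.
$expected = [
    'example.com'                     => true,
    'example.co.uk'                   => true,
    'example.com/test.php?a=b'        => true,
    'example.co.uk/test.php?a=b'      => true,
    '123.com'                         => true,
    '1234.com'                        => true,
    'www.123.com'                     => true,
    'www.example.com'                 => false,
    'www.example.co.uk'               => false,
    'www.example.com/test.php?a=b'    => false,
    'www.example.co.uk/test.php?a=b'  => false,
    'test.example.com/test.php?a=b'   => false,
    'test.example.co.uk/test.php?a=b' => false,
    'www.1234.com'                    => false,
];

// checkPattern() is a hypothetical helper: it returns every input whose
// actual result differs from the expected one.
function checkPattern(string $pattern, array $expected): array
{
    $mismatches = [];
    foreach ($expected as $input => $shouldMatch) {
        $didMatch = preg_match($pattern, $input) === 1;
        if ($didMatch !== $shouldMatch) {
            $mismatches[] = $input;
        }
    }
    return $mismatches;
}

$mismatches = checkPattern($pattern, $expected);
echo $mismatches === [] ? 'all listed cases behave as described' : 'unexpected: ' . implode(', ', $mismatches), PHP_EOL;
```

The \b is what rejects www.example.com: after ".exa" the next character is another word character, so there is no word boundary. The same mechanism means a bare domain with a suffix longer than three characters (for instance example.info, which is not in the lists above) would not match either.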

MikeSmitty
  • What type of regex are we talking about? A mod-rewrite rule? Perl? PowerShell? – sysadmin1138 Feb 23 '11 at 23:44
  • Is this supposed to match an entire URL? And is this supposed to deal with any URL possible (`ftp://user:password@example.com:10021/some/dir/thing.exe`) or only the sane ones? – DerfK Feb 23 '11 at 23:45
  • It'll be used by a PHP script, and yes it would ideally match a whole URL, but also as little as the domain name. It only needs to deal with sane URLs, anything else will be covered by a blanket rule. – MikeSmitty Feb 23 '11 at 23:49
  • The script is a page that is a catch-all redirect. Anything that matches a list of URLs/regexes is redirected to the appropriate page. What we want it to do is take any URL in the form of example.com and redirect it to www.example.com, but not sub.example.com for any sane domain name. Anything else not covered hits a blanket rule. Also, currently we don't account for ports in the domain name, but it's grabbing the hostname from the PHP $_SERVER variable so I don't believe we need to account for that. – MikeSmitty Feb 23 '11 at 23:51

2 Answers


What language are you using?

In general, it sounds like you want something that matches the basic aspects of a domain, ruling out the possibility of a period other than the one that delineates the .tld.

#http://[^.]+\.(com|net|org)#i

If you don't want to match the protocol, maybe something like this:

#[^. ]+\.(com|net|org)#i
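
One thing worth checking with the protocol-less pattern is that it is not anchored, so preg_match() can find example.com inside a longer host. The sketch below compares it with an anchored variant; the anchored pattern is an assumption added for illustration, not part of the answer.

```php
<?php
// Protocol-less pattern from the answer, unchanged.
$unanchored = '#[^. ]+\.(com|net|org)#i';

// Assumed variant: anchored at the start, and the TLD must be followed by
// a slash or the end of the string.
$anchored = '#^[^. ]+\.(com|net|org)(?=/|$)#i';

$hosts = ['example.com', 'subdomain.example.com'];

$matches = [];
foreach ($hosts as $host) {
    $matches[$host] = [
        'unanchored' => preg_match($unanchored, $host) === 1,
        'anchored'   => preg_match($anchored, $host) === 1,
    ];
}

foreach ($matches as $host => $result) {
    echo $host,
        ': unanchored=', var_export($result['unanchored'], true),
        ' anchored=', var_export($result['anchored'], true), PHP_EOL;
}
```

The unanchored pattern reports a match for both hosts, because the search may start at the "e" of "example". With the ^ in place, subdomain.example.com is rejected.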

Your desire to handle multi-part TLDs will really screw this up: you will need to maintain a manual list of all the ones you want to match. The only alternative is to do DNS lookups to determine the listing type. There really isn't another way to extract subdomain data from the domain with a regular expression, because by rights domains are actually just subdomains of some TLD (top-level domain).
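
A manual list could be turned into a pattern along these lines. This is a sketch under assumptions: the suffix list is a short made-up sample, buildBareDomainPattern() is a hypothetical helper name, and the input is assumed to be a host with an optional path and no protocol.

```php
<?php
// Hypothetical, deliberately short suffix list; a real one would be
// maintained by hand as described above.
$suffixes = ['com', 'net', 'org', 'co.uk', 'com.br'];

// Hypothetical helper: builds an anchored pattern that accepts exactly one
// label followed by one of the listed suffixes.
function buildBareDomainPattern(array $suffixes): string
{
    // Escape each suffix so the dot in "co.uk" is matched literally.
    $alternatives = array_map(function ($suffix) {
        return preg_quote($suffix, '#');
    }, $suffixes);

    // One label without dots or slashes, a dot, a listed suffix, and then
    // either a slash (start of the path) or the end of the string.
    return '#^[^./]+\.(?:' . implode('|', $alternatives) . ')(?=/|$)#i';
}

$pattern = buildBareDomainPattern($suffixes);

$inputs = ['example.com', 'example.co.uk', 'example.com.br',
           'example.com/test.php?a=b', 'www.example.com', 'www.example.co.uk'];

$results = [];
foreach ($inputs as $input) {
    $results[$input] = preg_match($pattern, $input) === 1;
    echo $input, ' => ', $results[$input] ? 'bare domain' : 'not a bare domain', PHP_EOL;
}
```

Because the suffix has to be followed by a slash or the end of the string, example.com.br is only accepted through the com.br entry, not through com. The trade-off is the one described above: any suffix missing from the list is treated as not matching.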

Edit: To match TLDs, assuming they would have fewer than four characters, you can play around with something like this. You're going to have to work out what constitutes the start and end of a match. Are you requiring the presence of a protocol? Is this in a paragraph where somebody could type a period out of context? If you give more details on the parameters, we might be able to provide a more precise solution.

[^.]+((\.[^.]{0,3})+)
Caleb
  • What if we went on the assumption that no domain will be shorter than 4 characters? (e.g. abcd.com) Anything shorter than 4 characters currently doesn't exist, and is extremely unlikely to happen in this situation. – MikeSmitty Feb 24 '11 at 00:03
  • So maybe something like `^[^\.]*\.([a-zA-Z]{3}|[a-zA-Z]{2,3}\.[a-zA-Z]{2,3}).*$` – MikeSmitty Feb 24 '11 at 00:13
  • I believe that last rule will match anything but "." from the beginning to the first period it matches, then either something in the form of xyz or xyz.ab – MikeSmitty Feb 24 '11 at 00:15
  • You'll probably want to use + instead of * to make sure you get a match on the domain part and not an empty string. After that it looks alright to match two TLD segments of two or three characters each. – Caleb Feb 24 '11 at 00:19
  • The source string is the URL given to the script from the web server, so it would be expected to be a legitimate URL, and the protocol is never present. – MikeSmitty Feb 24 '11 at 00:19
  • Will this regex you posted match more than one instance of a 0-3 character word? `[^.]+((\.[^.]{0,3})+)` – MikeSmitty Feb 24 '11 at 00:30
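
The question in the last comment can be checked directly. A small sketch, with # delimiters added for preg_match() and sample hosts chosen for illustration:

```php
<?php
// Pattern from the answer's edit, wrapped in # delimiters.
$pattern = '#[^.]+((\.[^.]{0,3})+)#';

$hosts = ['example.co.uk', 'www.example.com'];

// Keep the full match and both capture groups for each host.
$captures = [];
foreach ($hosts as $host) {
    preg_match($pattern, $host, $m);
    $captures[$host] = $m;
    echo $host, ': full match "', $m[0], '", group 1 "', $m[1], '", group 2 "', $m[2], '"', PHP_EOL;
}
```

For example.co.uk, group 1 holds .co.uk, so the + on the inner group does repeat over more than one short segment, while group 2 only keeps the last repetition (.uk). For www.example.com the pattern still finds a match, but only on www.exa, since nothing anchors it to the end of the label.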

One of the best resources on the net for this is regexlib:

http://regexlib.com/Search.aspx?k=URL

http://regexlib.com/Search.aspx?k=TLD

There are numerous examples there of matching a protocol and TLD, or an entire query string, for validity.

iivel