
First, let's define a "URL" according to my requirements.

The protocol is optional, and only http:// and https:// are allowed,

then a mandatory domain name like stackoverflow.com,

then, optionally, the rest of the URL components (path, query, hash, ...).

For reference, here is a list of valid and invalid URLs according to my requirements:

VALID

INVALID

  • http://www (PHP's filter_var allows this; yes, I know it's a valid URL)
  • google
  • http://www..des (PHP's filter_var allows this)
  • Any URL with disallowed characters in the domain name

For completeness, here is my PHP version: 5.3.2-1ubuntu4.2

– Cesar
  • You have domain names under a TLD with dashes? Show me. – Pekka Sep 06 '10 at 19:31
  • I hope you do know that now there are [internationalized domain names](http://en.wikipedia.org/wiki/Internationalized_domain_name) which can make URL-validating regexes pretty messy. – NullUserException Sep 06 '10 at 19:32
  • Also, there are lots of things "valid" URLs can contain and are not specified in your question. For the complete spec see this: http://www.w3.org/Addressing/URL/url-spec.txt – NullUserException Sep 06 '10 at 19:37
  • @Pekka http://www.expert-sex-change.com used to redirect to stackoverflow.com (now expired) – NullUserException Sep 06 '10 at 19:39
  • 1
    @Null aarrgh, of course, I mixed it up with underscores `_`. I own a number of domains with dashes myself. Time for me to call it a day! I don't understand how filter_var can reject this, though. – Pekka Sep 06 '10 at 19:46

3 Answers


As a starting point you can use this one. It's written for JavaScript, but it's easy to adapt for PHP's preg_match:

/^(https?\://)?(www\.)?([a-z0-9]([a-z0-9]|(\-[a-z0-9]))*\.)+[a-z]+$/i

For PHP, this one should work:

$reg = '@^(https?\://)?(www\.)?([a-z0-9]([a-z0-9]|(\-[a-z0-9]))*\.)+[a-z]+$@i';

Note that this regexp validates only the domain part, but you can build on it, or split the URL at the first slash '/' (after "://") and validate the domain part and the rest separately.
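For instance, here is a minimal sketch of that split-and-validate idea (the function name `is_valid_url` is my own, and only the domain part is actually checked):

```php
<?php
// Sketch: strip the optional scheme, cut the URL at the first slash,
// and validate only the domain part with the pattern above.
function is_valid_url($url)
{
    // Strip the optional http:// or https:// scheme.
    $url = preg_replace('@^https?://@i', '', $url);

    // Everything up to the first slash is the domain part.
    $slash = strpos($url, '/');
    $domain = ($slash === false) ? $url : substr($url, 0, $slash);

    // The answer's pattern, minus the scheme group.
    $reg = '@^(www\.)?([a-z0-9]([a-z0-9]|(\-[a-z0-9]))*\.)+[a-z]+$@i';
    return (bool) preg_match($reg, $domain);
}

var_dump(is_valid_url('http://stackoverflow.com/questions?q=1')); // bool(true)
var_dump(is_valid_url('http://www'));                             // bool(false)
var_dump(is_valid_url('http://www..des'));                        // bool(false)
?>
```

The path and query parts are left unchecked here; tightening those is a separate exercise.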

BTW: it would also validate "http://www.domain.com.com", but that's not an error, because a subdomain URL like "http://www.subdomain.domain.com" is valid too! And there is almost no way (or at least no operationally easy way) to validate for a proper TLD with a regex, because you would have to write every possible TLD into the regex, ONE BY ONE, like this:

/^(https?\://)?(www\.)?([a-z0-9]([a-z0-9]|(\-[a-z0-9]))*\.)+(com|it|net|uk|de)$/i

(this last one, for instance, would validate only domains ending with .com/.net/.de/.it/.co.uk). New TLDs come out all the time, so you would have to adjust your regex every time a new TLD appears; that's a pain in the neck!

– Marco Demaio

You could use parse_url to break up the address into its components. While it's explicitly not built to validate a URL, analyzing the resulting components and matching them against your requirements would at least be a start.
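For example, here is a rough sketch of that idea (the `check_url` name and the specific checks are my own, based on the requirements in the question):

```php
<?php
// Sketch: parse_url() needs a scheme to find the host, so prepend
// one when missing, then test the components against the question's
// requirements. The host pattern here is illustrative, not complete.
function check_url($url)
{
    if (!preg_match('@^https?://@i', $url)) {
        $url = 'http://' . $url;
    }

    $parts = parse_url($url);
    if ($parts === false || !isset($parts['host'])) {
        return false; // seriously malformed URL
    }

    // Require a dotted host built from allowed domain characters.
    return (bool) preg_match('/^([a-z0-9-]+\.)+[a-z]+$/i', $parts['host']);
}

var_dump(check_url('https://stackoverflow.com/a?q=1')); // bool(true)
var_dump(check_url('stackoverflow.com'));               // bool(true)
var_dump(check_url('http://www'));                      // bool(false)
?>
```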

– Pekka

It may vary, but in most cases you don't really need to check the validity of a URL at all.

If it's vital information and you trust your users enough to let them provide it through a URL, you can trust them enough to provide a valid URL.

If it isn't vital information, then you just need to check for XSS attempts and display the URL the user entered.

You can manually prepend "http://" when you don't detect one, to avoid navigation problems.
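A minimal sketch of that last point (assuming only http and https are in play, per the question):

```php
<?php
// Sketch: prepend "http://" when no scheme is detected, so the link
// isn't treated as a relative path when displayed.
function normalize_url($url)
{
    if (!preg_match('@^https?://@i', $url)) {
        $url = 'http://' . $url;
    }
    return $url;
}

echo normalize_url('stackoverflow.com'), "\n";   // http://stackoverflow.com
echo normalize_url('https://example.com'), "\n"; // unchanged
?>
```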


I know I'm not giving you an alternative solution, but maybe the best way to solve both the performance and the validity problems is simply to avoid unnecessary checks.

– Colin Hebert