
I am trying to validate only the host part of a URL, not the entire URL.

So in the case of http://www.example.com/some/path, all I want to validate is 'www.example.com'.

I have the following regex, `^((?:&#|[[:alnum:]]|[\-_])(?:&#|[[:alnum:]]|[\-\._~\?#\[\]@!$&'\(\)\*\+,;=])*(?::[0-9]{2,})?)$`, and it works well for the ASCII cases I have tried (including http://localhost and so on).

Checking it against the list at https://mathiasbynens.be/demo/url-regex, everything works except for 'sites' like http://उदाहरण.परीक्षा and http://⌘.ws (are those actually allowed?).

If such host names are valid, what regex could I use, over and above [[:alnum:]], to validate host names like उदाहरण.परीक्षा and उदाहरण.परीक्षा:80?
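
One possible approach, sketched here in Python (an assumption, since no language is named above), is to avoid matching non-ASCII characters with the regex at all: convert the host to its ASCII (Punycode) form first, then check each label against the usual letters-digits-hyphen rule. The function name `is_valid_host`, the label rule, and the port limits are illustrative choices, not something taken from the regex above; note that this sketch rejects underscores and does not handle IPv6 literals.

```python
import re

# Assumed rule for one DNS label: 1-63 chars of letters, digits, hyphens,
# with no leading or trailing hyphen.
_LABEL = re.compile(r"^(?!-)[A-Za-z0-9-]{1,63}(?<!-)$")
_PORT = re.compile(r"^[0-9]{1,5}$")


def is_valid_host(host_port: str) -> bool:
    """Check 'host' or 'host:port'; the host may contain non-ASCII labels."""
    host, sep, port = host_port.partition(":")
    if sep:
        # A colon was present, so a numeric port in 0-65535 is required.
        if not _PORT.match(port) or int(port) > 65535:
            return False
    try:
        # The standard-library "idna" codec converts each label to
        # its ASCII form (xn--...), raising UnicodeError on bad labels.
        ascii_host = host.encode("idna").decode("ascii")
    except UnicodeError:
        return False
    if not ascii_host or len(ascii_host) > 253:
        return False
    return all(_LABEL.match(label) for label in ascii_host.split("."))


if __name__ == "__main__":
    for candidate in ["www.example.com", "localhost",
                      "उदाहरण.परीक्षा:80", ".example.com", "exa mple.com"]:
        print(candidate, is_valid_host(candidate))
```

The design choice here is that the regex only ever sees ASCII, so `[[:alnum:]]`-style classes do not need to be extended to cover combining marks and other script-specific characters.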

FFMG
  • Does it need to be grammatically perfect (say for validation) or can you get away with something simple like: `[^:/]+` for general extraction? – Galik Jul 20 '16 at 09:14
  • I don't think it has to be perfect as such, but it has to be fairly robust I would say. Your regex would return true for spaces, and dots as a first character, (to name just a few). – FFMG Jul 20 '16 at 09:56
  • On the page you linked there are a number of regexes that work for your example, why can't you use those? – Galik Jul 20 '16 at 09:58
  • In the past I've generally used the regex documented in the [RFC](https://www.ietf.org/rfc/rfc3986.txt) (see Appendix B). – G.M. Jul 20 '16 at 10:08
  • Those regexes are for full url validation, (and are flaky anyway), I just want to validate the host, (and the port), hence the reason I want to know how to match characters like `उदाहरण.परीक्षा` – FFMG Jul 20 '16 at 10:08
  • @G.M., I saw that RFC, but `[^\/?#]+` would suggest that hosts like `...sss!%^@&(*&(*()(!)@(_+_` are valid (and it even allows multi-line input). – FFMG Jul 20 '16 at 10:15
  • @FFMG -- Just to clarify. The `[^/?#]+` component of the regex captures the `authority` rather than just the `host`. The authority has the general form `[ userinfo "@" ] host [ ":" port ]`, which is at least part of the reason it accepts the text you mention. Hence some more work may be required to identify the `host` only. However, I do get the feeling this may be sending you down the wrong route anyway... – G.M. Jul 20 '16 at 10:50
  • @G.M., Yes, I agree (about the authority), but that regex would still let me capture invalid characters even if I were to put aside the port/user/password. – FFMG Jul 20 '16 at 12:06
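
To make the authority/host distinction from the comments concrete, here is a minimal Python sketch that applies the RFC 3986 Appendix B regex and then splits the captured authority into `[ userinfo "@" ] host [ ":" port ]`. The helper name `split_authority` is hypothetical, and the sketch assumes the host is not a bracketed IPv6 literal. It only separates the parts; as noted above, it does not by itself reject invalid host characters.

```python
import re

# Regex from RFC 3986, Appendix B; group 4 is the authority component.
URI_RE = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")


def split_authority(url: str):
    """Return (userinfo, host, port); missing parts are None."""
    match = URI_RE.match(url)  # always matches: every group is optional
    authority = match.group(4) or ""
    # userinfo ends at the last "@" in the authority.
    userinfo, at, hostport = authority.rpartition("@")
    # Assumption: no IPv6 literal, so the first ":" starts the port.
    host, colon, port = hostport.partition(":")
    return (userinfo if at else None, host, port if colon else None)


if __name__ == "__main__":
    print(split_authority("http://www.example.com/some/path"))
    print(split_authority("http://user:pw@www.example.com:8080/p?q=1#frag"))
```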
