I'm looking for a regular expression that matches anything that could be a valid RFC 1123 hostname inside a string of arbitrary content. The idea is to extract every substring that satisfies all the requirements for a hostname, except for the maximum length of 255 characters, which is easy to check on the results afterwards.
I initially came up with:
/(^|[^a-z0-9-])([a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?(\.[a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?)*)([^a-z0-9-]|$)/i
While this matches some hostnames in parenthesized expression 2 (as intended), it seems to skip others. Looking the problem up on Stack Overflow, I found this related question:
Regular expression to match DNS hostname or IP Address?
Judging by the positive feedback, the answer should be correct (although it doesn't verify label length), so I thought I'd give it a try. I converted its expression to an extractable format similar to my previous one:
/(^|[^a-z0-9-])((([a-z0-9]|[a-z0-9][a-z0-9-]*[a-z0-9])\.)*([a-z0-9]|[a-z0-9][a-z0-9-]*[a-z0-9]))([^a-z0-9-]|$)/i
Again, it should return the desired results in parenthesized expression 2, but it appears to skip some valid substrings. I believe there may be a problem with the way I'm checking for delimiters that are not part of the hostname.
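To illustrate the suspected delimiter problem, here is a minimal reproduction sketched in Python (chosen only for illustration; the sample text and the names LABEL, HOSTNAME, CONSUMING and LOOKAROUND are made up for this example). It uses my first expression unchanged, and for comparison a variant that asserts the delimiters with lookarounds instead of matching them. The lookaround variant is an assumption on my part, not a verified fix.

```python
import re

# One hostname label: 1 to 63 characters, no leading or trailing hyphen.
LABEL = r"[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?"
# One or more labels separated by dots.
HOSTNAME = rf"{LABEL}(?:\.{LABEL})*"

# Original approach: the delimiters are part of the match and get consumed.
CONSUMING = re.compile(
    rf"(^|[^a-z0-9-])({HOSTNAME})([^a-z0-9-]|$)", re.IGNORECASE
)

# Hypothetical alternative: lookarounds only assert that no hostname
# character is adjacent, without consuming the delimiter itself.
LOOKAROUND = re.compile(
    rf"(?<![a-z0-9-])({HOSTNAME})(?![a-z0-9-])", re.IGNORECASE
)

# Made-up sample: four candidates, each separated by a single space.
text = "alpha beta gamma.example.com delta"

consumed = [m.group(2) for m in CONSUMING.finditer(text)]
asserted = [m.group(1) for m in LOOKAROUND.finditer(text)]

print(consumed)  # ['alpha', 'gamma.example.com']
print(asserted)  # ['alpha', 'beta', 'gamma.example.com', 'delta']
```

In this sample the space after "alpha" is consumed as the trailing delimiter, so "beta" has no leading delimiter left to match and is skipped; the same happens to "delta". That would explain why every second candidate goes missing when only one character separates them.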
Any ideas?