I'm looking for a regular expression that matches anything that could be a valid RFC 1123 hostname inside a string that can contain arbitrary text. The idea is to extract every substring that satisfies all the requirements for a hostname, except for the maximum length of 255 characters, which is easy to check on the results afterwards.

I initially came up with:

/(^|[^a-z0-9-])([a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?(\.[a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?)*)([^a-z0-9-]|$)/i

While this matches some hostnames in parenthesized expression 2 (as intended), it seems to skip others. Searching Stack Overflow for the problem, I found this related question:

Regular expression to match DNS hostname or IP Address?

Judging by the positive feedback, the answer should be correct (although it doesn't verify label size), so I thought I'd give it a try. I converted its expression to an extractable format similar to my previous one:

/(^|[^a-z0-9-])((([a-z0-9]|[a-z0-9][a-z0-9-]*[a-z0-9])\.)*([a-z0-9]|[a-z0-9][a-z0-9-]*[a-z0-9]))([^a-z0-9-]|$)/i

Again, it should return the desired results in parenthesized expression 2, but it appears to skip some valid substrings. I believe there may be a problem with the way I'm checking for delimiters that are not part of the hostname.

Any ideas?

1 Answer

Figured it out. When scanning a string for sequential matches, using delimiters both before and after the desired expression means two characters must be consumed between each pair of hostnames: the trailing delimiter of one match and the leading delimiter of the next. So when hostnames are only one character apart, the second one is skipped!
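The effect can be reproduced with a small sketch. The input string and the helper name `extractWithBothDelimiters` are made up for illustration; the pattern is the original one from the question with the `g` flag added so it can be used for scanning.

```javascript
// Original pattern from the question: delimiter groups on BOTH sides.
// The "g" flag is added here so the pattern can scan for successive matches.
const bothDelimiters =
  /(^|[^a-z0-9-])([a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?(\.[a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?)*)([^a-z0-9-]|$)/gi;

// Hypothetical helper: returns capture group 2 (the hostname) of every match.
function extractWithBothDelimiters(text) {
  return [...text.matchAll(bothDelimiters)].map((m) => m[2]);
}

// Three candidates separated by a single space each.
// The space after "host1" is consumed as the trailing delimiter of the first
// match, so nothing is left to act as the leading delimiter of "host2".
console.log(extractWithBothDelimiters("host1 host2 host3"));
// logs host1 and host3; host2 is skipped
```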

To obtain correct results one must simply remove the leading delimiter:

/([a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?(\.[a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?)*)([^a-z0-9-]|$)/i

The leading delimiter is only necessary for validation, not scanning.
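As a minimal sketch of scanning with the corrected pattern, the snippet below collects capture group 1 from every match and then applies the 255-character limit mentioned in the question as a separate step. The helper name `extractHostnames` and the sample inputs are assumptions for illustration, and the `g` flag is added for scanning.

```javascript
// Pattern from the answer: no leading delimiter group, so only the single
// trailing delimiter character is consumed between consecutive matches.
const hostnamePattern =
  /([a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?(\.[a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?)*)([^a-z0-9-]|$)/gi;

// Hypothetical helper: scans `text` and returns every candidate hostname
// (capture group 1) whose total length does not exceed `maxLength`.
function extractHostnames(text, maxLength = 255) {
  const candidates = [...text.matchAll(hostnamePattern)].map((m) => m[1]);
  return candidates.filter((host) => host.length <= maxLength);
}

console.log(extractHostnames("host1 host2 host3"));
// logs host1, host2 and host3

console.log(extractHostnames("a.example,b.example"));
// logs a.example and b.example
```

One caveat worth checking against your own input: with no leading check at all, a match can begin immediately after a hyphen, so the input `-foo` yields `foo`.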