1

Update. Assume that the domain name is the last two items of the host name, except when the second-to-last item is co or com, in which case the domain name is the last three items. If there is just one item, it is the domain name.

These are the minimum cases to handle:

http://google.com          -> google.com
http://www.google.com      -> google.com
http://abc.cde.google.com  -> google.com
http://google.co.uk        -> google.co.uk
http://www.google.com.au   -> google.com.au
http://www.mysite.info     -> mysite.info
http://www.mysite.business -> mysite.business
http://localhost           -> localhost
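
For reference, the rule above can be sketched without a regex by splitting the host name on dots. This is only a minimal sketch in Python: the function name `domain_from_host` is hypothetical, and it assumes the input is a bare host name with the scheme already removed.

```python
def domain_from_host(host):
    """Apply the question's heuristic to a bare host name (no scheme, port, or path)."""
    labels = host.split(".")
    # Keep three labels when the second-to-last one is "co" or "com", otherwise two.
    keep = 3 if len(labels) >= 2 and labels[-2] in ("co", "com") else 2
    # Slicing past the start of the list is safe, so "localhost" comes back unchanged.
    return ".".join(labels[-keep:])


for host in ("abc.cde.google.com", "www.google.com.au", "localhost"):
    print(host, "->", domain_from_host(host))
```

This hardcodes the co/com assumption exactly as stated, so it shares the heuristic's limits for suffixes that do not follow that pattern.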

Regex sandbox for this question

Here are the tests and a starting regexp: https://regex101.com/r/AyuW88/3

As a bonus, a few more cases (but I would already be very happy if the regex worked with just the former cases):

http://google.com:8080      -> google.com
http://www.google.com?q=abc -> google.com
http://www.google.com/smth  -> google.com
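
One possible way to handle these bonus cases outside the regex is to let a URL parser remove the port, query, and path first. The sketch below assumes Python and its standard `urllib.parse.urlsplit`; the helper name `host_from_url` is hypothetical.

```python
from urllib.parse import urlsplit


def host_from_url(url):
    """Return only the host part of a URL, without scheme, port, path, or query."""
    # .hostname excludes the port and is lowercased by the standard library.
    return urlsplit(url).hostname


for url in (
    "http://google.com:8080",
    "http://www.google.com?q=abc",
    "http://www.google.com/smth",
):
    print(url, "->", host_from_url(url))
```

This only isolates the host name; the last-two-or-three-items rule from the question still has to be applied to turn www.google.com into google.com.
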
Alexei Vinogradov
  • Alexei, I think you have a problem: I don't think it's possible for a simple regex to differentiate between a domain like google.co.uk (which you want in full) and a domain like cde.google.com (where you want only google.com) without some domain knowledge (no pun intended) – Ass3mbler Jan 05 '19 at 00:12
  • I agree. The example regex assumes that an ending of the form xx(x).yy(y) identifies TLDs like co.uk, com.au, etc. But there are of course "normal" two- and three-letter domains, like ya.ru and gmx.de. I think I'll update my question to make it solvable. Thanks for your remark – Alexei Vinogradov Jan 05 '19 at 16:42

2 Answers

1

This should work for your simple cases:

 r'([^\/\.]+\.(com|co)\.\w+|[^\/\.]+.\w+)$'

The domain is captured in group 1. Your assumption "except the second is co or com" is hardcoded in the regex. Also, there is a typo in the line:

http://www.google.com.au   -> google.com.ua

It should be "google.com.au".
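
As a rough check, the pattern can be run against the question's minimum cases. The sketch below assumes Python's `re` module and uses `search` so the `$` anchor does the positioning; the names `DOMAIN_RE` and `domain_from_url` are hypothetical. The pattern is copied as posted, including the unescaped `.` in the second alternative, which matches any character and appears to be what lets the single-item `localhost` case through.

```python
import re

# Pattern exactly as posted in the answer above (not modified here).
DOMAIN_RE = re.compile(r'([^\/\.]+\.(com|co)\.\w+|[^\/\.]+.\w+)$')


def domain_from_url(url):
    """Return group 1 of the first match, or None when nothing matches."""
    match = DOMAIN_RE.search(url)
    return match.group(1) if match else None


cases = {
    "http://google.com": "google.com",
    "http://www.google.com": "google.com",
    "http://abc.cde.google.com": "google.com",
    "http://google.co.uk": "google.co.uk",
    "http://www.google.com.au": "google.com.au",
    "http://www.mysite.info": "mysite.info",
    "http://www.mysite.business": "mysite.business",
    "http://localhost": "localhost",
}

for url, expected in cases.items():
    print(url, "->", domain_from_url(url), "(expected", expected + ")")
```

Only the minimum cases are exercised here; the bonus cases with a port, query, or path are outside what this pattern is meant to cover.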

user2468968
  • This is not how second-level domains work and this regex is easily broken. There are something like 1500 combinations of sublevel domains and I've yet to find a regex capable of distinguishing between a domain with a second-level domain and a domain with multiple sub domains. This is an incredibly difficult regex to write considering the vast amount of second-level domains you'd have to account for (assuming you wanted one that accounted for all of them). – GroggyOtter Jan 20 '23 at 23:23
-1

This regex should address your use case.

Regex: (?<=http(s)?:\/\/).*

Explanation:
(?<=http(s)?:\/\/): Positive lookbehind; asserts that the match is immediately preceded by http:// or https://.
.*: Captures everything after that.
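
To see what this pattern returns, here is a minimal sketch in Python. It is an adaptation, not the regex as posted: Python's `re` module accepts only fixed-width lookbehind, so the optional `s` is assumed to be expressed as two separate lookbehinds. The names `AFTER_SCHEME_RE` and `after_scheme` are hypothetical.

```python
import re

# Two fixed-width lookbehinds stand in for (?<=http(s)?:\/\/),
# because Python's re rejects a lookbehind of variable width.
AFTER_SCHEME_RE = re.compile(r'(?:(?<=http://)|(?<=https://)).*')


def after_scheme(url):
    """Return everything that follows http:// or https://, or None if neither is present."""
    match = AFTER_SCHEME_RE.search(url)
    return match.group(0) if match else None


for url in ("http://www.google.com", "https://google.co.uk/x", "localhost"):
    print(url, "->", after_scheme(url))
```

The result is the whole remainder of the URL, so subdomains, ports, and paths are still included and would need further trimming to reach the outputs listed in the question.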

Link: https://regex101.com/r/fX1fI5/130

Hope this helps.

Deep