Honestly, trying to find a solution to this problem has been driving me insane, because every answer I find is either about using regex to truncate a string or about regex patterns having a maximum length (in which case, shouldn't it throw an error rather than truncate the pattern string?).

Anyway, I'm using a regex pattern supplied by my employer. The intent is to match only the domain name in any URL string (for example, python.org from https://docs.python.org/3/howto/regex.html). I've seen recommendations to use urllib.parse, but it returns the full hostname, subdomain included (docs.python.org rather than python.org). Here is the regex string I was given to use:

\b(([a-zA-Z0-9\-_]+)\.)+
(?!exe|php|dll|doc|docx|txt|rtf|odt|xls|xlsx|ppt|pptx|bin|
pcap|ioc|pdf|mdb|asp|html|xml|jpg|gif|png|lnk|log|vbs|lco|bat|shell|quit|pdb|vbp|
bdoda|bsspx|save|cpl|wav|tmp|close|py|ico|ini|sleep|run|dat|scr|jar|jxr|apt|w32|css|
js|xpi|class|apk|rar|zip|hlp|tmp|cpp|crl|cfg|cer|plg|tmp)([a-zA-Z]{2,5}|support|report|
i2p|technology|xn--p1ai|com#|moscow|technology)

It's very long. If I paste it into a regex checker such as https://pythex.org, it happily tells me that it works perfectly. However, if I compile it in a Python shell or the plain Python interpreter and then display the compiled pattern, I get this:

re.compile('\\b(([a-zA-Z0-9\\-_]+)\\.)+(?!exe|php|dll|doc|docx|txt|rtf|odt|xls|xlsx|
ppt|pptx|bin|pcap|ioc|pdf|mdb|asp|html|xml|jpg|gif|png|lnk|log|vbs|lco|bat|shell|quit|
pdb|vbp|bdoda|bsspx|save|cpl|wav|tmp|clos)

Can someone tell me why it's being truncated (for my own knowledge), and suggest a better way to do things? The goal is to do something like this:

https://docs.python.org/3/library/socket.html -> python.org
www.example.info                              -> example.info
docs.google.com                               -> google.com
K. Whitt
  • 1
    Out[227]: `re.compile(r'\x08(([a-zA-Z0-9\-_]+)\.)+(?!exe|php|dll|doc|docx|txt|rtf|odt|xls|xlsx|ppt|pptx|bin|pcap|ioc|pdf|mdb|asp|html|xml|jpg|gif|png|lnk|log|vbs|lco|bat|shell|quit|pdb|vbp|bdoda|bsspx|save|cpl|wav|tmp|close|py|ico|ini|sleep|run|dat|scr|jar|jxr|apt|w32|css|js|xpi|class|apk|rar|zip|hlp|tmp|cpp|crl|cfg|cer|plg|tmp)([a-zA-Z]{2,5}|support|report|i2p|technology|xn--p1ai|com#|moscow|technology)', re.UNICODE)` . You sure you didn't make a typo somewhere? – Uvar Nov 08 '17 at 19:17
  • 2
    In this case use urllib and build the code to strip the domain name the way you want. But please stop this cheat. – Casimir et Hippolyte Nov 08 '17 at 19:21
  • 2
    The pattern's string representation may be truncated, but the pattern still works as expected. Have you actually used it? – Aran-Fey Nov 08 '17 at 19:22
  • Please calm down, folks. I did not write the regex. But that's a good point about using urllib and then building from there. – K. Whitt Nov 08 '17 at 19:31
  • As for typos, I definitely could have made a mistake. It's long, and keeping track of things is very difficult in it. That's one reason I was looking for another solution. Maybe it has to do with a hard wrap length in my IDE? – K. Whitt Nov 08 '17 at 19:33
  • As far as this goes, I've found the tldextract library, which works wonderfully for my purpose. – K. Whitt Nov 08 '17 at 19:44

1 Answer

Can someone tell me why it's being truncated (for my own knowledge), and suggest a better way to do things?

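The pattern itself is not being truncated; only its printed representation is. As Aran-Fey's comment says, the repr of a compiled pattern in CPython shows at most the first 200 characters of the pattern string, which is why your output stops at "clos" and has no closing quote. The compiled object still holds the whole expression, and its .pattern attribute returns it in full. Python's re module does have real limits, and the questions linked here cover cases where one is reached, but that is not what is happening in your case.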
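
To check whether the pattern itself or only its printed form is cut short, compare the repr of the compiled object with its .pattern attribute. This is a minimal sketch: `long_pattern` is a made-up pattern built just to be long, not the regex from the question.

```python
import re

# Hypothetical long pattern: 100 alternatives, far longer than what repr() displays.
long_pattern = r"\b(?:" + "|".join(f"word{i}" for i in range(100)) + r")\b"
compiled = re.compile(long_pattern)

# The displayed form is shortened...
print(repr(compiled))

# ...but the stored pattern is complete.
print(compiled.pattern == long_pattern)

# The last alternative, well past the point where the display stops, still matches.
print(bool(compiled.search("xx word99 yy")))
```

If the last two lines print True, the regex engine received the whole pattern and only the display was shortened.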

suggest a better way to do things?

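Casimir's comment is right, though: urllib.parse's urlparse is a much neater starting point than this regex. On its own it gives you the full hostname (docs.python.org), so you still have to strip the subdomain yourself.

The full solution is probably a combination of urlparse and some rule for deciding which trailing labels are the suffix (what the regex treats as an extension) and which are the domain. This may help: Get root domain.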
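
A rough sketch of the urlparse route, using only the standard library. The function name `naive_root_domain` and the "keep the last two labels" rule are assumptions made for illustration; the rule is wrong for multi-part suffixes such as co.uk, which is the case a Public Suffix List based tool like tldextract handles.

```python
from urllib.parse import urlparse


def naive_root_domain(url: str) -> str:
    """Return the last two labels of the hostname (naive: breaks on co.uk etc.)."""
    # urlparse only recognises the host if there is a scheme or a leading "//".
    if "://" not in url:
        url = "//" + url
    host = urlparse(url).hostname or ""
    labels = host.split(".")
    # Assumption: the suffix is exactly one label long.
    return ".".join(labels[-2:])


for example in (
    "https://docs.python.org/3/library/socket.html",
    "www.example.info",
    "docs.google.com",
):
    print(example, "->", naive_root_domain(example))
```

For the three inputs in the question this yields python.org, example.info and google.com. Anything that has to be correct across arbitrary suffixes needs a suffix list rather than this fixed two-label rule.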

HSchmachty
  • Hey, thanks! I'll have to read those links. Regex has always been a confusing beast for me. As far as the parsing itself, I found a perfect solution in tldextract. It separates any url into subdomain, domain, and suffix reliably. – K. Whitt Nov 08 '17 at 20:02
  • Glad you found the perfect solution! I'll have to check it out too. – HSchmachty Nov 08 '17 at 21:21