Honestly, trying to find a solution to this problem has been driving me insane, because every answer I find is either about using a regex to truncate a string or about regex patterns having a maximum length (in which case, shouldn't it raise an error rather than truncate the pattern string?).
Anyway, I'm using a regex pattern supplied by my employer. The intent is to match only the host name, without any subdomain, in any URL string (for example, python.org from https://docs.python.org/3/howto/regex.html). I've seen recommendations to use urllib.parse, but it doesn't strip off the subdomain when there is one. Here is the regex string I was given to use:
\b(([a-zA-Z0-9\-_]+)\.)+
(?!exe|php|dll|doc|docx|txt|rtf|odt|xls|xlsx|ppt|pptx|bin|
pcap|ioc|pdf|mdb|asp|html|xml|jpg|gif|png|lnk|log|vbs|lco|bat|shell|quit|pdb|vbp|
bdoda|bsspx|save|cpl|wav|tmp|close|py|ico|ini|sleep|run|dat|scr|jar|jxr|apt|w32|css|
js|xpi|class|apk|rar|zip|hlp|tmp|cpp|crl|cfg|cer|plg|tmp)([a-zA-Z]{2,5}|support|report|
i2p|technology|xn--p1ai|com#|moscow|technology)
It's very long. If I paste it into a regex checker such as https://pythex.org, it happily tells me that it works perfectly. However, if I compile it in Python (in either a Python shell or the interpreter) and then display the compiled pattern, I get this:
re.compile('\\b(([a-zA-Z0-9\\-_]+)\\.)+(?!exe|php|dll|doc|docx|txt|rtf|odt|xls|xlsx|
ppt|pptx|bin|pcap|ioc|pdf|mdb|asp|html|xml|jpg|gif|png|lnk|log|vbs|lco|bat|shell|quit|
pdb|vbp|bdoda|bsspx|save|cpl|wav|tmp|clos)
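One way to check whether the pattern itself is being cut short, or only the way it is displayed, is to compare the compiled object's .pattern attribute with the original string. A minimal sketch, assuming CPython; LONG_PATTERN below is a made-up stand-in of similar length, not the pattern from my employer:

```python
import re

# Hypothetical stand-in for a long pattern (not the employer-supplied regex):
# 100 alternatives, well over 200 characters in total.
LONG_PATTERN = r"\b(?:" + "|".join(f"word{i}" for i in range(100)) + r")\b"

compiled = re.compile(LONG_PATTERN)

# repr() is what the interactive shell echoes back. On CPython it appears to
# show only the first 200 characters of the pattern's repr, so it comes out
# shorter than the pattern itself.
print(len(LONG_PATTERN), len(repr(compiled)))

# The .pattern attribute holds the string that was actually compiled.
print(compiled.pattern == LONG_PATTERN)

# An alternative near the very end of the pattern can still match.
print(compiled.search("see word99 here") is not None)
```

If the comparison prints True, the compiled pattern is intact and only its displayed form is abbreviated.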
Can someone tell me why it's being truncated (for my own knowledge), and suggest a better way to do things? The goal is to do something like this:
https://docs.python.org/3/library/socket.html -> python.org
www.example.info -> example.info
docs.google.com -> google.com
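
To make the goal concrete, here is a rough sketch of that mapping using urllib.parse instead of the regex. It rests on a loud assumption that is mine, not my employer's: the part I want is always the last two dot-separated labels of the hostname, which is wrong for suffixes such as co.uk (handling those would need something like the Public Suffix List). The function name registered_domain is made up for illustration:

```python
from urllib.parse import urlparse


def registered_domain(url: str) -> str:
    """Return the last two labels of the hostname (naive assumption)."""
    # urlparse only fills in the network location when the string has a
    # scheme or starts with "//", so bare names like "www.example.info"
    # need the "//" prefix to be parsed as a host rather than a path.
    if "://" not in url:
        url = "//" + url
    host = urlparse(url).hostname or ""
    # ASSUMPTION: the wanted domain is exactly the last two labels.
    return ".".join(host.split(".")[-2:])


for example in (
    "https://docs.python.org/3/library/socket.html",
    "www.example.info",
    "docs.google.com",
):
    print(example, "->", registered_domain(example))
```

For the three inputs listed above, this prints python.org, example.info, and google.com respectively.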