I wrote a basic regex to find a URL:
([a-zA-Z0-9]+\.|)([a-zA-Z0-9\-])+\.[a-z]+[a-zA-Z0-9\?\/\=\-\_]*
([a-zA-Z0-9]+\.|)
for an optional subdomain
([a-zA-Z0-9\-])+
for the hostname
\.[a-z]+
for the top-level domain (e.g. .com)
[a-zA-Z0-9\?\/\=\-\_]*
for the path
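To make the breakdown above easier to check, here is the same pattern written with re.VERBOSE so each part sits next to its description. This is only a sketch of the regex as described; the name URL_PATTERN is made up for this illustration and is not part of the original program.

```python
import re

# The same four parts as in the breakdown above, one per line.
# URL_PATTERN is a hypothetical name used only for this illustration.
URL_PATTERN = re.compile(
    r"""
    ([a-zA-Z0-9]+\.|)        # optional subdomain, e.g. "test."
    ([a-zA-Z0-9\-])+         # hostname, e.g. "google"
    \.[a-z]+                 # domain, e.g. ".com"
    [a-zA-Z0-9\?\/\=\-\_]*   # path (may be empty)
    """,
    re.VERBOSE,
)

# Sanity check: the pattern as a whole accepts a simple host name.
print(bool(URL_PATTERN.fullmatch("test.google.com")))  # True
```

With re.VERBOSE, whitespace and # comments outside character classes are ignored, so this should compile to the same expression as the one-line version.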
When I run this basic program:
import re

text = "test.google.com test.google.com"
# Raw string so the backslashes reach the regex engine unchanged
urls = re.findall(r"([a-zA-Z0-9]+\.|)([a-zA-Z0-9\-])+\.[a-z]+[a-zA-Z0-9\?\/\=\-\_]*", text)
print(urls)
I get this output:
[('test.', 'e'), ('test.', 'e')]
I expected each match to be the full URL (test.google.com), not a tuple of fragments. I assume the problem is in my regex, but what is it? Thanks!