I made a basic regex to find a URL:

([a-zA-Z0-9]+\.|)([a-zA-Z0-9\-])+\.[a-z]+[a-zA-Z0-9\?\/\=\-\_]*

([a-zA-Z0-9]+\.|) for a subdomain
([a-zA-Z0-9\-])+ for the hostname
\.[a-z]+ for the domain
[a-zA-Z0-9\?\/\=\-\_]* for the path

When I run this basic program

import re

text = "test.google.com test.google.com"
urls = re.findall(r"([a-zA-Z0-9]+\.|)([a-zA-Z0-9\-])+\.[a-z]+[a-zA-Z0-9\?\/\=\-\_]*", text)
print(urls)

I get this output: [('test.', 'e'), ('test.', 'e')]

I assume it has something to do with my regex, but what? Thanks!

wtreston

2 Answers

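The parentheses denote capture groups, and when a pattern contains groups, re.findall returns the groups rather than the whole match. The second group is repeated by the + outside its parentheses, so it keeps only its last iteration: the final "e" of "google".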
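
To see the difference, here is a minimal sketch comparing findall with and without groups. The two simplified patterns are my own, cut down from the one in the question purely for illustration.

```python
import re

text = "test.google.com test.google.com"

# No groups: findall returns each whole match as a string.
no_groups = re.findall(r"[a-z]+\.[a-z]+\.[a-z]+", text)
print(no_groups)    # ['test.google.com', 'test.google.com']

# With groups: findall returns one tuple of group contents per match.
# The second group sits under a "+", so only its last iteration survives.
with_groups = re.findall(r"([a-z]+\.)([a-z])+\.[a-z]+", text)
print(with_groups)  # [('test.', 'e'), ('test.', 'e')]
```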

Kevin Glasson

When the pattern contains capturing groups, re.findall returns only the captured text. Removing the groups, or turning them into non-capturing groups, makes it return the whole match instead:

(?:[a-zA-Z0-9]+\.)?[a-zA-Z0-9\-]+\.[a-z]+[a-zA-Z0-9\?\/\=\-\_]*

https://regex101.com/r/efXF9D/1/
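
Run against the text from the question, the non-capturing version should give the full URLs. The sketch below also shows one alternative, assuming you would rather keep the original groups: re.finditer with group(0) returns the whole match regardless of groups. The variable names are just for illustration.

```python
import re

text = "test.google.com test.google.com"

# Non-capturing pattern: findall returns the whole match.
pattern = r"(?:[a-zA-Z0-9]+\.)?[a-zA-Z0-9\-]+\.[a-z]+[a-zA-Z0-9\?\/\=\-\_]*"
urls = re.findall(pattern, text)
print(urls)  # ['test.google.com', 'test.google.com']

# Alternative: keep the question's pattern and read group(0) from each match.
original = r"([a-zA-Z0-9]+\.|)([a-zA-Z0-9\-])+\.[a-z]+[a-zA-Z0-9\?\/\=\-\_]*"
whole_matches = [m.group(0) for m in re.finditer(original, text)]
print(whole_matches)  # ['test.google.com', 'test.google.com']
```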

Or, if you want to capture each part separately, give each part its own capturing group:

(?:([a-zA-Z0-9]+)\.)?([a-zA-Z0-9\-]+)\.([a-z]+)([a-zA-Z0-9\?\/\=\-\_]*)

https://regex101.com/r/efXF9D/2/
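
As a rough sketch of what the per-part version returns: the sample text here is my own (the second URL is made up to exercise the path group), and an optional group that does not take part in a match comes back as an empty string.

```python
import re

# Hypothetical sample text: one URL with a subdomain, one with a path.
text = "test.google.com google.com/search?q=regex"

pattern = r"(?:([a-zA-Z0-9]+)\.)?([a-zA-Z0-9\-]+)\.([a-z]+)([a-zA-Z0-9\?\/\=\-\_]*)"
parts = re.findall(pattern, text)

# Each tuple is (subdomain, hostname, domain, path).
for subdomain, hostname, domain, path in parts:
    print(repr(subdomain), repr(hostname), repr(domain), repr(path))
# 'test' 'google' 'com' ''
# '' 'google' 'com' '/search?q=regex'
```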

Avinash Raj