0

New guy here. I am writing a web scraper as an exercise and have run into a problem extracting a URL for re-use. I managed to get the URL, but when I print it, it still shows the [' '] around it (for example: ['http://123.com']), so it cannot be used as an input.

I am extracting the string using re.findall. I then tried .strip and .replace, but I either get a traceback or the output remains the same. Any suggestions, please?

Extract:

z = re.findall(r'(?=htt).*?(?<=htm)', y)
h = str(z)
h = h.strip('\['"')
print(h)
khelwood
  • 3
    `['http://123.com']`—That's a string inside a list. The string itself does not contain the brackets or quotes. – khelwood Nov 23 '20 at 09:15
  • 2
    `z` is returning a list; if u just want the first element access by `z[0]` – Serial Lazer Nov 23 '20 at 09:17
  • 1
    Note that the entire purpose of ``re.findall`` is to find *several* matches. If you want only one matched string, use ``re.search`` or ``re.match``. – MisterMiyagi Nov 23 '20 at 09:17

2 Answers

1

re.findall returns a list. Lists don't have strip or replace methods. Access the element of the list by using z[0]. You could also use re.search if you're only looking for one string.
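
A minimal sketch of both options mentioned above. The input string `y` is a hypothetical stand-in for the scraped tag text, and the names `first_url` and `url` are only illustrative; the pattern is the one from the question.

```python
import re

# Hypothetical input standing in for the scraped tag text from the question.
y = '<a href="http://123.com/page.htm">link</a>'
pattern = r'(?=htt).*?(?<=htm)'

# re.findall always returns a list of strings, even when there is one match.
z = re.findall(pattern, y)

# Option 1: take the first element, guarding against an empty list.
first_url = z[0] if z else None

# Option 2: re.search returns a single match object, or None if nothing matched.
match = re.search(pattern, y)
url = match.group(0) if match else None

print(first_url)
print(url)
```

Both variables hold a plain string, so `.strip` and `.replace` are available on them if further cleanup is needed.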

Ben
0

As the comments point out, you can simply iterate over the list to access its elements:

for i in z:
    print(i)

You can substitute other methods instead of the print statement.
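
As one possible illustration of substituting something else for the print call, the sketch below passes each matched string to `urllib.parse.urlparse` from the standard library. The input `y` and the `hosts` list are assumptions made up for this example, not part of the original question.

```python
import re
from urllib.parse import urlparse

# Hypothetical input with two links, standing in for the scraped text.
y = '<a href="http://123.com/a.htm">a</a> <a href="http://123.com/b.htm">b</a>'
z = re.findall(r'(?=htt).*?(?<=htm)', y)

hosts = []
for url in z:
    # Each element is already a plain string, so it can be used directly.
    hosts.append(urlparse(url).netloc)

print(hosts)
```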

Gauthum.J
  • Hello, I actually found a method. I do not know if it is the correct one, but it works: `y = str(tag); z = re.findall(r'(?=htt).*?(?<=htm)', y); e = str(z); f = len(e); f = int(f); h = f-2; r = e[2:h]; print(r)`. I reasoned that if I find the length of the string (which varies) and then print only the range I need, it works. As I said, I do not know if it is the correct method, but it worked. – Christian Grech Nov 24 '20 at 10:55
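
A short sketch of why the slicing approach in the last comment happens to work for a single match, and where it stops working. The lists `single` and `many` are made-up examples; `e[2:h]` with `h = len(e) - 2` is equivalent to the `[2:-2]` slice used here.

```python
# str() on a list produces its printed form, brackets and quotes included.
single = ['http://123.com']
sliced_single = str(single)[2:-2]   # drops "['" at the start and "']" at the end

# For one match the result equals the element itself, so indexing is simpler.
print(sliced_single == single[0])

# With more than one match, the quotes and comma between items survive the slice.
many = ['http://123.com/a.htm', 'http://123.com/b.htm']
sliced_many = str(many)[2:-2]
print(sliced_many)
print(many[0])
```

Indexing with `z[0]` avoids the round trip through `str()` and keeps working when the list holds several matches.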