cannot use a string pattern on a bytes-like object (Python)

Question

I'm creating a crawler in Python to list all the links in a website, but I'm getting an error and I can't see what causes it. The error is:

Traceback (most recent call last):
  File "vul_scanner.py", line 8, in <module>
    vuln_scanner.crawl(target_url)
  File "C:\Users\Lenovo x240\Documents\website\website\spiders\scanner.py", line 18, in crawl
    href_links= self.extract_links_from(url)
  File "C:\Users\Lenovo x240\Documents\website\website\spiders\scanner.py", line 15, in extract_links_from
    return re.findall('(?:href=")(.*?)"', response.content)
  File "C:\Users\Lenovo x240\AppData\Local\Programs\Python\Python38\lib\re.py", line 241, in findall
    return _compile(pattern, flags).findall(string)
TypeError: cannot use a string pattern on a bytes-like object

My code is as follows. In the scanner.py file:

# To ignore numpy errors:
#     pylint: disable=E1101
import urllib
import requests
import re
from urllib.parse import urljoin

class Scanner:
    def __init__(self, url):
        self.target_url = url
        self.target_links = []

    def extract_links_from(self, url):
        response = requests.get(url)
        return re.findall('(?:href=")(.*?)"', response.content)

    def crawl(self, url):
        href_links= self.extract_links_from(url)
        for link in href_links:
            link = urljoin(url, link)   

            if "#" in link:
                link = link.split("#")[0]

            if self.target_url in link and link not in self.target_links:
                self.target_links.append(link)
                print(link)
                self.crawl(link)

In the vul_scanner.py file:

import scanner
# To ignore numpy errors:
#     pylint: disable=E1101


target_url = "https://www.amazon.com"
vuln_scanner = scanner.Scanner(target_url)
vuln_scanner.crawl(target_url)

The command I run is: python vul_scanner.py

sharing the full error message might help people to answer your question — Patrick Beynio, Oct 26 '20 at 15:02

score 0 · Accepted Answer · answered Oct 26 '20 at 13:26

return re.findall('(?:href=")(.*?)"', response.content)

response.content in this case is a bytes object, while your pattern is a str, and re will not match a str pattern against bytes. So either you use response.text, which gives you decoded text that you can process as you plan on doing now, or you can check this out:

Regular expression parsing a binary file?

That is in case you want to continue down the binary road, i.e. keep response.content and use a bytes pattern instead.
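
To make the two options concrete, here is a minimal sketch. It does not hit the network: html_bytes is a made-up sample standing in for response.content, and the variable names are only for illustration. Decoding as UTF-8 is an assumption too (response.text picks the encoding for you).

```python
import re

# Hypothetical sample standing in for response.content (a bytes object).
html_bytes = b'<a href="/about">About</a> <a href="https://example.com/x#top">X</a>'

# Option 1: work on text. Decoding here plays the role of response.text,
# so the str pattern from the question works unchanged.
html_text = html_bytes.decode("utf-8")  # assumption: the page is UTF-8
links_text = re.findall(r'(?:href=")(.*?)"', html_text)

# Option 2: stay with bytes. The pattern must then be bytes as well (rb'...'),
# and every match comes back as bytes, so decode before passing to urljoin.
raw_matches = re.findall(rb'(?:href=")(.*?)"', html_bytes)
links_from_bytes = [match.decode("utf-8") for match in raw_matches]

print(links_text)
print(links_from_bytes)
```

In the Scanner class from the question, option 1 amounts to replacing response.content with response.text in extract_links_from; the rest of crawl can probably stay as it is, since urljoin and the "in" checks then all operate on str.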

Cheers
