
There's a website that I need to crawl. I have no financial purpose; it's just for study.

I checked its robots.txt, and it was as follows:

User-agent: *

Allow: /

Disallow: /*.notfound.html

Can I crawl this website using requests and BeautifulSoup?

I found that crawling without a header causes a 403 error. Does this mean that crawling is not allowed?
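For reference, rules like these can be sanity-checked with Python's standard urllib.robotparser. A minimal sketch (the example.com URL is a placeholder, and note that the stdlib parser does plain prefix matching, so it does not understand the * wildcard in the Disallow line):

from urllib.robotparser import RobotFileParser

# Feed the quoted rules in directly instead of downloading robots.txt.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Allow: /",
    "Disallow: /*.notfound.html",
])

# Ordinary pages are covered by "Allow: /", so this prints True.
print(rp.can_fetch("*", "https://example.com/some/page.html"))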

  • [This](https://stackoverflow.com/questions/30681245/robots-txt-file/30681275#:~:text=Web%20site%20owners%20use%20the,Site%20structure) might help... – Kamalesh S Nov 01 '21 at 08:00

1 Answer


A 403 status code is a client-side error; the server is refusing the request as sent, which does not by itself mean the website forbids extracting data. To get rid of the 403 you usually need to send something extra with requests, such as headers, and most of the time (though not always) injecting just a User-Agent header solves the problem. Here is an example of how to inject a User-Agent using the requests module with BeautifulSoup.

import requests
from bs4 import BeautifulSoup

# Many sites reject the default python-requests User-Agent with a 403,
# so send a common desktop-browser User-Agent string instead.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36"
}

response = requests.get("Your url", headers=headers)
print(response.status_code)

# "lxml" needs the lxml package installed; "html.parser" is built in.
soup = BeautifulSoup(response.content, "lxml")
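Once the request succeeds, the soup object can be queried as a quick sanity check; a minimal continuation of the snippet above (the <title> tag is just an illustrative target):

# Continues the snippet above; assumes `response` and `soup` exist.
if response.status_code == 200:
    title = soup.title.get_text(strip=True) if soup.title else None
    print("page title:", title)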
Md. Fazlul Hoque
  • Thanks to you, I solved the problem using a header! I have one more question: I'm going to crawl repeatedly (every minute); wouldn't that be considered a server attack? – lilak0110 Nov 01 '21 at 08:36
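On that follow-up question: one request per minute is a modest rate for most sites, but it is good practice to identify your client, space requests out, and back off when the server pushes back. A rough sketch, with the URL, interval, and contact address as placeholder assumptions:

import time

import requests

HEADERS = {"User-Agent": "study-crawler/0.1 (contact: you@example.com)"}  # placeholder; identify yourself
URL = "https://example.com/some/page.html"  # placeholder target
INTERVAL = 60  # seconds between requests (one per minute)

while True:
    try:
        response = requests.get(URL, headers=HEADERS, timeout=10)
        if response.status_code == 429:  # "Too Many Requests": back off harder
            time.sleep(INTERVAL * 5)
            continue
        response.raise_for_status()
        # ... parse response.content with BeautifulSoup here ...
    except requests.RequestException as exc:
        print(f"request failed: {exc}")
    time.sleep(INTERVAL)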