
I'm very new to web scraping, and as a first project (in order to learn) I wanted to build a database of house prices. Later on I'm going to feed it to ML algorithms to see whether I can predict prices, but I cannot even fetch the page. I'm getting this:

In [1]: fetch("https://www.sahibinden.com")
2020-11-07 01:37:34 [scrapy.core.engine] INFO: Spider opened
2020-11-07 01:37:34 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.sahibinden.com> (failed 1 times): 429 Unknown Status
2020-11-07 01:37:34 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.sahibinden.com> (failed 2 times): 429 Unknown Status
2020-11-07 01:37:34 [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <GET https://www.sahibinden.com> (failed 3 times): 429 Unknown Status
2020-11-07 01:37:34 [scrapy.core.engine] DEBUG: Crawled (429) <GET https://www.sahibinden.com> (referer: None)

The last Crawled (429) message yields an error page, which is obviously not the page I'm looking for. I get 200 from every other website; only this one is problematic. Is there a way to fix this?

Aras Uludağ

3 Answers


A 429 HTTP status code means "too many requests": your requests to this site have hit its rate limit. Many services define a requests-per-second limit to protect themselves from DoS, so you have to pause between your requests. But for how long? You'll need to experiment to find an appropriate sleeping/pausing time. The pause can come after every request or after a batch of requests. You can use time.sleep() for pausing.
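A minimal sketch of both options, assuming a standard Scrapy project (the delay values are guesses you would need to tune for this site):

    # settings.py -- let Scrapy throttle itself between requests
    DOWNLOAD_DELAY = 5               # wait ~5 seconds between requests (a guess; tune it)
    RANDOMIZE_DOWNLOAD_DELAY = True  # vary the delay (0.5x-1.5x) so it looks less robotic
    AUTOTHROTTLE_ENABLED = True      # adapt the delay to the server's response times

    # or, in plain Python, pause manually after each request
    import time
    time.sleep(5)  # sleep 5 seconds before the next request (tune as needed)

DOWNLOAD_DELAY, RANDOMIZE_DOWNLOAD_DELAY, and AUTOTHROTTLE_ENABLED are all standard Scrapy settings.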

Pouya Esmaeili
  • Thank you for the response. Yeah, I've read that, but the thing is, I'm getting this error on the very first request: the 429 is the response to fetch("https://sahibinden.com/"). That's why I'm so confused. So weird. – Aras Uludağ Nov 06 '20 at 22:35
  • Define the User-Agent field of your requests as a browser; some services validate this field to block scraping (see the sketch below). – Pouya Esmaeili Nov 07 '20 at 15:42
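A sketch of that comment's suggestion, assuming a Scrapy project (the User-Agent string below is just an example desktop-Chrome string, not anything specific to this site):

    # settings.py -- present the spider to the server as a regular browser
    USER_AGENT = (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/86.0.4240.111 Safari/537.36"
    )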

I had the same problem, and it was resolved by following the steps in this thread. It contains code that reacts to a 429 response by waiting a while before sending a new request:

How to handle a 429 Too Many Requests response in Scrapy?
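In the spirit of that thread, here is a rough sketch of a retry middleware that backs off on 429 (the module path myproject.middlewares and the 60-second pause are assumptions; note that the linked answer pauses the Scrapy engine instead of calling time.sleep(), which blocks Scrapy's event loop):

    import time

    from scrapy.downloadermiddlewares.retry import RetryMiddleware
    from scrapy.utils.response import response_status_message

    class TooManyRequestsRetryMiddleware(RetryMiddleware):
        """Wait before retrying whenever the server answers 429."""

        def process_response(self, request, response, spider):
            if request.meta.get("dont_retry", False):
                return response
            if response.status == 429:
                time.sleep(60)  # crude backoff; 60s is a guess, tune it
                reason = response_status_message(response.status)
                # _retry returns None once max retries are exhausted
                return self._retry(request, reason, spider) or response
            return response

and enable it in settings.py in place of the stock retry middleware:

    DOWNLOADER_MIDDLEWARES = {
        "scrapy.downloadermiddlewares.retry.RetryMiddleware": None,
        "myproject.middlewares.TooManyRequestsRetryMiddleware": 543,  # hypothetical path
    }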

xqzy

If you are new to web scraping, you should know that HTTP 429 is going to be your new friend.

However, I've found a nice workaround for IP blocking when scraping sites: run the scraper from Google App Engine and redeploy it automatically whenever you get a 429, which lets it run indefinitely.

Check out my article here.