I have a Scrapy crawler that scrapes data from a website and uploads it to a remote MongoDB server. I want to host it on Heroku so that it can keep scraping unattended over a long period.
I am using scrapy-user-agents to rotate between different user agents.
When I run scrapy crawl <spider> locally on my PC, the spider runs correctly and writes data to the MongoDB database.
However, when I deploy the project on Heroku, I get the following lines in my Heroku logs:
2020-12-22T12:50:21.132731+00:00 app[web.1]: 2020-12-22 12:50:21 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://indiankanoon.org/browse/> (failed 1 times): 503 Service Unavailable
2020-12-22T12:50:21.134186+00:00 app[web.1]: 2020-12-22 12:50:21 [scrapy_user_agents.middlewares] DEBUG: Assigned User-Agent Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36
(it fails the same way 9 times in total, until:)
2020-12-22T12:50:23.594655+00:00 app[web.1]: 2020-12-22 12:50:23 [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <GET https://indiankanoon.org/browse/> (failed 9 times): 503 Service Unavailable
2020-12-22T12:50:23.599310+00:00 app[web.1]: 2020-12-22 12:50:23 [scrapy.core.engine] DEBUG: Crawled (503) <GET https://indiankanoon.org/browse/> (referer: None)
2020-12-22T12:50:23.701386+00:00 app[web.1]: 2020-12-22 12:50:23 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <503 https://indiankanoon.org/browse/>: HTTP status code is not handled or not allowed
2020-12-22T12:50:23.714834+00:00 app[web.1]: 2020-12-22 12:50:23 [scrapy.core.engine] INFO: Closing spider (finished)
In summary, the crawl succeeds from my local IP address but fails with 503 Service Unavailable when it runs on Heroku. Can this be fixed by changing something in settings.py?
My settings.py file:
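To check whether the 503 depends on where the request originates rather than on Scrapy itself, a small standalone check could be run both locally and on the dyno. This is only a minimal sketch using the standard library: build_request and fetch_status are hypothetical helper names, and the User-Agent string is simply the one that appears in the log above.

```python
import urllib.error
import urllib.request

URL = "https://indiankanoon.org/browse/"
# User-Agent copied from the Heroku log above; any browser-like value would do.
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36"
)


def build_request(url, user_agent=USER_AGENT):
    """Build a GET request that carries a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_status(url, timeout=10):
    """Return the HTTP status code for url, or None if no response arrived."""
    try:
        with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx, but the status code is still available.
        return err.code
    except OSError:
        # DNS failure, refused connection, timeout, and similar network errors.
        return None


if __name__ == "__main__":
    print(URL, fetch_status(URL))
```

Saved under a hypothetical name such as check_status.py, this could be run with python check_status.py locally and with heroku run python check_status.py on Heroku. If the local run prints 200 and the Heroku run prints 503, the difference is probably tied to the network the request comes from, not to the spider code.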
BOT_NAME = 'indKanoon'
SPIDER_MODULES = ['indKanoon.spiders']
NEWSPIDER_MODULE = 'indKanoon.spiders'
MONGO_URI = ''
MONGO_DATABASE = 'casecounts'
ROBOTSTXT_OBEY = False
CONCURRENT_REQUESTS = 32
DOWNLOAD_DELAY = 3
COOKIES_ENABLED = False
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy_user_agents.middlewares.RandomUserAgentMiddleware': 400,
}
ITEM_PIPELINES = {
    'indKanoon.pipelines.IndkanoonPipeline': 300,
}
RETRY_ENABLED = True
RETRY_TIMES = 8
RETRY_HTTP_CODES = [500, 502, 503, 504, 522, 524, 408]
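For reference, the "failed 9 times" in the log matches RETRY_TIMES = 8: one initial attempt plus eight retries. Below is a sketch of settings that could be tried on top of the file above. It assumes, without confirmation from the logs, that the 503 is caused either by the request rate or by the site treating Heroku's IP range differently. The specific numbers are illustrative guesses, not tuned values.

```python
# settings.py additions (sketch): things to try, not a confirmed fix.

# Let Scrapy adapt the delay to the server's responses, in case the 503
# is rate-related. These are standard Scrapy AutoThrottle settings.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5
AUTOTHROTTLE_MAX_DELAY = 60
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0

# Far fewer parallel requests than the current value of 32.
CONCURRENT_REQUESTS = 4

# Scrapy's built-in HttpProxyMiddleware is enabled by default and reads the
# http_proxy / https_proxy environment variables. If the block is IP-based,
# pointing those variables at a proxy (for example via Heroku config vars)
# would send requests from a different address. No proxy is configured here.
HTTPPROXY_ENABLED = True
```

If the 503 persists with throttling enabled and only disappears when the requests go through a proxy, that would suggest the block is based on the source IP and cannot be resolved through request pacing alone.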