When using Scrapy to scrape a site, I received a 503 Service Unavailable error right away (I could not even start scraping any items). After finding this thread:

How to bypass cloudflare bot/ddos protection in Scrapy?

I assumed the problem was CloudFlare, so I added the following code to my spider. It uses cfscrape and comes from one of the answers in that thread:

def start_requests(self):
    cf_requests = []
    for url in self.start_urls:
        token, agent = cfscrape.get_tokens(url, USER_AGENT)
        #token, agent = cfscrape.get_tokens(url)
        cf_requests.append(scrapy.Request(url=url, cookies={'__cfduid': token['__cfduid']}, headers={'User-Agent': agent}))
        print "useragent in cfrequest: " , agent
        print "token in cfrequest: ", token
    return cf_requests

Looking at the output, it seems this workaround does execute the JavaScript that CloudFlare uses for DDoS protection, but I still get a 503 error afterwards. Here is the debug output:

2015-11-04 23:07:12 [scrapy] INFO: Scrapy 1.0.3 started (bot: forumscrape)
2015-11-04 23:07:12 [scrapy] INFO: Optional features available: ssl, http11, boto
2015-11-04 23:07:12 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'forumscrape.spiders', 'CONCURRENT_REQUESTS_PER_DOMAIN': 1, 'CONCURRENT_REQUESTS': 1, 'SPIDER_MODULES': ['forumscrape.spiders'], 'CONCURRENT_REQUESTS_PER_IP': 1, 'BOT_NAME': 'forumscrape', 'COOKIES_ENABLED': False, 'DOWNLOAD_DELAY': 1}
2015-11-04 23:07:12 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2015-11-04 23:07:13 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-11-04 23:07:13 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-11-04 23:07:13 [scrapy] INFO: Enabled item pipelines:
2015-11-04 23:07:13 [requests.packages.urllib3.connectionpool] INFO: Starting new HTTP connection (1): sampleforum.com
2015-11-04 23:07:13 [requests.packages.urllib3.connectionpool] DEBUG: "GET /forumdisplay.php?29-Chat HTTP/1.1" 503 None
2015-11-04 23:07:18 [requests.packages.urllib3.connectionpool] INFO: Starting new HTTP connection (1): sampleforum.com
2015-11-04 23:07:18 [requests.packages.urllib3.connectionpool] DEBUG: "GET /cdn-cgi/l/chk_jschl?jschl_answer=-8011&jschl_vc=6b1abd999393b114b8eea35ff2be9e55&pass=1446696428.397-J92apQZ8k3 HTTP/1.1" 302 165
2015-11-04 23:07:18 [requests.packages.urllib3.connectionpool] DEBUG: "GET /forumdisplay.php?29-Chat HTTP/1.1" 200 21403
useragent in cfrequest:  Mozilla/5.0 (Windows NT 6.1; WOW64; rv:41.0) Gecko/20100101 Firefox/41.0
token in cfrequest:  {'cf_clearance': '037ab6c531be7e6fa6c3d0a98c988f57d17fd781-1446696429-604800', '__cfduid': 'd2635be16360da698f9dd07e4929690ed1446696424'}
2015-11-04 23:07:18 [scrapy] INFO: Spider opened
2015-11-04 23:07:18 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-11-04 23:07:18 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-11-04 23:07:18 [scrapy] DEBUG: Retrying <GET http://sampleforum.com/forumdisplay.php?29-Chat> (failed 1 times): 503 Service Unavailable
2015-11-04 23:07:20 [scrapy] DEBUG: Retrying <GET http://sampleforum.com/forumdisplay.php?29-Chat> (failed 2 times): 503 Service Unavailable
2015-11-04 23:07:21 [scrapy] DEBUG: Gave up retrying <GET http://sampleforum.com/forumdisplay.php?29-Chat> (failed 3 times): 503 Service Unavailable
2015-11-04 23:07:21 [scrapy] DEBUG: Crawled (503) <GET http://sampleforum.com/forumdisplay.php?29-Chat> (referer: None)
2015-11-04 23:07:21 [scrapy] DEBUG: Ignoring response <503 http://sampleforum.com/forumdisplay.php?29-Chat>: HTTP status code is not handled or not allowed
2015-11-04 23:07:21 [scrapy] INFO: Closing spider (finished)
2015-11-04 23:07:21 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 828,
 'downloader/request_count': 3,
 'downloader/request_method_count/GET': 3,
 'downloader/response_bytes': 14276,
 'downloader/response_count': 3,
 'downloader/response_status_count/503': 3,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2015, 11, 5, 4, 7, 21, 363000),
 'log_count/DEBUG': 9,
 'log_count/INFO': 9,
 'response_received_count': 1,
 'scheduler/dequeued': 3,
 'scheduler/dequeued/memory': 3,
 'scheduler/enqueued': 3,
 'scheduler/enqueued/memory': 3,
 'start_time': datetime.datetime(2015, 11, 5, 4, 7, 18, 755000)}
2015-11-04 23:07:21 [scrapy] INFO: Spider closed (finished)

The site loads fine in my browser (with the same user agent). Other sites that I scrape in a similar way (just picking up some text) work. Is there another reason I'm getting a 503? Any help would be appreciated.

I believe this line: DEBUG: "GET /cdn-cgi/l/chk_jschl?jschl_answer=23986&jschl_vc=2e88b65c8bcf26f39b980de3d5b198ea&pass=1446698472.387-F9OS39Peei HTTP/1.1" 302 165 shows that the CloudFlare JavaScript check is being completed, so perhaps something else is causing the 503?

ddnm
    I had the same problem and tried a lot of things. What worked for me: set `cookie:token` and `header:agent`, check that `settings.py` uses the same `agent`, and finally change `start_requests` to `yield` each request instead of returning a list. Good luck. – Phillip Kamikaze Nov 20 '15 at 06:01
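
The changes described in the comment above could be sketched roughly as follows. This is only one reading of it: I take `cookie:token` to mean passing the whole token dict returned by `cfscrape.get_tokens` (the posted code forwards only `__cfduid` and drops `cf_clearance`), and `header:agent` to mean reusing the returned agent string. `build_cf_request_kwargs` is a hypothetical helper name, not part of Scrapy or cfscrape, and the sketch is written in Python 3 syntax.

```python
def build_cf_request_kwargs(url, tokens, user_agent):
    """Build keyword arguments for scrapy.Request (hypothetical helper).

    tokens and user_agent are the two values returned by
    cfscrape.get_tokens(url, USER_AGENT) in the question's code.
    """
    return {
        "url": url,
        # Pass every cookie cfscrape obtained (cf_clearance as well as
        # __cfduid), not just one of them.
        "cookies": dict(tokens),
        # Reuse the exact agent string that solved the challenge.
        "headers": {"User-Agent": user_agent},
    }


# Intended use inside the spider (needs scrapy and cfscrape installed):
#
#     def start_requests(self):
#         for url in self.start_urls:
#             token, agent = cfscrape.get_tokens(url, USER_AGENT)
#             yield scrapy.Request(**build_cf_request_kwargs(url, token, agent))
```

Note that the log in the question lists 'COOKIES_ENABLED': False among the overridden settings. As far as I know, Scrapy's cookies middleware is not loaded in that case, so the `cookies` argument would not be sent at all; it is probably worth enabling cookies before testing this.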

1 Answer

You can try using Splash to get around CloudFlare.
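
A minimal sketch of what that could look like, assuming a Splash instance is already running and reachable at http://localhost:8050 (an assumption, adjust to your setup). It only builds the URL for Splash's `render.html` endpoint, which returns the page HTML after JavaScript has run; whether that is enough to pass CloudFlare's check on a given site is not something this answer can confirm. `splash_render_url` is a hypothetical helper name, the `wait` value is a guess, and the code uses Python 3's `urllib.parse`.

```python
from urllib.parse import urlencode


def splash_render_url(splash_base, target_url, wait=5):
    """Return a Splash render.html URL for target_url (hypothetical helper).

    splash_base: address of a running Splash instance (assumed).
    wait: seconds Splash waits after page load before returning the HTML.
    """
    query = urlencode({"url": target_url, "wait": wait})
    return splash_base.rstrip("/") + "/render.html?" + query


print(splash_render_url("http://localhost:8050",
                        "http://sampleforum.com/forumdisplay.php?29-Chat"))
```

The resulting URL can be requested like any other page, for example by yielding `scrapy.Request` objects for it from `start_requests`, so the spider parses the HTML that Splash rendered instead of the raw 503 challenge page.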

Verz1Lka