
I am a newbie to Python Scrapy and wrote a simple script to crawl posts from my school's bbs. However, when my spider runs, it gets error messages like this:

2015-03-28 11:16:52+0800 [nju_spider] DEBUG: Retrying <GET http://bbs.nju.edu.cn/bbstcon?board=WarAndPeace&file=M.1427299332.A> (failed 2 times): [>]
2015-03-28 11:16:52+0800 [nju_spider] DEBUG: Gave up retrying <GET http://bbs.nju.edu.cn/bbstcon?board=WarAndPeace&file=M.1427281812.A> (failed 3 times): [>]
2015-03-28 11:16:52+0800 [nju_spider] ERROR: Error downloading <GET http://bbs.nju.edu.cn/bbstcon?board=WarAndPeace&file=M.1427281812.A>: [>]

2015-03-28 11:16:56+0800 [nju_spider] INFO: Dumping Scrapy stats:
    {'downloader/exception_count': 99,
     'downloader/exception_type_count/twisted.web._newclient.ResponseFailed': 99,
     'downloader/request_bytes': 36236,
     'downloader/request_count': 113,
     'downloader/request_method_count/GET': 113,
     'downloader/response_bytes': 31135,
     'downloader/response_count': 14,
     'downloader/response_status_count/200': 14,
     'dupefilter/filtered': 25,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2015, 3, 28, 3, 16, 56, 677065),
     'item_scraped_count': 11,
     'log_count/DEBUG': 127,
     'log_count/ERROR': 32,
     'log_count/INFO': 8,
     'request_depth_max': 3,
     'response_received_count': 14,
     'scheduler/dequeued': 113,
     'scheduler/dequeued/memory': 113,
     'scheduler/enqueued': 113,
     'scheduler/enqueued/memory': 113,
     'start_time': datetime.datetime(2015, 3, 28, 3, 16, 41, 874807)}
2015-03-28 11:16:56+0800 [nju_spider] INFO: Spider closed (finished)

It seems that the spider tries the url but fails, even though the url really does exist. There are about thousands of posts on the bbs, but every time I run my spider it only gets a random few of them. My code is as follows, and I'd really appreciate your help:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor

from ScrapyTest.items import NjuPostItem


class NjuSpider(CrawlSpider):
    name = 'nju_spider'
    allowed_domains = ['bbs.nju.edu.cn']
    start_urls = ['http://bbs.nju.edu.cn/bbstdoc?board=WarAndPeace']
    rules = [
        # links to individual posts: parse them with parse_post
        Rule(LinkExtractor(allow=['bbstcon\?board=WarAndPeace&file=M\.\d+\.A']),
             callback='parse_post'),
        # pagination links: follow them to reach more post listings
        Rule(LinkExtractor(allow=['bbstdoc\?board=WarAndPeace&start=\d+']),
             follow=True),
    ]

    def parse_post(self, response):
        # self.log('A response from %s just arrived!' % response.url)
        post = NjuPostItem()
        post['url'] = response.url
        post['title'] = 'to_do'
        post['content'] = 'to_do'
        return post
1 Answer


First, make sure you are not violating the web-site's Terms of Use by taking the web-scraping approach. Be a good web-scraping citizen.

Next, you can set the User-Agent header to pretend to be a browser. Either provide a User-Agent in the DEFAULT_REQUEST_HEADERS setting:

DEFAULT_REQUEST_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.104 Safari/537.36'
}

or, you can rotate User-Agents with a middleware; I've implemented one based on the fake-useragent package.
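A minimal sketch of the idea (the RandomUserAgentMiddleware class name and the ScrapyTest.middlewares module path below are placeholders, and it assumes the fake-useragent package is installed):

from fake_useragent import UserAgent


class RandomUserAgentMiddleware(object):
    """Downloader middleware that picks a random real-browser User-Agent."""

    def __init__(self):
        self.ua = UserAgent()

    def process_request(self, request, spider):
        # set a freshly randomized User-Agent header on every outgoing request
        request.headers['User-Agent'] = self.ua.random

Then enable it in settings.py, replacing the built-in user-agent middleware:

DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
    'ScrapyTest.middlewares.RandomUserAgentMiddleware': 400,
}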


Another possible problem could be that you are hitting the web-site too often; consider tweaking the DOWNLOAD_DELAY setting:

The amount of time (in secs) that the downloader should wait before downloading consecutive pages from the same website. This can be used to throttle the crawling speed to avoid hitting servers too hard.
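For example, in the project's settings.py (the value is illustrative; start small and tune):

DOWNLOAD_DELAY = 2  # seconds to wait between requests to the same site (default is 0)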

There is another relevant setting that can have a positive impact: CONCURRENT_REQUESTS:

The maximum number of concurrent (ie. simultaneous) requests that will be performed by the Scrapy downloader.
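For instance (the exact number is only a suggestion to experiment with):

CONCURRENT_REQUESTS = 4  # default is 16; fewer parallel requests is gentler on the server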

  • Thanks a lot! It seems that the website I want to scrape forbids fetching too often, so after I set DOWNLOAD_DELAY to 2, it works well. – Ron Mar 30 '15 at 08:36