I am a newbie to python scrapy, and wrote a simple script to crawl posts from my school's bbs. However, when my spider runs, it get error messages like this:
015-03-28 11:16:52+0800 [nju_spider] DEBUG: Retrying http://bbs.nju.edu.cn/bbstcon?board=WarAndPeace&file=M.1427299332.A> (failed 2 times): [>] 2015-03-28 11:16:52+0800 [nju_spider] DEBUG: Gave up retrying http://bbs.nju.edu.cn/bbstcon?board=WarAndPeace&file=M.1427281812.A> (failed 3 times): [>] 2015-03-28 11:16:52+0800 [nju_spider] ERROR: Error downloading http://bbs.nju.edu.cn/bbstcon?board=WarAndPeace&file=M.1427281812.A>: [>]
2015-03-28 11:16:56+0800 [nju_spider] INFO: Dumping Scrapy stats: {'downloader/exception_count': 99, 'downloader/exception_type_count/twisted.web._newclient.ResponseFailed': 99, 'downloader/request_bytes': 36236, 'downloader/request_count': 113, 'downloader/request_method_count/GET': 113, 'downloader/response_bytes': 31135, 'downloader/response_count': 14, 'downloader/response_status_count/200': 14, 'dupefilter/filtered': 25, 'finish_reason': 'finished', 'finish_time': datetime.datetime(2015, 3, 28, 3, 16, 56, 677065), 'item_scraped_count': 11, 'log_count/DEBUG': 127, 'log_count/ERROR': 32, 'log_count/INFO': 8, 'request_depth_max': 3, 'response_received_count': 14, 'scheduler/dequeued': 113, 'scheduler/dequeued/memory': 113, 'scheduler/enqueued': 113, 'scheduler/enqueued/memory': 113, 'start_time': datetime.datetime(2015, 3, 28, 3, 16, 41, 874807)} 2015-03-28 11:16:56+0800 [nju_spider] INFO: Spider closed (finished)
It seems that the spider tries the url but fails, but this url does really exists. And there are about thousands of posts in the bbs, but every time I ran my spider, it can only get a random few of them. My code is like following, and really appreciate for your help
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from ScrapyTest.items import NjuPostItem
class NjuSpider(CrawlSpider):
name = 'nju_spider'
allowed_domains = ['bbs.nju.edu.cn']
start_urls = ['http://bbs.nju.edu.cn/bbstdoc?board=WarAndPeace']
rules = [Rule(LinkExtractor(allow=['bbstcon\?board=WarAndPeace&file=M\.\d+\.A']),
callback='parse_post'),
Rule(LinkExtractor(allow=['bbstdoc\?board=WarAndPeace&start=\d+']),
follow=True)]
def parse_post(self, response):
# self.log('A response from %s just arrived!' % response.url)
post = NjuPostItem()
post['url'] = response.url
post['title'] = 'to_do'
post['content'] = 'to_do'
return post