
I use scrapy-redis to build a simple distributed crawler. The slave machines need to read URLs from the master's Redis queue, but the problem is that the URLs the slaves get back are cPickle-serialized data. I want to read plain URLs from the Redis queue correctly. What do you suggest?

Example:

from scrapy_redis.spiders import RedisSpider
from example.items import ExampleLoader


class MySpider(RedisSpider):
    """Spider that reads urls from a redis queue (myspider:start_urls)."""
    name = 'redisspider'
    redis_key = 'wzws:requests'

    def __init__(self, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)

    def parse(self, response):
        el = ExampleLoader(response=response)
        el.add_xpath('name', '//title[1]/text()')
        el.add_value('url', response.url)
        return el.load_item()

MySpider inherits from RedisSpider. When I run scrapy runspider myspider_redis.py, it complains that the URLs are not legal.

scrapy-redis GitHub repository: https://github.com/rmax/scrapy-redis

rowele
  • Yeah, the logs show NotSupported: Unsupported URL scheme '': no handler available for that scheme. The url I get is cPickle data. – rowele Mar 22 '16 at 01:42

1 Answer


There are a few internal queues used in scrapy-redis. One is for start URLs (by default <spider>:start_urls), another for shared requests (by default <spider>:requests), and another for the dupefilter.
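To see the difference, you can inspect those keys with redis-py (a quick sketch, assuming a local Redis and a spider named myspider):

import redis

r = redis.StrictRedis(host='localhost', port=6379)

# start_urls is a plain list of URL strings that RedisSpider pops from
print(r.lrange('myspider:start_urls', 0, -1))

# the requests key holds pickled Request objects managed by the scheduler;
# its Redis type depends on the configured queue class (list or sorted set)
print(r.type('myspider:requests'))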

The start URLs queue and the requests queue can't be the same: the start URLs queue expects plain string values, whereas the requests queue holds pickled request data.
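To feed the spider, push plain URL strings into the start URLs key, for example with redis-py (a minimal sketch, assuming a local Redis and the default key for a spider named redisspider):

import redis

r = redis.StrictRedis(host='localhost', port=6379)

# push a plain URL string; RedisSpider pops it and builds the Request itself
r.lpush('redisspider:start_urls', 'http://example.com/')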

So, you should not be using <spider>:requests as the redis_key in your spider.
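Concretely, point redis_key at a key that holds plain URLs instead (a sketch of the corrected attribute; the key name wzws:start_urls is just an illustrative choice):

from scrapy_redis.spiders import RedisSpider

class MySpider(RedisSpider):
    name = 'redisspider'
    # read plain URL strings from here; leave wzws:requests to the scheduler
    redis_key = 'wzws:start_urls'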

Let me know if this helps, otherwise please share the messages in the redis_key queue.

R. Max