How can I send the scraped URLs from one spider to the start_urls of another spider?

Specifically, I want to run one spider which gets a list of URLs from an XML page. After the URLs have been retrieved, I want them to be used by another spider for scraping.

from scrapy.spiders import SitemapSpider

class Daily(SitemapSpider):
    name = 'daily'
    sitemap_urls = ['http://example.com/sitemap.xml']

    def parse(self, response):
        print(response.url)

        # How do I send these URLs to another spider instead?

        yield {
            'url': response.url
        }
AppTest

3 Answers

From the first spider, you can save the URLs in a database or send them to a queue (ZeroMQ, RabbitMQ, Redis), for example via an item pipeline.

The second spider can then get the URLs in its start_requests method:

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'

    def start_requests(self):
        # my_db is a placeholder for whatever storage the first spider wrote to
        urls = my_db.orm.get('urls')
        for url in urls:
            yield scrapy.Request(url)

Alternatively, the URLs can be passed to the spider from a queue broker via the CLI or an API. Or the broker can launch the spider itself, and the launched spider then fetches its URLs in start_requests.

There are many ways to do this. Which one fits depends on why you need to pass URLs from one spider to another.

You can check these projects: Scrapy-Cluster and Scrapy-Redis. They may be what you are looking for.
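As a rough illustration of the "save the URLs via a pipeline" idea, here is a minimal sketch that uses SQLite from the standard library as the shared store. The class name UrlStorePipeline, the helper load_urls, and the file name urls.db are all assumptions made for this example, not part of Scrapy or of the answer above. It relies only on the standard item pipeline hooks (open_spider, process_item, close_spider).

```python
import sqlite3


class UrlStorePipeline:
    """Hypothetical item pipeline: stores each item's 'url' in a SQLite table."""

    def __init__(self, db_path="urls.db"):
        # Assumed default file name; pick whatever path both spiders can reach.
        self.db_path = db_path

    def open_spider(self, spider):
        # Called by Scrapy when the first spider starts.
        self.conn = sqlite3.connect(self.db_path)
        self.conn.execute("CREATE TABLE IF NOT EXISTS urls (url TEXT PRIMARY KEY)")

    def process_item(self, item, spider):
        # INSERT OR IGNORE skips URLs that are already stored.
        self.conn.execute("INSERT OR IGNORE INTO urls (url) VALUES (?)", (item["url"],))
        return item

    def close_spider(self, spider):
        # Persist everything once the first spider finishes.
        self.conn.commit()
        self.conn.close()


def load_urls(db_path="urls.db"):
    """Return all stored URLs, sorted; intended for the second spider."""
    conn = sqlite3.connect(db_path)
    try:
        return [row[0] for row in conn.execute("SELECT url FROM urls ORDER BY url")]
    finally:
        conn.close()
```

The pipeline would be enabled for the first spider through the ITEM_PIPELINES setting, and the second spider's start_requests could loop over load_urls() and yield a scrapy.Request for each entry. A real queue (Redis, RabbitMQ) would replace the SQLite calls but keep the same shape.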

Alisher Gafurov

Write the URLs to a file as strings. Read them from the same file in the other spider.
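A minimal sketch of that file hand-off, assuming one URL per line in a plain text file. The helper names append_url and read_urls are made up for this example:

```python
from pathlib import Path


def append_url(path, url):
    """Append one URL per line; the first spider could call this from parse()."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(url + "\n")


def read_urls(path):
    """Return the non-empty lines of the file as a list of URL strings."""
    text = Path(path).read_text(encoding="utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]
```

The second spider could then yield a scrapy.Request for each entry of read_urls(...) inside start_requests. This only works if the first spider has finished writing before the second one starts.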

DYZ

Why do you want to use different spiders for such a requirement?

You can have just one spider and, instead of passing the URLs to another spider, yield another Request in your parse method.

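from scrapy import Request
from scrapy.spiders import SitemapSpider

class Daily(SitemapSpider):
    name = 'daily'
    sitemap_urls = ['http://example.com/sitemap.xml']

    def parse(self, response):
        # Called for each URL found in the sitemap; follow up in the same spider
        yield Request(url="URL here", callback=self.parse_page)

    def parse_page(self, response):
        # Scrape the followed page here
        yield {'url': response.url}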
Umair Ayub
  • For example, spider A might be specialized to deal with a particular website structure, and spider B might be generic or specialized to deal with a different website structure. This answer doesn't answer the question at all. – maestromusica Aug 09 '17 at 08:53
  • @maestromusica Have a simple if-else condition to determine the domain, and use different logic for each website. – Umair Ayub Aug 09 '17 at 08:54
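
The if-else idea from the last comment might look roughly like the sketch below. The function name pick_parser and the method names parse_example and parse_generic are hypothetical, and example.com stands in for whichever site needs special handling:

```python
from urllib.parse import urlparse


def pick_parser(url):
    """Return the name of the parsing method to use for this URL's domain."""
    # hostname is lowercased and excludes any port; it is None for malformed URLs
    host = urlparse(url).hostname or ""
    if host == "example.com" or host.endswith(".example.com"):
        return "parse_example"  # site-specific logic
    else:
        return "parse_generic"  # fallback for every other site
```

Inside a single spider's parse method, this could be used as getattr(self, pick_parser(response.url))(response), provided the spider defines both methods.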