
I am working on a crawling project using Scrapy, and I need to distribute my spiders across different nodes in a cluster to speed up the process. I am using ScrapydWeb to manage it, and I have already configured two machines, one of them with ScrapydWeb up and both with Scrapyd up. The Web App recognizes both and I can run my spider properly. The problem is that the crawl only runs in parallel (the same content is fetched by both machines), whereas my goal was to distribute the work between the nodes to minimize the crawling time.

Could anybody help me? Thank you in advance.

2 Answers


I don't think Scrapyd & ScrapydWeb offer the possibility of running a single spider distributed across different servers; they just run the same spider in full on each server. If you want to distribute the crawling you can either:

  • Run 1 spider only on 1 server
  • If you need actual distributed crawling (where the same spider runs across different machines without multiple machines parsing the same url), you can look into Scrapy-Cluster
  • You can write custom code where you have 1 process generating the URLs to scrape on one side, put the found URLs in a queue (using Redis, for example), and have multiple servers popping URLs from this queue to fetch & parse the page (see the sketch below this list)
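
A rough sketch of that last option, assuming a Redis instance reachable from every node; the queue name, host, and fetch/parse logic are placeholders, and a real setup would also deduplicate URLs:

    # Shared-queue sketch: one producer pushes URLs into Redis, and any number
    # of workers (on any machine) pop and fetch them. Names are placeholders.
    import redis
    import requests

    r = redis.Redis(host="redis.example.com", port=6379, decode_responses=True)

    def produce(urls):
        """Push seed URLs onto the shared queue."""
        for url in urls:
            r.lpush("crawl:queue", url)

    def worker():
        """Run this loop on every crawling node."""
        while True:
            item = r.brpop("crawl:queue", timeout=30)  # blocks until a URL arrives
            if item is None:
                break  # queue drained, stop this worker
            _, url = item
            response = requests.get(url, timeout=10)
            # Parse the page here; push newly discovered links back onto the
            # queue (r.lpush("crawl:queue", new_url)) so other nodes can take them.
            print(url, response.status_code)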

I used Scrapy Cluster to solve the problem and I'm sharing my experience:

The Docker installation was hard for me to control and debug, so I tried the Cluster Quick-start instead and it worked better.

I have five machines available in my cluster: I used one to host Apache Kafka together with ZooKeeper, and another one for the Redis DB. It's important to make sure those machines accept external connections from the ones you are going to use for spidering.

Once these three components were properly installed and running, I installed Scrapy Cluster's requirements in a Python 3.6 environment. Then I configured a local settings file with the IP addresses of those hosts and made sure all the offline and online tests passed.
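
For illustration, my local settings file looked roughly like the sketch below. The exact setting names and formats can vary between Scrapy Cluster versions and between the kafka-monitor, redis-monitor and crawler components, so check the settings.py shipped with each one; the IP addresses here are placeholders.

    # localsettings.py (sketch): point each Scrapy Cluster component at the
    # shared infrastructure. Hosts below are placeholders for my cluster machines.
    REDIS_HOST = "10.0.0.2"            # machine running the Redis DB
    REDIS_PORT = 6379

    KAFKA_HOSTS = ["10.0.0.1:9092"]    # machine running Apache Kafka
    ZOOKEEPER_HOSTS = "10.0.0.1:2181"  # ZooKeeper, on the same machine in my case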

With everything set up, I was able to run the first spider (the official documentation provides an example). The idea is that you start several instances of your spider (you can, for example, use tmux to open 10 different terminal windows and run one instance in each). When you feed Apache Kafka a URL to be crawled, it is pushed to a queue in Redis, from which your instances periodically pick up new pages to crawl.
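
The usual way to submit a crawl request is the kafka_monitor.py feed command shown in the documentation. As a rough illustration of what that request contains, here is a small kafka-python producer sketch; the topic name demo.incoming was the default in the version I used, and the broker address, appid and crawlid are just example values:

    # Sketch: submit a crawl request to Scrapy Cluster's incoming Kafka topic.
    import json
    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers="10.0.0.1:9092",  # placeholder Kafka broker
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )

    request = {
        "url": "http://example.com",  # page to crawl
        "appid": "testapp",           # identifies your application
        "crawlid": "abc1234",         # identifies this particular crawl
    }
    producer.send("demo.incoming", request)
    producer.flush()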

If your spider extracts more URLs from the page you passed initially, they go back to Redis and may be picked up by other instances. That's where you can see the power of this distribution.

Once a page is crawled, the result is sent to a Kafka topic.
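
To inspect the results, you can use the kafkadump.py utility from the documentation, or consume the output topic directly. A minimal consumer sketch, assuming the default topic name demo.crawled_firehose and a placeholder broker (the exact fields in each message may differ by version):

    # Sketch: read crawled results from Scrapy Cluster's output topic.
    import json
    from kafka import KafkaConsumer

    consumer = KafkaConsumer(
        "demo.crawled_firehose",
        bootstrap_servers="10.0.0.1:9092",  # placeholder Kafka broker
        auto_offset_reset="earliest",
    )

    for message in consumer:
        item = json.loads(message.value)
        print(item.get("url"), item.get("status_code"))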

The official documentation is extensive, and you can find more details on the installation and setup there.