Scrapy and Gearman

Question

I am using Scrapy 1.0.5 and Gearman to create distributed spiders. The idea is to build a spider, call it from a gearman worker script and pass 20 URLs at a time to crawl from a gearman client to the worker and then to the spider.

I am able to start the worker, pass URLs to it from the client on to the spider to crawl. The first URL or array of URLs do get picked up and crawled. Once the spider is done, I am unable to reuse it. I get the log message that the spider is closed. When I initiate the client again, the spider reopens, but doesn't crawl.

Here is my worker:

import gearman
import json
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings


gm_worker = gearman.GearmanWorker(['localhost:4730'])

def task_listener_reverse(gearman_worker, gearman_job):
    process = CrawlerProcess(get_project_settings())

    data = json.loads(gearman_job.data)
    if(data['vendor_name'] == 'walmart'):
        process.crawl('walmart', url=data['url_list'])
        process.start() # the script will block here until the crawling is finished
        return 'completed'

# gm_worker.set_client_id is optional
gm_worker.set_client_id('python-worker')
gm_worker.register_task('reverse', task_listener_reverse)

# Enter our work loop and call gm_worker.after_poll() after each time we timeout/see socket activity
gm_worker.work()

Here is the code of my Spider.

    from crawler.items import CrawlerItemLoader
from scrapy.spiders import Spider




class WalmartSpider(Spider):
    name = "walmart"

    def __init__(self, **kw):
        super(WalmartSpider, self).__init__(**kw)
        self.start_urls = kw.get('url')
        self.allowed_domains = ["walmart.com"]

    def parse(self, response):

        item = CrawlerItemLoader(response=response)

        item.add_value('url', response.url)


        #Title
        item.add_xpath('title', '//div/h1/span/text()')

        if(response.xpath('//div/h1/span/text()')):
            title = response.xpath('//div/h1/span/text()')


        item.add_value('title', title)

        yield item.load_item()

The first client run produces results and I get the data I need whether it was a single URL or multiple URLs.

On the second run, the spider opens and no results. This is what I get back and it stops

    2016-02-19 01:16:30 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-02-19 01:16:30 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-02-19 01:16:30 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-02-19 01:16:30 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-02-19 01:16:30 [scrapy] INFO: Enabled item pipelines: MySQLStorePipeline
2016-02-19 01:16:30 [scrapy] INFO: Enabled item pipelines: MySQLStorePipeline
2016-02-19 01:16:30 [scrapy] INFO: Spider opened
2016-02-19 01:16:30 [scrapy] INFO: Spider opened
2016-02-19 01:16:30 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-02-19 01:16:30 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-02-19 01:16:30 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6047
2016-02-19 01:16:30 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6047

I was able to print the URL or URLs from the worker and spider and ensured they were getting passed on the first working run and second non working run. I spent 2 days and haven't gotten anywhere with it. I would appreciate any pointers.

Let me provide more details. I am not married to scrapy. At this point, I am willing to take it out, but I haven't found anything better yet. Gearman on the other hand has worked pretty well. Here is what I am trying to accomplish. I'd like to build a spider, deploy it to multiple machines, start up multiple gearman workers to receive URLs and have the workers call the spider to crawl from the script. I have been able to do this with everything else I have (External APIs), but scrapy continues to fail :( Any direction would help including a scrapy alternative. — user1135662, Feb 20 '16 at 23:13
I also looked into scrapyd. It looks like it's more for broad crawl. Load a list of targets in scrapy.cfg and off it goes. My URLs are loaded by users and only those pages are crawled, these urls change constantly and that's why I would rather feed them through a queue. — user1135662, Feb 20 '16 at 23:23

score 0 · Accepted Answer · answered Mar 08 '16 at 03:20

Well, I decided to abandon Scrapy. I looked around a lot and everyone kept pointing to the limitation of the twisted reactor. Rather than fighting the framework, I decided to build my own scraper and it was very successful for what I needed. I am able to spin up multiple gearman workers and use the scraper I built to scrape the data concurrently in a server farm.

If anyone is interested, I started with this simple article to build the scraper. I use gearman client to query the DB and send multiple urls to a worker, the worker scrapes the URLs and does an update query back to the DB. Success!! :)

http://docs.python-guide.org/en/latest/scenarios/scrape/

Scrapy and Gearman

1 Answers1