
I have an existing script (main.py) that requires data to be scraped.

I started a scrapy project for retrieving this data. Now, is there any way main.py can retrieve the data from scrapy as an Item generator, rather than persisting data using the Item pipeline?

Something like this would be really convenient, but I couldn't find out how to do it, if it's feasible at all.

for item in scrapy.process():

I found a potential solution here: https://tryolabs.com/blog/2011/09/27/calling-scrapy-python-script/, which uses multithreading and queues.

Even though I understand this behaviour is not compatible with distributed crawling, which is what Scrapy is intended for, I'm still a little surprised that this feature isn't available for smaller projects.

bsuire
  • There's no way of doing it without some serious hacking which would also require your main.py to be asynchronous. Why not just crawl to a file `scrapy crawl myspider -o items.json` and then iterate through that file in your `main.py`? Or ideally just move the whole main.py logic to the spider itself? – Granitosaurus Sep 15 '16 at 10:07
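
For reference, the file-based approach suggested in this comment might look roughly like the following (a minimal sketch; the spider name, the output file name, and what you do with each item are placeholders):

import json
import subprocess

# run the spider and export its items to a JSON file
subprocess.run(['scrapy', 'crawl', 'myspider', '-o', 'items.json'], check=True)

# then iterate over the scraped items in main.py
with open('items.json') as f:
    for item in json.load(f):
        print(item)  # use the item here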

2 Answers


You could send JSON data out from the crawler and grab the results in the calling process. It can be done as follows.

Given a spider like this:

import json

import scrapy


class MySpider(scrapy.Spider):
    # some attributes
    accumulated = []

    def parse(self, response):
        # do your logic here
        page_text = response.xpath('//text()').extract()
        for text in page_text:
            if conditionsAreOk(text):
                self.accumulated.append(text)

    def closed(self, reason):
        # called when the crawl finishes; dump everything to stdout as JSON
        print(json.dumps(self.accumulated))

Write a runner.py script like:

import sys

from twisted.internet import reactor

from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings

from spiders import MySpider


def main(argv):

    url = argv[0]

    configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s', 'LOG_ENABLED': False})
    runner = CrawlerRunner(get_project_settings())

    d = runner.crawl(MySpider, url=url)

    # To run multiple spiders in the same process:
    #
    # runner.crawl('craw')
    # runner.crawl('craw2')
    # d = runner.join()

    d.addBoth(lambda _: reactor.stop())
    reactor.run()  # the script will block here until the crawling is finished


if __name__ == "__main__":
    main(sys.argv[1:])
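
With logging disabled as above, running for example `python runner.py http://example.com` (the URL is just an illustration) should print nothing but the JSON array emitted by the spider's closed() callback, which is what main.py below captures.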

And then call it from your main.py as:

import json
import subprocess
import sys
import time


def main(argv):

    # urlArray holds http:// or https:// urls
    for url in urlArray:
        p = subprocess.Popen(['python', 'runner.py', url],
                             stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        out, err = p.communicate()

        # do something with your data
        print(out)
        print(json.loads(out))

        # This just helps to watch the logs
        time.sleep(0.5)


if __name__ == "__main__":
    main(sys.argv[1:])

Note

This is not the best way of using Scrapy, as you know, but for quick results that do not require complex post-processing, this solution can provide what you need.

I hope it helps.

Evhz
  • thanks! I guess subprocess is one way to send data between the two modules without having to persist it. In that case, for stream-like behaviour I can move the JSON dumping to the parse function? (or a dedicated item pipeline). Because your solution as it is forces me to wait for the crawl to complete, rather than behaving like a generator. – bsuire Sep 15 '16 at 13:46
  • yes, for non-serial crawling you will need a [pipeline](http://doc.scrapy.org/en/latest/topics/item-pipeline.html?highlight=pipelines) – Evhz Sep 15 '16 at 16:27
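
To get the stream-like behaviour discussed in this comment thread, a rough sketch could emit one JSON line per item instead of dumping everything in closed(). This assumes the spider yields items rather than accumulating text; the class name here is only illustrative, and the pipeline still has to be enabled in the project's ITEM_PIPELINES setting:

import json
import sys


class JsonLinesStdoutPipeline(object):
    # emits one JSON object per line as soon as each item is scraped
    def process_item(self, item, spider):
        sys.stdout.write(json.dumps(dict(item)) + '\n')
        sys.stdout.flush()
        return item

main.py can then read the subprocess output as it arrives instead of waiting for communicate():

import json
import subprocess

# url comes from the loop in main.py above
p = subprocess.Popen(['python', 'runner.py', url], stdout=subprocess.PIPE)
for line in p.stdout:
    print(json.loads(line))  # handle each item as soon as it is scraped
p.wait()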

You can do it this way in a Twisted or Tornado app:

import collections

from twisted.internet.defer import Deferred
from scrapy.crawler import Crawler
from scrapy import signals


def scrape_items(crawler_runner, crawler_or_spidercls, *args, **kwargs):
    """
    Start a crawl and return an object (an ItemCursor instance)
    which allows you to retrieve scraped items and to wait for items
    to become available.

    Example:

    .. code-block:: python

        @inlineCallbacks
        def f():
            runner = CrawlerRunner()
            async_items = scrape_items(runner, my_spider)
            while (yield async_items.fetch_next):
                item = async_items.next_item()
                # ...
            # ...

    This convoluted way to write a loop should become unnecessary
    in Python 3.5 because of ``async for``.
    """
    # this requires scrapy >= 1.1rc1
    crawler = crawler_runner.create_crawler(crawler_or_spidercls)
    # for scrapy < 1.1rc1 the following code is needed:
    # crawler = crawler_or_spidercls
    # if not isinstance(crawler_or_spidercls, Crawler):
    #    crawler = crawler_runner._create_crawler(crawler_or_spidercls)

    d = crawler_runner.crawl(crawler, *args, **kwargs)
    return ItemCursor(d, crawler)


class ItemCursor(object):
    def __init__(self, crawl_d, crawler):
        self.crawl_d = crawl_d
        self.crawler = crawler

        crawler.signals.connect(self._on_item_scraped, signals.item_scraped)

        crawl_d.addCallback(self._on_finished)
        crawl_d.addErrback(self._on_error)

        self.closed = False
        self._items_available = Deferred()
        self._items = collections.deque()

    def _on_item_scraped(self, item):
        self._items.append(item)
        self._items_available.callback(True)
        self._items_available = Deferred()

    def _on_finished(self, result):
        self.closed = True
        self._items_available.callback(False)

    def _on_error(self, failure):
        self.closed = True
        self._items_available.errback(failure)

    @property
    def fetch_next(self):
        """
        A Deferred used with ``inlineCallbacks`` or ``gen.coroutine`` to
        asynchronously retrieve the next item, waiting for an item to be
        crawled if necessary. Resolves to ``False`` if the crawl is finished,
        otherwise :meth:`next_item` is guaranteed to return an item
        (a dict or a scrapy.Item instance).
        """
        if self.closed:
            # crawl is finished
            d = Deferred()
            d.callback(False)
            return d

        if self._items:
            # result is ready
            d = Deferred()
            d.callback(True)
            return d

        # We're active, but item is not ready yet. Return a Deferred which
        # resolves to True if item is scraped or to False if crawl is stopped.
        return self._items_available

    def next_item(self):
        """Get a document from the most recently fetched batch, or ``None``.
        See :attr:`fetch_next`.
        """
        if not self._items:
            return None
        return self._items.popleft()

The main idea is to listen to the item_scraped signal and wrap it in an object with a nicer API.

Note that you need an event loop in your main.py script for this to work; the example above works with twisted.internet.defer.inlineCallbacks or tornado.gen.coroutine.
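
A minimal sketch of such a main.py, assuming scrape_items and ItemCursor from above are importable (or live in the same file) and MySpider is your own spider class:

from twisted.internet import task
from twisted.internet.defer import inlineCallbacks

from scrapy.crawler import CrawlerRunner


@inlineCallbacks
def main(reactor):
    runner = CrawlerRunner()
    async_items = scrape_items(runner, MySpider)  # MySpider is assumed to be your spider
    while (yield async_items.fetch_next):
        item = async_items.next_item()
        print(item)  # or hand the item to the rest of your script


if __name__ == '__main__':
    # task.react starts the reactor, calls main(reactor) and stops it when done
    task.react(main)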

Mikhail Korobov