
I have a dilemma about how to store all of my spiders. These spiders will be invoked from Apache NiFi via a command-line invocation, with items read from stdin. I also plan to have a subset of these spiders return single-item results using scrapyrt on a separate web server. I will need to create spiders across many different projects with different item models. They will all have similar settings (like using the same proxy).
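To make that setup concrete, here is a rough sketch of what I imagine the NiFi side doing (the spider name and project path are placeholders, and the `-o -:jsonlines` stdout output assumes a reasonably recent Scrapy):

import json
import subprocess

# Hypothetical sketch: emulate what NiFi's command invocation would do --
# run a spider from the command line and read one JSON item per line from its output.
proc = subprocess.Popen(
    ['scrapy', 'crawl', 'my_spider', '-o', '-:jsonlines'],  # spider name is a placeholder
    cwd='/path/to/scrapy/project',                          # project location is a placeholder
    stdout=subprocess.PIPE,
)
for line in proc.stdout:
    item = json.loads(line)  # each line is one scraped item
    print(item)              # hand off to the next processing step
proc.wait()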

My question is what is the best way to structure my scrapy project(s)?

  1. Place all spiders in the same repository. Provides an easy way to make base classes for Item loaders and Item pipelines.
  2. Group spiders for each project I am working on into separate repositories. This has the advantage of allowing Items to be the focal point of each project and keeping each project from getting too large. The downside is being unable to share common code, settings, spider monitors (spidermon), and base classes. This still feels the cleanest, even though there is some duplication.
  3. Package only the spiders I plan to use non-realtime in the NiFi repo and the realtime ones in another repo. This has the advantage of keeping the spiders with the projects that will actually use them, but it still convolutes which spiders are used with which projects.

It feels like the right answer is #2. Spiders related to a specific program should be in their own scrapy project, just as when you create a web service for project A, you don't throw all of your service endpoints for project B into the same service simply because that is where all your services live, even though some settings may be duplicated. Arguably, some of the shared code/classes could be shared through a separate package (see the sketch below).
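For example (just a sketch of that idea, with made-up names like `scrapy_common`), the shared pieces could live in a small installable package that each project repository depends on:

# scrapy_common/spiders.py -- sketch of a shared, pip-installable package (names made up)
import scrapy

class BaseSpider(scrapy.Spider):
    """Base class for behaviour every project's spiders share."""
    # settings common to every project, e.g. throttling or the shared proxy middleware
    custom_settings = {
        'DOWNLOAD_DELAY': 1,
    }

# project_a/project_a/spiders/products.py -- inside one of the per-project repositories
from scrapy_common.spiders import BaseSpider

class ProductSpider(BaseSpider):
    name = 'products'
    start_urls = ['https://example.com/products']

    def parse(self, response):
        yield {'name': response.css('h1::text').get()}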

What do you think? How are you all structuring your scrapy projects to maximize reusability? Where do you draw the line between the same project and a separate project? Is it based on your Item model or on the data source?

Adam
Lijo
  • Personally I use a different project for each website. That does mean I don't separate spiders by project, because a single website can have more than one spider for me. As for "Arguably some of the shared code/classes could be shared through a separate package": create a module directory at the top of the tree (or anywhere else that suits you) and add its path to `sys.path` in your `spiders.py` module so you can `import` your custom module of "shared code/classes". More generally, any pattern is valid if it limits the number of lines, improves readability, and limits bugs. – AvyWam Sep 09 '19 at 21:25
  • Thanks for the feedback. It seems like the primary distinction for creating separate projects is centered around the item model. If you have 10 websites that all pump out the same type of item, like products for purchase that will always have a name, price, and description, then basing it around the specific project you're working on, as well as that project's data model, seems to be a good approach. I sort of like the thought of keeping a common folder with shared classes on the path, but at that point it seems like one shared repository, like option one, would be preferable. – Lijo Sep 09 '19 at 23:28
  • Are there no other ideas? Any experts on here who can weigh in? I'm open to other approaches. – Lijo Sep 10 '19 at 22:02
  • I don't really agree when you say it's centered around the item model. Maybe tomorrow I will write an answer to show you different things. I don't know your level in programming, but the structure of your project (I don't mean the scrapy project, but the project in a general sense) depends on your skills above all. I do not structure my projects the way I did when I began, and I know that was because of a lack of skill. The more experienced you become, the more your structure will change. And the functional programming that Python offers is very powerful. Well, I'll do that as soon as possible. – AvyWam Sep 10 '19 at 22:27
  • Thanks @AvyWam that would be very helpful to know how you layout your projects. – Lijo Sep 12 '19 at 12:32

2 Answers


First of all, when I write a path like '/path', that's because I am an Ubuntu user; adapt it if you are a Windows user. That's just a matter of file system conventions.

Light example

Let's imagine you want to scrape 2 different websites or more. The first one is a retail website for swimsuits. The second is about weather. You want to scrape both because you want to observe the link between swimsuit prices and the weather, in order to anticipate the lowest price at which to purchase.

Note that in pipelines.py I use a Mongo collection, because that is what I use and I have not needed SQL so far. If you don't know Mongo, consider a collection to be the equivalent of a table in a relational database.

The scrapy project could look like the following:

spiderswebsites.py, where you can write as many spiders as you want.

import scrapy
from ..items import SwimItem, WeatherItem
#if you sometimes have trouble importing from the parent directory you can do
#import sys
#sys.path.append('/path/parentDirectory')

class SwimSpider(scrapy.Spider):
    name = "swimsuit"
    start_urls = ['https://www.swimsuit.com']

    def parse(self, response):
        price = response.xpath('//span[@class="price"]/text()').extract()
        model = response.xpath('//span[@class="model"]/text()').extract()
        ... # and so on
        item = SwimItem()  # the class needs to be called -> ()
        item['price'] = price
        item['model'] = model
        ... # and so on
        return item

class WeatherSpider(scrapy.Spider):
    name = "weather"
    start_urls = ['https://www.weather.com']

    def parse(self, response):
        temperature = response.xpath('//span[@class="temp"]/text()').extract()
        cloud = response.xpath('//span[@class="cloud_perc"]/text()').extract()
        ... # and so on
        item = WeatherItem()  # the class needs to be called -> ()
        item['temperature'] = temperature
        item['cloud'] = cloud
        ... # and so on
        return item

items.py, where you can define as many item patterns as you want.

import scrapy
class SwimItem(scrapy.Item):
    price = scrapy.Field()
    stock = scrapy.Field()
    ...
    model = scrapy.Field()

class WeatherItem(scrapy.Item):
    temperature = scrapy.Field()
    cloud = scrapy.Field()
    ...
    pressure = scrapy.Field()

pipelines.py, where I use Mongo

import pymongo
from .items import SwimItem, WeatherItem

class ScrapePipeline(object):

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod #this is a decorator, that's a powerful tool in Python
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGODB_URL'),
            mongo_db=crawler.settings.get('MONGODB_DB', 'default-test')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # route each item type to its own collection (check the item, not the spider)
        if isinstance(item, SwimItem):
            collection_name = 'swimwebsite'
        elif isinstance(item, WeatherItem):
            collection_name = 'weatherwebsite'
        else:
            return item
        self.db[collection_name].insert_one(dict(item))
        return item
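One thing to keep in mind: the pipeline above only runs if it is enabled in settings.py, and the two Mongo settings it reads need to be defined there. A minimal sketch, with placeholder values ('myproject' stands for your project name):

# settings.py (excerpt) -- 'myproject' and the values are placeholders
ITEM_PIPELINES = {
    'myproject.pipelines.ScrapePipeline': 300,
}
MONGODB_URL = 'mongodb://localhost:27017'
MONGODB_DB = 'scrape'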

So when you look at my example project, you can see that the project does not depend on a single item pattern at all, because you can use several kinds of items in the same project. In the pattern above, the advantage is that you keep the same configuration in settings.py if you need to. But do not forget you can customize the command for a spider. If you want a spider to run a little differently from the default settings, you can run, for instance, `scrapy crawl spider -s DOWNLOAD_DELAY=35` instead of the 25 you wrote in settings.py.
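As a variation on the `-s` flag, here is a minimal sketch (mine, not part of the example above) of the same override expressed in code through Scrapy's per-spider `custom_settings` attribute:

import scrapy

class WeatherSpider(scrapy.Spider):
    name = "weather"
    # overrides the project-wide settings.py value for this spider only
    custom_settings = {
        'DOWNLOAD_DELAY': 35,
    }
    start_urls = ['https://www.weather.com']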

Functional programming

Moreover, my example here is light. In reality you are rarely interested in the raw data; you need a lot of processing, which means a lot of lines. To increase the readability of your code, you can create functions in modules. But be careful with spaghetti code.

functions.py, a custom module

import re

def cloud_temp(response): #for WeatherSpider
    """returns a tuple containing the percentage of clouds and the temperature"""
    temperature = response.xpath('//span[@class="temp"]/text()').extract_first() #returns a str such as "12°C"
    cloud = response.xpath('//span[@class="cloud_perc"]/text()').extract_first() #returns a str such as "30%"
    #treatments, because you want to record integers
    temperature = int(re.search(r'[0-9]+', temperature).group()) #returns an int such as 12
    cloud = int(re.search(r'[0-9]+', cloud).group()) #returns an int such as 30
    return (cloud, temperature)

This gives, in spiders.py:

import scrapy
from ..items import SwimItem, WeatherItem
from functions import *
...
class WeatherSpider(scrapy.Spider):
    name = "weather"
    start_urls = ['https://www.weather.com']

    def parse(self, response):
        cloud, temperature = cloud_temp(response)  # this is shorter than the previous version
        ... # and so on
        item = WeatherItem()  # the class needs to be called -> ()
        item['temperature'] = temperature
        item['cloud'] = cloud
        ... # and so on
        return item

Besides, it gives a considerable improvement in debugging. Let's say I want to start a scrapy shell session.

$ scrapy shell 'https://www.weather.com'
...
#I check whether the directory containing my `functions.py` module is on sys.path.
>>> import sys
>>> sys.path #returns a list of paths
>>> #if the directory is not present
>>> sys.path.insert(0, '/path/directory')
>>> #now I can import my module in this session and test it in the shell, while I modify functions.py itself
>>> from functions import *
>>> cloud_temp(response) #check whether it returns what I want

This is more comfortable than copying and pasting a piece of code. And because Python is a great language for functional programming, you should benefit from it. This is why I told you, "More generally, any pattern is valid if it limits the number of lines, improves readability, and limits bugs." The more readable your code is, the more you limit bugs. The fewer lines you write (for instance by not copying and pasting the same treatment for different variables), the fewer bugs you introduce, because when you fix a function itself, you fix it for everything that depends on it.

So now, if you're not very comfortable with functional programming, I can understand making several projects for different item patterns. Do what you can with your current skills, improve them, and then improve your code over time.

AvyWam

Jakob, in the Google Group thread titled "Single Scrapy Project vs. Multiple Projects for Various Sources", recommended:

whether spiders should go into the same project is mainly determined by the type of data they scrape, and not by where the data comes from.

Say you are scraping user profiles from all your target sites, then you may have an item pipeline that cleans and validates user avatars, and one that exports them into your "avatars" database. It makes sense to put all spiders into the same project. After all, they all use the same pipelines because the data always has the same shape no matter where it was scraped from. On the other hand, if you are scraping questions from Stack Overflow, user profiles from Wikipedia, and issues from Github, and you validate/process/export all of these data types differently, it would make more sense to put the spiders into separate projects.

In other words, if your spiders have common dependencies (e.g. they share item definitions/pipelines/middlewares), they probably belong into the same project; if each of them has their own specific dependencies, they probably belong into separate projects.
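To make that rule concrete, here is a minimal sketch (my own, with hypothetical field and pipeline names): when every spider emits the same item shape, one shared pipeline can serve all of them, which is exactly what pulls those spiders into a single project.

import scrapy
from scrapy.exceptions import DropItem

# one shared item shape for every source site (fields are hypothetical)
class UserProfileItem(scrapy.Item):
    username = scrapy.Field()
    avatar_url = scrapy.Field()

# one pipeline works for every spider because the data always has the same shape
class AvatarValidationPipeline(object):
    def process_item(self, item, spider):
        if not item.get('avatar_url', '').startswith('https://'):
            raise DropItem('missing or insecure avatar url')
        return item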

Pablo Hoffman, one of the developers of Scrapy, responded in another thread, "Scrapy spider vs project", with:

...recommend to keep all spiders into the same project to improve code reusability (common code, helper functions, etc).

We've used prefixes on spider names at times, like film_spider1, film_spider2 actor_spider1, actor_spider2, etc. And sometimes we also write spiders that scrape multiple item types, as it makes more sense when there is a big overlap on the pages crawled.
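A minimal sketch of that second point (mine, with made-up selectors and items): a single spider yielding two item types because the crawled pages overlap.

import scrapy

# hypothetical item definitions for the two overlapping data types
class FilmItem(scrapy.Item):
    title = scrapy.Field()

class ActorItem(scrapy.Item):
    name = scrapy.Field()

class FilmAndActorSpider(scrapy.Spider):
    name = 'film_actor'
    start_urls = ['https://example.com/films']

    def parse(self, response):
        # the same page carries both kinds of data, so yield both item types
        for film in response.css('div.film'):
            yield FilmItem(title=film.css('h2::text').get())
            for actor in film.css('li.actor::text').getall():
                yield ActorItem(name=actor)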

Adam