First of all, when I write a path like '/path', that's because I am an Ubuntu user. Adapt it if you are a Windows user; it is only a matter of filesystem conventions.
Light example
Let's imagine you want to scrape two or more different websites.
The first one is a retail website for swimsuits. The second is about the weather. You want to scrape both because you want to observe the link between swimsuit prices and the weather, in order to anticipate the best moment to buy at the lowest price.
Note, in pipelines.py:
I will use a MongoDB collection because that is what I use; I have not needed SQL so far. If you don't know Mongo, consider that a collection is the equivalent of a table in a relational database.
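For instance, with pymongo (assuming a MongoDB server running locally; the database and collection names are hypothetical):

import pymongo

client = pymongo.MongoClient('mongodb://localhost:27017')  # assumes a local MongoDB server
db = client['scrape-db']                # a database groups collections, like a schema
collection = db['swimwebsite']          # a collection is the equivalent of a table
collection.insert_one({'price': '10', 'model': 'X'})  # a document is the equivalent of a row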
The scrapy project could look like the following:
spiderswebsites.py, where you can write as many spiders as you want.
import scrapy
from ..items import SwimItem, WeatherItem  # import the module "items", not the file "items.py"

# if you have trouble importing from the parent directory you can do:
# import sys
# sys.path.append('/path/parentDirectory')

class SwimSpider(scrapy.Spider):
    name = "swimsuit"
    start_urls = ['https://www.swimsuit.com']

    def parse(self, response):
        price = response.xpath('//span[@class="price"]/text()').extract()
        model = response.xpath('//span[@class="model"]/text()').extract()
        ...  # and so on
        item = SwimItem()  # the class needs to be called -> ()
        item['price'] = price
        item['model'] = model
        ...  # and so on
        return item

class WeatherSpider(scrapy.Spider):
    name = "weather"
    start_urls = ['https://www.weather.com']

    def parse(self, response):
        temperature = response.xpath('//span[@class="temp"]/text()').extract()
        cloud = response.xpath('//span[@class="cloud_perc"]/text()').extract()
        ...  # and so on
        item = WeatherItem()  # the class needs to be called -> ()
        item['temperature'] = temperature
        item['cloud'] = cloud
        ...  # and so on
        return item
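Then each spider runs with its own name: scrapy crawl swimsuit or scrapy crawl weather. And if you want to launch both from a single script, Scrapy's CrawlerProcess can do it. A minimal sketch, where the package name scrape is an assumption (adapt it to your project name):

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from scrape.spiders.spiderswebsites import SwimSpider, WeatherSpider  # hypothetical package name

process = CrawlerProcess(get_project_settings())  # picks up your settings.py
process.crawl(SwimSpider)
process.crawl(WeatherSpider)
process.start()  # blocks until both crawls are finished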
items.py, where you can write as many item patterns as you want.
import scrapy

class SwimItem(scrapy.Item):
    price = scrapy.Field()
    stock = scrapy.Field()
    ...
    model = scrapy.Field()

class WeatherItem(scrapy.Item):
    temperature = scrapy.Field()
    cloud = scrapy.Field()
    ...
    pressure = scrapy.Field()
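An Item behaves like a dict restricted to its declared fields, which is what makes dict(item) possible in the pipeline below. For instance (the values are hypothetical):

item = SwimItem()
item['price'] = '10'
item['model'] = 'shorty'
dict(item)             # {'price': '10', 'model': 'shorty'}
item['color'] = 'red'  # raises KeyError: 'color' is not a declared Field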
pipelines.py, where I use Mongo.
import pymongo
from .items import SwimItem, WeatherItem

class ScrapePipeline(object):

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod  # this is a decorator, a powerful tool in Python
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGODB_URL'),
            mongo_db=crawler.settings.get('MONGODB_DB', 'default-test')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        if isinstance(item, SwimItem):  # test the item itself, not the spider
            collection_name = 'swimwebsite'
        elif isinstance(item, WeatherItem):
            collection_name = 'weatherwebsite'
        self.db[collection_name].insert_one(dict(item))  # insert() is deprecated in recent pymongo
        return item
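Two details matter here: the isinstance() test is done on the item and not on the spider, so one pipeline can route several kinds of items to their own collections whatever spider produced them; and process_item() has to return the item, otherwise any pipeline running after this one receives nothing.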
So when you look at my example project, you see that the project does not depend at all on a single item pattern, since you can use several kinds of items in the same project.
In the pattern above, the advantage is that you keep the same configuration in settings.py if you need. But do not forget that you can customize the command of your spider. If you want a spider to run a little differently from the default settings, you can override them on the command line: scrapy crawl spider -s DOWNLOAD_DELAY=35 instead of the 25 you wrote in settings.py, for instance.
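For reference, a minimal settings.py matching the pipeline above could look like this; the project name scrape is an assumption, adapt it to yours:

BOT_NAME = 'scrape'
SPIDER_MODULES = ['scrape.spiders']
NEWSPIDER_MODULE = 'scrape.spiders'

# enables ScrapePipeline for every spider of the project
ITEM_PIPELINES = {'scrape.pipelines.ScrapePipeline': 300}

# read by from_crawler() in the pipeline above
MONGODB_URL = 'mongodb://localhost:27017'
MONGODB_DB = 'scrape-db'

DOWNLOAD_DELAY = 25  # the value overridden by -s DOWNLOAD_DELAY=35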
Functional programming
Moreover, my example here is light. In reality you are rarely interested in the raw data; you need many treatments, which represent a lot of lines. To increase the readability of your code, you can create functions in modules. But be careful with spaghetti code.
functions.py, a customized module.
import re  # we call re.search below, so import the module itself

def cloud_temp(response):  # for WeatherSpider
    """Returns a tuple containing the percentage of clouds and the temperature."""
    temperature = response.xpath('//span[@class="temp"]/text()').extract_first()  # returns a str like " 12°C"
    cloud = response.xpath('//span[@class="cloud_perc"]/text()').extract_first()  # returns a str like "30%"
    # treatments: you want to record them as integers
    temperature = int(re.search(r'[0-9]+', temperature).group())  # returns an int like 12
    cloud = int(re.search(r'[0-9]+', cloud).group())  # returns an int like 30
    return (cloud, temperature)
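A nice side effect: such a function is testable outside of a crawl, because a Selector built from a plain HTML string exposes the same .xpath() as a response. A minimal sketch, assuming functions.py is importable:

from scrapy import Selector
from functions import cloud_temp

html = '<span class="temp"> 12°C</span><span class="cloud_perc">30%</span>'
sel = Selector(text=html)   # behaves like a response for .xpath()
print(cloud_temp(sel))      # (30, 12)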
which gives, in spiderswebsites.py:
import scrapy
from ..items import SwimItem, WeatherItem
from ..functions import *  # assuming functions.py sits next to items.py
...

class WeatherSpider(scrapy.Spider):
    name = "weather"
    start_urls = ['https://www.weather.com']

    def parse(self, response):
        cloud, temperature = cloud_temp(response)  # this is shorter than the previous version
        ...  # and so on
        item = WeatherItem()  # the class needs to be called -> ()
        item['temperature'] = temperature
        item['cloud'] = cloud
        ...  # and so on
        return item
Besides, it gives a considerable improvement in debugging. Let's say I want to start a Scrapy shell session.
$ scrapy shell 'https://www.weather.com'
...
>>> #I check in sys.path whether the directory containing my functions.py module is present
>>> import sys
>>> sys.path  # returns a list of paths
>>> #if the directory is not present
>>> sys.path.insert(0, '/path/directory')
>>> #now I can import my module in this session and test it in the shell, while I modify functions.py itself
>>> from functions import *
>>> cloud_temp(response)  # checking that it returns what I want
This is more comfortable than copying and pasting a piece of code. And because Python is a great language for functional programming, you should benefit from it. This is why I told you "More generally, any pattern is valid if you limit the number of lines, improve readability, and limit the bugs too." The more readable your code is, the more you limit the bugs. The fewer lines you write (for instance by avoiding copying and pasting the same treatment for different variables), the more you limit the bugs: when you correct a function itself, you correct it for everything that depends on it.
So now, if you're not very comfortable with functional programming, I can understand that you make several projects for different item patterns. Do what you can with your current skills and improve them, then improve your code with time.