I implemented the following scenario with the Python Scrapy framework:
import scrapy


class MyCustomSpider(scrapy.Spider):
    def __init__(self, name=None, **kwargs):
        super().__init__(name, **kwargs)
        # Spider argument passed on the command line, e.g. -a days=7
        self.days = getattr(self, 'days', None)

    def start_requests(self):
        start_url = f'https://some.url?days={self.days}&format=json'
        yield scrapy.Request(url=start_url, callback=self.parse)

    def parse(self, response):
        json_data = response.json() if response and response.status == 200 else None
        if json_data:
            for entry in json_data['entries']:
                yield self.parse_json_entry(entry)
            if 'next' in json_data and json_data['next'] != "":
                # self.time is not set anywhere in this snippet
                yield response.follow(f"https://some.url?days={self.days}&time={self.time}&format=json", self.parse)

    def parse_json_entry(self, entry):
        ...
        item = loader.load_item()
        return item
I upsert the parsed items into a database in one of the pipelines. I would like to add the following functionality:
- before upserting an item, read its current state from the database
- if the item does not exist in the database, or it exists but has some fields empty, make a call to another website (the exact address is derived from the item's contents), scrape its contents, enrich the item with that additional data, and only then save the item to the database. I would like this call to go through the Scrapy framework as well, so that it benefits from the cache and other conveniences
- if the item exists in the database and has the relevant fields filled in, just update the item's status based on the freshly read data
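To make the branching in the list above concrete, here is a minimal sketch of the check that decides between the second and third case. The field names in `REQUIRED_FIELDS` and the helper name `needs_enrichment` are hypothetical, and the stored row is assumed to be available as a dict-like object (or `None` when the item is not in the database).

```python
from typing import Any, Mapping, Optional, Sequence

# Hypothetical field names; the real ones depend on the item schema.
REQUIRED_FIELDS = ("title", "description")


def needs_enrichment(
    stored: Optional[Mapping[str, Any]],
    required_fields: Sequence[str] = REQUIRED_FIELDS,
) -> bool:
    """Return True when the item is missing from the database or incomplete."""
    if stored is None:
        return True
    # None, "" and empty containers count as empty; 0 and False count as filled.
    return any(stored.get(field) in (None, "", [], {}) for field in required_fields)
```

A `True` result corresponds to the second point (fetch the other website first), a `False` result to the third (only update the status).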
How can I implement point 2 in a Scrapy-like way? Currently I make the call to the other website in one of the pipelines after scraping the item, but that way the request does not go through Scrapy. Is there a smart way of doing this (maybe with pipelines), or should I rather put all the code into one spider, with all the database reads/checks and callbacks there?
Best regards!