Scrapy item enriching from multiple websites

Question

I implemented the following scenario with the Python Scrapy framework:

import scrapy


class MyCustomSpider(scrapy.Spider):
    def __init__(self, name=None, **kwargs):
        super().__init__(name, **kwargs)
        self.days = getattr(self, 'days', None)

    def start_requests(self):
        start_url = f'https://some.url?days={self.days}&format=json'
        yield scrapy.Request(url=start_url, callback=self.parse)

    def parse(self, response):
        json_data = response.json() if response and response.status == 200 else None
        if json_data:
            for entry in json_data['entries']:
                yield self.parse_json_entry(entry)
        
            if 'next' in json_data and json_data['next'] != "":
                yield response.follow(f"https://some.url?days={self.days}&time={self.time}&format=json", self.parse)

    def parse_json_entry(self, entry):
        ...
        item = loader.load_item()
        return item

I upsert the parsed items into a database in one of the pipelines. I would like to add the following functionality:

1. Before upserting the item, I would like to read its current state from the database.
2. If the item does not exist in the database, or it exists but has some fields empty, I need to make a call to another website (the exact web address is determined from the item's contents), scrape its contents, enrich my item with this additional data, and only then save the item into the database. I would like this call to be handled by Scrapy as well, in order to have the cache and other conveniences.
3. If the item does exist in the database and has the appropriate fields filled in, then just update the item's status based on the newly read data.

How can I implement point 2 in a Scrapy-like way? Currently I make the call to the other website in one of the pipelines after scraping the item, but that way I do not use Scrapy for the request. Is there a smart way of doing this (maybe with pipelines), or should I rather put all the code into one spider, with all the database reads, checks, and callbacks there?

Best regards!

score 0 · Answer 1 · answered Jan 01 '23 at 11:18

I guess the best idea is to upsert the partial data in one spider/pipeline together with a flag stating that the item still needs adjustment. Then, in another spider, load the items that have the flag set and perform the additional readings.
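
A minimal sketch of the flag bookkeeping this relies on, assuming an SQLite table. The table name items, its columns, and the helper names init_db, upsert_partial, pending_ids and mark_enriched are all hypothetical and not taken from the question. The first spider's pipeline would call upsert_partial; the second spider could yield one scrapy.Request per id returned by pending_ids from its start_requests and call mark_enriched once the extra page has been parsed, so those requests go through Scrapy's cache and middlewares like any others.

```python
import sqlite3


def init_db(conn):
    # Hypothetical schema: new rows are flagged as needing enrichment by default.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS items ("
        "id TEXT PRIMARY KEY, "
        "status TEXT, "
        "details TEXT, "
        "needs_enrichment INTEGER NOT NULL DEFAULT 1)"
    )


def upsert_partial(conn, item_id, status):
    # First spider / pipeline: insert a new row (flag defaults to 1) or,
    # if the row already exists, refresh only its status and keep the flag.
    conn.execute(
        "INSERT INTO items (id, status) VALUES (?, ?) "
        "ON CONFLICT(id) DO UPDATE SET status = excluded.status",
        (item_id, status),
    )


def pending_ids(conn):
    # Second spider: ids of the items that still need the extra request.
    rows = conn.execute(
        "SELECT id FROM items WHERE needs_enrichment = 1 ORDER BY id"
    )
    return [row[0] for row in rows]


def mark_enriched(conn, item_id, details):
    # Second spider, after parsing the extra page: store the data, clear the flag.
    conn.execute(
        "UPDATE items SET details = ?, needs_enrichment = 0 WHERE id = ?",
        (details, item_id),
    )


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    init_db(conn)
    upsert_partial(conn, "a", "new")
    upsert_partial(conn, "b", "new")
    mark_enriched(conn, "a", "extra data")
    upsert_partial(conn, "a", "updated")  # status changes, flag stays cleared
    print(pending_ids(conn))  # only "b" is still pending
```

With this split the first spider never waits for the second website, and an item that exists but was never enriched keeps its flag until the second spider has processed it.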

answered Jan 01 '23 at 11:18

Gandalf
