I have a crawler in a Django project that crawls thousands of URLs. Crawling is performed every two hours, and it makes multiple requests per second, which can slow down the database.
This is the parse method from the spider:
def parse(self, response):
    httpstatus = response.status
    url_obj = response.request.meta['url_obj']
    xpath = url_obj.xpath
    elements = response.selector.xpath(xpath + '/text()').extract()
    # ... exception handling (this part derives url and price) ...
    Scan.objects.create(url=url, httpstatus=httpstatus,
                        price=price,
                        valid=True)
As you can see, I have to access the database after every request (tens of writes per second), but this database is used by the site's users too. Moreover, I can't let the frontend use these Scan objects before the whole scan is done.
My idea is to create some kind of intermediate/temporary storage for the newly created Scan objects and then, after the scan is done, move them all to the main database in one go.
How can I do that? Do you have any ideas?
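To illustrate, here is a rough sketch of what I'm imagining: buffer the parsed results in memory on the spider and bulk-insert them when the spider closes, using Scrapy's closed() hook and Django's bulk_create. The names PriceSpider, myapp, and self._pending are placeholders, I'm assuming url_obj has a url attribute, and the way price is derived here stands in for my real parsing and exception handling:

import scrapy
from myapp.models import Scan  # my Django model

class PriceSpider(scrapy.Spider):
    name = 'prices'

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._pending = []  # temporary in-memory storage for unsaved Scan rows

    def parse(self, response):
        httpstatus = response.status
        url_obj = response.request.meta['url_obj']
        elements = response.selector.xpath(url_obj.xpath + '/text()').extract()
        price = elements[0] if elements else None  # placeholder for my real parsing/exceptions
        # Instead of Scan.objects.create(...), keep an unsaved instance around:
        self._pending.append(Scan(url=url_obj.url, httpstatus=httpstatus,
                                  price=price, valid=True))

    def closed(self, reason):
        # Called once when the crawl finishes: a single bulk INSERT
        # instead of tens of single-row writes per second.
        Scan.objects.bulk_create(self._pending, batch_size=500)

The obvious downside is that the rows live only in process memory until the end, so a crash halfway through loses the whole batch. That's part of why I'm asking whether there is a more robust kind of temporary storage for this.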