
I have a crawler in a Django project which crawls thousands of URLs. Crawling is performed every two hours, and there are multiple requests per second, which can slow down the database.

This is a parse method from spider:

def parse(self, response):
    httpstatus = response.status
    url_obj = response.request.meta['url_obj']
    xpath = url_obj.xpath
    elements = response.selector.xpath(xpath + '/text()').extract()

    ... EXCEPTIONS ...

    Scan.objects.create(url=url, httpstatus=httpstatus,
                        price=price,
                        valid=True)

As you can see, I have to hit the database after every request (tens of times per second), but this database is used by the site's users too. Moreover, I can't use these Scan objects in the frontend until the whole scan is done.

My idea is to create some kind of intermediary/temporary storage for newly created Scan objects and then, after scanning is done, move them to the main database.

How can I do that? Do you have any ideas?


1 Answer


You could accumulate your Scan objects in a list and then bulk_create() them when ready to do so. This would drastically reduce the number of database hits.

scans = []

....

def parse(self, response):
    httpstatus = response.status
    url_obj = response.request.meta['url_obj']
    xpath = url_obj.xpath
    elements = response.selector.xpath(xpath + '/text()').extract()

    ... EXCEPTIONS ...

    scans.append(Scan(url=url, httpstatus=httpstatus,
                      price=price,
                      valid=True))

....

Scan.objects.bulk_create(scans)
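
For completeness, here is a minimal sketch of where that final bulk_create() call could live, assuming the crawler is a standard scrapy.Spider. The PriceSpider name, the myapp.models import path, and the placeholder extraction logic are all assumptions; the question elides the real url/price/exception handling:

    import scrapy
    # Assumes the Django ORM is already usable from the Scrapy process
    # (e.g. django.setup() has been called in the Scrapy project).
    from myapp.models import Scan  # hypothetical app path

    class PriceSpider(scrapy.Spider):  # hypothetical spider name
        name = 'prices'

        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)
            self.scans = []  # in-memory buffer of unsaved Scan instances

        def parse(self, response):
            httpstatus = response.status
            url_obj = response.request.meta['url_obj']
            elements = response.selector.xpath(url_obj.xpath + '/text()').extract()
            # Placeholder extraction; the question elides the real
            # url/price/exception handling.
            price = elements[0] if elements else None
            self.scans.append(Scan(url=response.url, httpstatus=httpstatus,
                                   price=price, valid=bool(elements)))

        def closed(self, reason):
            # Runs once when the crawl finishes: a single bulk INSERT
            # instead of one query per crawled page.
            Scan.objects.bulk_create(self.scans)

Buffering in memory also addresses the second concern: nothing reaches the main table until the crawl ends, so the frontend never sees a half-finished scan. For very large crawls, bulk_create() accepts a batch_size argument to split the insert into smaller chunks.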