
I have a crawler in a Django project which crawls thousands of URLs. Crawling is performed every two hours, and there are multiple requests per second, which can slow down the database.

This is a parse method from spider:

def parse(self, response):
    httpstatus = response.status
    url_obj = response.request.meta['url_obj']
    xpath = url_obj.xpath
    elements = response.selector.xpath(xpath + '/text()').extract()

    ... EXCEPTIONS ...

    Scan.objects.create(url=url, httpstatus=httpstatus,
                        price=price,
                        valid=True)

As you can see, I have to hit the database after every request (tens of times per second), but this database is used by the site's users too. Moreover, I can't use these Scan objects in the frontend until the whole scan is done.

My idea is to create some kind of intermediary/temporary storage for newly created Scan objects and then, after scanning is done, move them to the main database.

How can I do that? Do you have any ideas?


1 Answer


You could accumulate your Scan objects in a list and then bulk_create() them when ready to do so. This would drastically reduce the number of database hits.

scans = []

....

def parse(self, response):
    httpstatus = response.status
    url_obj = response.request.meta['url_obj']
    xpath = url_obj.xpath
    elements = response.selector.xpath(xpath + '/text()').extract()

    ... EXCEPTIONS ...

    scans.append(Scan(url=url, httpstatus=httpstatus,
                      price=price,
                      valid=True))

....

Scan.objects.bulk_create(scans)
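
For completeness, here is a minimal sketch of where that final bulk_create() call could live, assuming the crawler is a standard scrapy.Spider. The PriceSpider name, the myapp.models import path, and the placeholder extraction logic are all assumptions; the question elides the real url/price/exception handling:

    import scrapy
    # Assumes the Django ORM is already usable from the Scrapy process
    # (e.g. django.setup() has been called in the Scrapy project).
    from myapp.models import Scan  # hypothetical app path

    class PriceSpider(scrapy.Spider):  # hypothetical spider name
        name = 'prices'

        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)
            self.scans = []  # in-memory buffer of unsaved Scan instances

        def parse(self, response):
            httpstatus = response.status
            url_obj = response.request.meta['url_obj']
            elements = response.selector.xpath(url_obj.xpath + '/text()').extract()
            # Placeholder extraction; the question elides the real
            # url/price/exception handling.
            price = elements[0] if elements else None
            self.scans.append(Scan(url=response.url, httpstatus=httpstatus,
                                   price=price, valid=bool(elements)))

        def closed(self, reason):
            # Runs once when the crawl finishes: a single bulk INSERT
            # instead of one query per crawled page.
            Scan.objects.bulk_create(self.scans)

Buffering in memory also addresses the second concern: nothing reaches the main table until the crawl ends, so the frontend never sees a half-finished scan. For very large crawls, bulk_create() accepts a batch_size argument to split the insert into smaller chunks.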