3

Nowadays most databases support inserting multiple records in a single statement. That is much faster than inserting records one by one, because only one transaction is needed. The SQL syntax is similar to this:

INSERT INTO tbl_name (a,b,c)
VALUES(1,2,3), (4,5,6), (7,8,9);
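From Python, the usual way to issue such a multi-row insert is `executemany()` from the DB-API, which every common driver exposes. A minimal sketch using the built-in `sqlite3` module (table name `tbl_name` taken from the SQL above; swap in your own driver and placeholder style as needed):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl_name (a INTEGER, b INTEGER, c INTEGER)")

rows = [(1, 2, 3), (4, 5, 6), (7, 8, 9)]
# executemany() inserts the whole list within one transaction,
# matching the effect of the multi-row VALUES syntax above.
conn.executemany("INSERT INTO tbl_name (a, b, c) VALUES (?, ?, ?)", rows)
conn.commit()
```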

Right now I'm using Python Scrapy on a small project. I use its item pipeline to store scraped data in a database. However, the item pipeline works by calling the relevant method once per item, so it always inserts a single item at a time. How can I collect, say, 100 items and insert them in one run?

Just a learner

1 Answer

3

You need to refactor your pipeline into something like this:

class DatabasePipeline(object):

    def open_spider(self, spider):
        # create database connection
        ...
        # create items buffer
        self.items = []

    def process_item(self, item, spider):
        self.items.append(item)
        if len(self.items) >= 100:
            # construct SQL query to insert multiple records
            ...
            # execute query and clear self.items
            self.items = []
        return item

    def close_spider(self, spider):
        # insert remaining records
        if self.items:
            # construct SQL query to insert multiple records
            ...
            # execute query
        # close database connection
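To make the skeleton concrete, here is a runnable sketch of the same pipeline backed by `sqlite3` with `executemany()`. The table name `items`, its columns, and the in-memory database are assumptions for illustration; replace them with your own schema and driver:

```python
import sqlite3

class DatabasePipeline:
    batch_size = 100

    def open_spider(self, spider):
        # in-memory database for demonstration; use a real DSN in practice
        self.conn = sqlite3.connect(":memory:")
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS items (a INTEGER, b INTEGER, c INTEGER)")
        self.items = []

    def process_item(self, item, spider):
        self.items.append(item)
        if len(self.items) >= self.batch_size:
            self._flush()
        return item

    def close_spider(self, spider):
        if self.items:
            self._flush()
        self.conn.close()

    def _flush(self):
        # insert the whole buffer in one transaction, then clear it
        self.conn.executemany(
            "INSERT INTO items (a, b, c) VALUES (?, ?, ?)",
            [(i["a"], i["b"], i["c"]) for i in self.items])
        self.conn.commit()
        self.items = []
```

Using `>=` rather than `==` for the buffer check, and flushing the remainder in `close_spider`, guards against items being silently left behind when the spider finishes mid-batch.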
Georgiy